Key responsibilities

Benchmarking & evaluation

• Design and run evaluations of agentic capabilities (multi-step reasoning, tool use, long-horizon planning, computer use, and safety properties), turning ambiguous notions of "intelligence" into defensible, reproducible metrics.

• Build and harden evaluation harnesses so benchmarks run reliably at scale against training checkpoints, with clear signal on regressions and model health.

• Run experiments characterizing how prompting, sampling, scaffolding, and environment design affect agentic performance on internal and public benchmarks.

• Diagnose anomalous eval results mid-training-run, determine whether the cause is the model, the data, the harness, or the infrastructure, and communicate the answer clearly.

Agentic data

• Source, generate, and curate high-quality agentic training data: trajectories, tool-use traces, and task datasets for new capabilities.

• Design and scale RL environments and reward signals, and measure their impact on model performance.

• Manage technical relationships with external data vendors and domain experts, evaluating data quality and iterating quickly on feedback.

• Develop QA frameworks that catch reward hacking, label noise, and contamination, keeping data and benchmark quality high.

Across both

• Contribute to technical reports, research publications, and open-source benchmarks and tooling.

• Partner with research and product teams to translate capability goals into measurable data and evaluation artifacts.

Qualifications

Academic qualifications

• BS, MS, or PhD (or equivalent experience) in Computer Science, Machine Learning, or a related field.

Minimum qualifications

• 2+ years of experience with a clear emphasis on evaluations and/or training-data curation for ML systems (related areas: LLM training/fine-tuning, RL, or distributed ML systems).

• Strong Python and PyTorch development experience.

• Demonstrated experience designing and deep-diving into evaluations, or curating and generating training datasets (ideally both).

• Hands-on experience using LLM agents in your personal or professional work.

• A habit of reading through raw data and trajectories to understand them and spot issues, and an instinct to distrust a metric until it's validated.

Preferred qualifications

• Experience with reinforcement learning, reward design, or RL environment construction for LLMs.

• Background in statistics and experimental design, including a feel for signal-to-noise, statistical power, and contamination in evaluations.

• Experience with large-scale dataset sourcing, curation, and processing, including working with external vendors or domain experts.

• Strong knowledge of the literature on agent evaluation, RL, LLM reasoning, and tool use.

• Experience building or operating data pipelines and evaluation infrastructure that run reliably at scale (e.g., PyTorch, Ray).

• Experience evaluating or generating data for software-engineering or computer-use agents.

• Contributions to published research, public benchmarks, and/or open-source ML software.

Representative projects

• Stand up a new agentic benchmark from scratch: define the task, build the dataset and scoring, validate against known signals, and ship a view that makes the result legible to researchers and leadership.

• Build an RL environment for a new high-value capability: design the reward, generate and QA the trajectory data, and measure the lift on model performance.

• Diagnose a mid-training regression: an eval suite returns anomalous numbers and you determine whether it's the model, the harness, the data, or the infrastructure.

• Partner with an external data vendor or domain expert to source high-quality trajectories, then build the QA framework that keeps reward hacking and contamination out.

• Take a flaky distributed eval pipeline and make it reliable: better retries, better observability, faster feedback to researchers.

Research Scientist, Agentic Data & Benchmarking
