Ifm-us
Research Scientist, Agentic Data & Benchmarking
Company
Role
Research Scientist, Agentic Data & Benchmarking
Location
US
Job type
Full-time
Job description
Key responsibilities
Benchmarking & evaluation
• Design and run evaluations of agentic capabilities (multi-step reasoning, tool use, long-horizon planning, computer use, and safety properties), turning ambiguous notions of "intelligence" into defensible, reproducible metrics.
• Build and harden evaluation harnesses so benchmarks run reliably at scale against training checkpoints, with clear signal on regressions and model health.
• Run experiments characterizing how prompting, sampling, scaffolding, and environment design affect agentic performance on internal and public benchmarks.
• Diagnose anomalous eval results mid-training-run, determine whether the cause is the model, the data, the harness, or the infrastructure, and communicate the answer clearly.
Agentic data
• Source, generate, and curate high-quality agentic training data: trajectories, tool-use traces, and task datasets for new capabilities.
• Design and scale RL environments and reward signals, and measure their impact on model performance.
• Manage technical relationships with external data vendors and domain experts, evaluating data quality and iterating quickly on feedback.
• Develop QA frameworks that catch reward hacking, label noise, and contamination, keeping data and benchmark quality high.
Across both
• Contribute to technical reports, research publications, and open-source benchmarks and tooling.
• Partner with research and product teams to translate capability goals into measurable data and evaluation artifacts.
Qualifications
Academic qualifications
• BS, MS, or PhD (or equivalent experience) in Computer Science, Machine Learning, or a related field.
Minimum qualifications
• 2+ years of experience with a clear emphasis on evaluations and/or training-data curation for ML systems (related areas: LLM training/fine-tuning, RL, or distributed ML systems).
• Strong Python and PyTorch development experience.
• Demonstrated experience designing and deep-diving into evaluations, or curating and generating training datasets, ideally both.
• Hands-on experience using LLM agents in your personal or professional work.
• A habit of reading through raw data and trajectories to understand them and spot issues, and an instinct to distrust a metric until it's validated.
Preferred qualifications
• Experience with reinforcement learning, reward design, or RL environment construction for LLMs.
• Background in statistics and experimental design, with a feel for signal-to-noise, statistical power, and contamination in evaluations.
• Experience with large-scale dataset sourcing, curation, and processing, including working with external vendors or domain experts.
• Strong knowledge of the literature on agent evaluation, RL, LLM reasoning, and tool use.
• Experience building or operating data pipelines and evaluation infrastructure that are reliable at scale (e.g., PyTorch, Ray).
• Experience evaluating or generating data for software-engineering or computer-use agents.
• Contributions to published research, public benchmarks, and/or open-source ML software.
Representative projects
• Stand up a new agentic benchmark from scratch: define the task, build the dataset and scoring, validate against known signals, and ship a view that makes the result legible to researchers and leadership.
• Build an RL environment for a new high-value capability: design the reward, generate and QA the trajectory data, and measure the lift on model performance.
• Diagnose a mid-training regression: an eval suite returns anomalous numbers and you determine whether it's the model, the harness, the data, or the infrastructure.
• Partner with an external data vendor or domain expert to source high-quality trajectories, then build the QA framework that keeps reward hacking and contamination out.
• Take a flaky distributed eval pipeline and make it reliable, with better retries, better observability, and faster feedback to researchers.


