ABOUT XDOF

Frontier labs are racing to build general-purpose robots, and the bottleneck isn't compute. It's data. At XDOF, we're building the foundation behind the foundation models: the data collection systems, annotation pipelines, exabyte-scale data infrastructure, and software toolchain that enable our partners to push the field forward.

We're hiring a Research Engineer / Scientist to help lead technical efforts at the intersection of vision-language models and robot learning. You will build systems that turn raw egocentric and teleoperation video into high-signal training data for VLA models and, increasingly, contribute to the models themselves.

Beyond pipelines, you will drive research into what makes robot data useful: discovering new metadata (contact events, affordance labels, implicit reward signals, dynamics priors from video) that unlock capabilities current approaches miss. You'll explore how structured annotations can improve cross-embodiment transfer, automatic curriculum generation, and world models that predict what actually matters for manipulation. The data layer isn't downstream of the research. It is the research.

WHAT YOU'LL DO

Design and implement vision-language pipelines for egocentric and teleoperation video: structured captioning, temporal grounding, action-conditioned scene understanding, and semantic annotation at scale
Develop and evaluate representations that bridge visual perception, language, and low-level robot action — spanning VLAs, video prediction, and world models
Build and improve data curation systems that assess quality, diversity, and coverage of large-scale robot demonstration datasets
Work hands-on with bimanual and high-DoF manipulation data, including real teleoperation footage and sim-generated rollouts
Collaborate directly with partner labs to define data requirements and close the loop between data quality and downstream policy performance
Stay current on the research frontier (VLAs, video foundation models, flow matching, DiT architectures, egocentric pretraining) and translate insights into production systems

REQUIRED

MS or PhD in Computer Science, Robotics, Machine Learning, or a related field from a top-tier program
3–7 years of research or applied research experience (industry or academic) in one or more of: vision-language models, video understanding, robot learning, or generative modeling
Deep fluency in PyTorch; working knowledge of large-scale training infrastructure (distributed training, mixed precision, large batch workflows)
Published work or demonstrable impact in VLMs/VLAs, video representation learning, imitation learning, or a closely related area
Strong engineering fundamentals — you can design clean systems, not just run experiments

BENEFITS

Competitive compensation and equity
Comprehensive health and wellness benefits
Flexible work arrangements
Collaborative and fast-paced work environment
Opportunity to shape the future of robotics and AI alongside an ambitious, values-driven team

Level: Mid Level to Senior Research Scientist (L4–L5 equivalent)
Location: San Mateo

Note: Junior candidates will still be considered.

If you’re excited to help build the infrastructure powering tomorrow’s intelligent machines, we’d love to hear from you!

Member of Technical Staff, Vision / Language
