Xdof
Member of Technical Staff, Vision / Language
Job type
Full-time
Job description
ABOUT XDOF
Frontier labs are racing to build general-purpose robots, and the bottleneck isn't compute. It's data. At XDOF, we're building the foundation behind the foundation models: the data collection systems, annotation pipelines, exabyte-scale data infrastructure, and software toolchain that enable our partners to push the field forward.
We're hiring a Research Engineer / Scientist to help lead technical efforts at the intersection of vision-language models and robot learning. You will build systems that turn raw egocentric and teleoperation video into high-signal training data for VLA models and, increasingly, contribute to the models themselves.
Beyond pipelines, you will drive research into what makes robot data useful: discovering new metadata (contact events, affordance labels, implicit reward signals, dynamics priors from video) that unlocks capabilities current approaches miss. You'll explore how structured annotations can improve cross-embodiment transfer, automatic curriculum generation, and world models that predict what actually matters for manipulation. The data layer isn't downstream of the research. It is the research.
WHAT YOU'LL DO
- Design and implement vision-language pipelines for egocentric and teleoperation video: structured captioning, temporal grounding, action-conditioned scene understanding, and semantic annotation at scale
- Develop and evaluate representations that bridge visual perception, language, and low-level robot action — spanning VLAs, video prediction, and world models
- Build and improve data curation systems that assess quality, diversity, and coverage of large-scale robot demonstration datasets
- Work hands-on with bimanual and high-DoF manipulation data, including real teleoperation footage and sim-generated rollouts
- Collaborate directly with partner labs to define data requirements and close the loop between data quality and downstream policy performance
- Stay current on the research frontier (VLAs, video foundation models, flow matching, DiT architectures, egocentric pretraining) and translate insights into production systems
REQUIRED
- MS or PhD in Computer Science, Robotics, Machine Learning, or a related field from a top-tier program
- 3–7 years of research or applied research experience (industry or academic) in one or more of: vision-language models, video understanding, robot learning, or generative modeling
- Deep fluency in PyTorch; working knowledge of large-scale training infrastructure (distributed training, mixed precision, large batch workflows)
- Published work or demonstrable impact in VLMs/VLAs, video representation learning, imitation learning, or a closely related area
- Strong engineering fundamentals — you can design clean systems, not just run experiments
BENEFITS
- Competitive compensation and equity
- Comprehensive health and wellness benefits
- Flexible work arrangements
- Collaborative and fast-paced work environment
- Opportunity to shape the future of robotics and AI alongside an ambitious, values-driven team
Level: Mid Level to Senior Research Scientist (L4–L5 equivalent)
Location: San Mateo
Note: Junior candidates will still be considered
If you’re excited to help build the infrastructure powering tomorrow’s intelligent machines, we’d love to hear from you!


