Company
Ifm-us
Role
Research Scientist - Vision Language Model
Location
US
Job type
Full-time
Job description
Position Summary
As a Research Scientist in the Vision Language Model (VLM) team, your role will be central to advancing state-of-the-art multimodal foundation models that integrate visual understanding, reasoning, and agentic capabilities. You will work on the research and development of large-scale VLM systems, spanning model architectures, data recipes for pre-training and post-training, and evaluation benchmarks. The role combines cutting-edge research with practical engineering, emphasizing large-scale data processing, filtering, and weighting pipelines, distributed training systems, and reinforcement learning algorithms and systems for multimodal reasoning and agent development.
Key Responsibilities
• Research and development of next-generation Vision Language Models across pre-training, instruction tuning, reasoning, and agents.
• Develop novel architectures and training methodologies for integrating visual understanding, language reasoning, and tool-use capabilities.
• Research efficient multimodal learning techniques, including data-efficient training, long-context modeling, model modularity, and inference optimization.
• Build and improve large-scale multimodal datasets, synthetic data generation pipelines, and evaluation benchmarks for VLM capabilities.
• Investigate multimodal reasoning, agentic behavior, OCR, grounding, document understanding, chart understanding, and visual question answering capabilities.
• Contribute to technical reports, research publications, and open-source software.
• Represent MBZUAI at research conferences and industry events, showcasing advancements in multimodal foundation models and large-scale AI systems.
• Mentor junior researchers and collaborate across teams to drive impactful research initiatives.
Academic Qualifications
PhD or equivalent research experience in Machine Learning, Computer Vision, Natural Language Processing, or Multimodal AI.


