Job Title: MLOps Engineer (PyTorch)

Location: Singapore

Job Type: Full-time

About the Opportunity

Our client is seeking an MLOps Engineer with a strong background in systems programming and infrastructure engineering. This role is focused on owning and evolving the on-premise infrastructure that powers their advanced PyTorch-based training workloads.

This position is a strong fit for an engineer who is focused not only on model outcomes but also on the quality and robustness of the underlying systems. You will be responsible for building high-quality, maintainable training pipelines, solving low-level systems and networking challenges, and ensuring the training codebase is clean, scalable, and built to last.

Key Responsibilities

Architect, build, and maintain end-to-end training and inference pipelines using PyTorch.
Develop and maintain high-quality, robust tooling in both Python and C++ to support the entire model training lifecycle.
Take full ownership of the core training codebase, enforcing best practices for clarity, modularity, reproducibility, and performance.
Design and implement workflows for checkpointing, resuming jobs, model versioning, and experiment tracking.
Proactively optimize compute workloads for bare-metal environments, focusing on I/O bottlenecks, CPU/GPU utilization, and memory efficiency.
Troubleshoot and debug complex, low-level issues, including networking bottlenecks, distributed training errors (e.g., NCCL), and hardware faults.
Configure and manage all ML environments, including containers, package management, GPU drivers, and runtime configurations.
Monitor and debug large-scale training jobs running across multiple nodes and GPUs.

Required Qualifications (You Should Have)

Deep, expert-level knowledge of PyTorch, including DDP (DistributedDataParallel), mixed precision training, and TorchScript.
Advanced programming skills in both C++ and Python.
A solid background in computer science fundamentals (data structures, algorithms, concurrency, operating systems).
Hands-on experience debugging and tuning bare-metal servers, including Linux administration, kernel parameter tuning, and BIOS tuning.
A strong understanding of low-level networking (e.g., RoCE, InfiniBand), interconnects, and distributed training protocols like NCCL and MPI.
A proven track record of building reliable, reproducible pipelines for both model training and evaluation.
Experience with job schedulers (e.g., SLURM or custom runners) and cluster monitoring tools.

Preferred Qualifications (Nice-to-Have)

Experience with non-standard deployments, such as on-premise local clusters or edge devices (i.e., not public cloud).
Active contributions to PyTorch or other open-source ML/HPC tools.
Familiarity with Infrastructure-as-Code (IaC) tools like Ansible, Terraform, or Nix.
Experience building out a full logging, observability, and alerting stack for training workloads.

How to Apply

Interested candidates are invited to submit their resume, detailing their experience managing PyTorch workloads on bare-metal infrastructure.
