Ifm-us
HPC Engineer
Job description
Responsibilities
- Monitor health, performance, and availability of large-scale GPU clusters.
- Respond to incidents and perform first-level triage.
- Support researchers and troubleshoot job failures.
- Execute operational runbooks and recovery procedures.
- Validate cluster deployments, upgrades, and maintenance activities.
- Track infrastructure utilization and operational metrics.
- Develop automation and monitoring tools.
- Contribute to documentation and reporting.
Education
- Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, Information Technology, Electrical Engineering, Mathematics, Physics, or related disciplines.
Experience
- 2+ years in Linux systems administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations.
- Strong Linux troubleshooting skills.
- Experience with scripting using Python or Bash.
Preferred Qualifications
- Slurm.
- GPU infrastructure.
- AWS, Azure, or GCP.
- Grafana, Prometheus, Datadog, or similar tools.
- Containers and Kubernetes.
- AI/ML infrastructure exposure.
- Research computing environments.


