Nvidia

Manager, Software Engineering - AIOps

Company

Nvidia

Role

Manager, Software Engineering - AIOps

Location

Israel

Job type

Full time

Posted

2 hours ago

Salary

Not disclosed by employer

Job description

NVIDIA is at the forefront of the AI revolution, and the AIOps department is critical to ensuring our AI-driven data centers operate with unmatched efficiency. We are looking for a visionary, hands-on Software Engineering Manager to lead a team building the next generation of AI-based monitoring and operation platforms.

This role focuses on leveraging AI Agents to automate, predict, and optimize data center performance at internet scale. If you are a resilient leader who excels in fast-paced environments and has a passion for autonomous system operations, we want you on our team.

What You’ll Be Doing:

  • Strategic Roadmap Development: Define software design and implementation roadmaps for AI-driven operations, ensuring data center availability, resiliency, and performance through autonomous agent-based monitoring.

  • Innovative AIOps Engineering: Lead the development of tools and proofs of concept focused on software-defined operations, utilizing AI agents to automate root cause analysis and proactive remediation.

  • Scalable Architecture: Build and scale monitoring applications that handle massive telemetry data from AI infrastructure across public, private, and hybrid cloud environments.

  • Agentic Frameworks: Oversee the integration of LLM-based agents into CI/CD and operational workflows to shift from reactive monitoring to predictive orchestration.

  • Team Leadership: Actively hire, mentor, and grow a high-performing engineering team, fostering a culture of technical excellence and creative problem-solving.

  • Customer Engagement: Directly contribute to internal and external customer engagements to align AIOps solutions with real-world data center challenges.

What We Need to See:

  • BS/MS degree in Computer Science or a related technical field (or equivalent experience).

  • 8+ years of overall software engineering experience, including at least 2 years in a management or technical lead role.

  • Domain Expertise: 3+ years of experience in system software engineering for large-scale production systems, with a strong background in Solution Design and Distributed Systems.

  • Cloud Native Mastery: Deep experience with Docker and Kubernetes orchestration, alongside PaaS or IaaS cloud platforms.

  • Programming Proficiency: Strong programming skills in Python (essential for AI/ML workflows) and Go.

  • Operational Intelligence: Extensive knowledge of CI/CD pipelines and automated software-defined operations.

  • Exceptional written and verbal communication skills to bridge the gap between complex AI logic and operational requirements.

Ways to Stand Out from the Crowd:

  • AI/ML Background: Experience building or deploying AI Agents (LangChain, AutoGPT) or using ML models for anomaly detection and predictive analytics.

  • Infrastructure Knowledge: Familiarity with Ethernet switching, networking protocols, or NVIDIA’s hardware stack (GPUs/DPUs).

  • Control Systems: Experience in developing autonomous systems or closed-loop feedback monitoring tools.

  • SaaS Background: Proven track record of managing and scaling cloud-based SaaS applications.
