At Navan, we’re committed to creating the best experience for business travelers, ensuring that our systems are always reliable, scalable, and efficient. As we continue to grow, we’re looking for a Site Reliability Engineering (SRE) Manager to join our team at our headquarters in Palo Alto, California. In this role, you will lead a team of SREs, drive innovation in infrastructure design and automation, and ensure our systems run seamlessly at scale, serving thousands of travelers every day.

What You’ll Do

Lead & Mentor the SRE Team: Guide and develop a high-performing team of SREs, fostering a culture of collaboration, reliability, and continuous improvement.
Drive Infrastructure Reliability & Automation: Collaborate with Engineering and Product teams to design and implement scalable, fault-tolerant systems. Leverage IaC tools (e.g., Terraform, CloudFormation) and microservices architectures to automate and improve infrastructure.
Incident Management: Improve incident response processes, reduce MTTR, and proactively mitigate risks. Apply resiliency patterns to ensure systems are fault-tolerant and highly available.
Define & Measure SLOs: Develop service-level objectives (SLOs) and KPIs to track and improve system reliability, using tools like New Relic or Datadog for observability.
24x7 Production Support: Ensure system availability in a 24x7 environment, applying expertise in AWS (e.g., ECS, Lambda, DynamoDB) and database management for optimal performance.
Optimize CI/CD Pipelines: Automate and streamline deployment workflows using tools like Jenkins or GitHub Actions to ensure faster and more reliable deployments.
Resource Management: Manage team resources, including capacity planning, hiring, and upskilling, to meet evolving business needs.

What We’re Looking For

8+ years in Site Reliability Engineering, DevOps, or Infrastructure roles, with at least 3 years in a leadership position.
Proven ability to lead and mentor teams, fostering a culture of collaboration and reliability.
Hands-on experience with AWS cloud technologies, Infrastructure as Code (Terraform/CloudFormation), microservices architectures, deployment automation (Jenkins/GitHub Actions), and observability tools (New Relic/Datadog).
Strong background in designing scalable, fault-tolerant systems, improving incident response, and driving operational improvements.
Excellent interpersonal and communication skills, with the ability to work effectively across cross-functional teams.

Manager, Site Reliability Engineering
