Tripactions

Manager, Site Reliability Engineering

Posted

14 hours ago

Salary

Not disclosed by employer

Job description

At Navan, we’re committed to creating the best experience for business travelers, ensuring that our systems are always reliable, scalable, and efficient. As we continue to grow, we’re looking for a Site Reliability Engineering (SRE) Manager to join our team at our headquarters in Palo Alto, California. In this role, you will lead a team of SREs, drive innovation in infrastructure design and automation, and ensure our systems run seamlessly at scale, serving thousands of travelers every day.

What You’ll Do

  • Lead & Mentor the SRE Team: Guide and develop a high-performing team of SREs, fostering a culture of collaboration, reliability, and continuous improvement.
  • Drive Infrastructure Reliability & Automation: Collaborate with Engineering and Product teams to design and implement scalable, fault-tolerant systems. Leverage IaC tools (e.g., Terraform, CloudFormation) and microservices architectures to automate and improve infrastructure.
  • Incident Management: Improve incident response processes, reduce MTTR, and proactively mitigate risks. Apply resiliency patterns to ensure systems are fault-tolerant and highly available.
  • Define & Measure SLOs: Develop service-level objectives (SLOs) and KPIs to track and improve system reliability, using tools like New Relic or Datadog for observability.
  • 24x7 Production Support: Ensure system availability in a 24x7 environment, applying expertise in AWS (e.g., ECS, Lambda, DynamoDB) and database management for optimal performance.
  • Optimize CI/CD Pipelines: Automate and streamline deployment workflows using tools like Jenkins or GitHub Actions to ensure faster and more reliable deployments.
  • Resource Management: Manage team resources, including capacity planning, hiring, and upskilling, to meet evolving business needs.

What We’re Looking For

  • 8+ years in Site Reliability Engineering, DevOps, or Infrastructure roles, with at least 3 years in a leadership position.
  • Proven ability to lead and mentor teams, fostering a culture of collaboration and reliability.
  • Hands-on experience with AWS cloud technologies, Infrastructure as Code (Terraform/CloudFormation), microservices architectures, deployment automation (Jenkins/GitHub Actions), and observability tools (New Relic/Datadog).
  • Strong background in designing scalable, fault-tolerant systems, improving incident response, and driving operational improvements.
  • Excellent interpersonal and communication skills, with the ability to work effectively across cross-functional teams.