Forbes Advisor
SRE Manager
Job description
WHAT YOU’LL DO:
- Lead and manage production and non-production support, ensuring high availability and system reliability.
- Drive SRE best practices, including incident management, root cause analysis, and continuous improvement.
- Assume ownership of major incidents and coordinate efforts to ensure quick resolution of impacting events.
- Collaborate with SRE team members on the design and development of observability practices such as dashboarding, logging, metrics, and tracing, in order to diagnose and troubleshoot issues proactively.
- Collaborate with SRE team members to define Service Level Objectives (SLOs) and Service Level Agreements (SLAs) for critical systems, and monitor and maintain the uptime of these systems in line with the defined SLOs and SLAs.
- Identify and remove blockers, escalate appropriately, and maintain continuous momentum in troubleshooting efforts.
- Ensure adherence to established incident management processes and protocols.
- Contribute to the improvement of incident response runbooks and documentation.
- Own internal and external communications during major incidents.
- Translate technical details into business-impact language (scope, severity, risk, ETA, confidence level).
- Maintain clear and continuous communication with stakeholders during incidents, providing timely updates.
- Ensure safe execution of mitigations, rollbacks, feature flags, and failovers.
- Lead post-incident review meetings with stakeholders to confirm event details and assign problem investigators.
- Track and report on incident metrics, identifying patterns and areas for systemic improvement.
- Support Change Managers and/or Problem Managers as required in the performance of their responsibilities.
WHAT YOU’VE DONE:
- Bachelor’s or Master’s degree and/or equivalent experience relevant to the functional area.
- 12+ years of experience in SRE / DevOps.
- 5+ years of working experience as a Site Reliability Engineer.
- Experience managing critical incidents in a 24/7 production environment.
- Experience with ServiceNow ITSM and on‑call incident coordination via PagerDuty / Zenduty (or comparable ITSM/on‑call tools).
Knowledge, Skills, Abilities & Behaviours
- Understanding of a wide breadth of technical concepts across SRE practices.
- Background in cloud-based systems and SRE practices is a must.
- Experience with at least one observability platform (e.g., New Relic, Datadog) preferred.
- Ability to use AI tools to synthesize communication, reports, and troubleshooting leads.
- Certification in AWS, ITIL, or related frameworks preferred.
- Experience in SaaS or technology product companies preferred.
- Strong leadership and decision-making skills under pressure.
- Excellent verbal and written communication skills for both technical and non-technical audiences.
- Ability to manage multiple priorities and deadlines in high-stakes situations.
- Strong analytical skills to drive root cause analysis and trend identification.
- Familiarity with modern monitoring and incident management tools.
- Demonstrated ability to build consensus across diverse teams.
- Effective at maintaining calm and focus during critical situations.
- Knowledge of cloud infrastructure (e.g., AWS, Azure) and application architecture.
- Proven track record of improving incident management processes.
- Attention to detail in documentation and follow-through.
- Adept at facilitating collaboration across remote and global teams.
- Proactive in identifying operational risks and implementing preventive measures.
- Committed to continuous learning and process improvement.
- Ethical, dependable, and resilient in challenging scenarios.
● Day off on the 3rd Friday of every month (one long weekend each month)
● Monthly Wellness Reimbursement Program to promote health and well-being
● Monthly Office Commutation Reimbursement Program
● Paid paternity and maternity leave


