Accountabilities: In this role, you will own the end-to-end delivery and reliability ecosystem, building platforms and practices that enable fast, safe, and scalable software delivery across engineering teams.

Design, build, and evolve CI/CD pipelines, deployment automation, and release frameworks that enable continuous and on-demand production delivery
Define and enforce engineering standards for progressive delivery, rollback strategies, quality gates, and deployment safety mechanisms
Build and manage self-service environments (dev, staging, and ephemeral) that replicate production and accelerate development cycles
Drive AI-augmented DevOps practices, including automated runbooks, intelligent alerting, and AI-assisted incident response workflows
Champion Infrastructure as Code and GitOps practices to ensure scalable, repeatable, and secure infrastructure and deployments
Own operational reliability practices including observability, incident response, SLO/SLI definition, and on-call readiness
Partner directly with engineering teams in an embedded model to improve delivery maturity and operational excellence
Track and improve engineering performance using DORA metrics and other reliability indicators

Requirements

The ideal candidate brings deep DevOps and platform engineering expertise, combined with strong hands-on experience in modern infrastructure and AI-enabled operations.

7+ years of experience in DevOps, platform engineering, SRE, or infrastructure-focused roles in high-scale environments
Strong hands-on experience with Kubernetes and AWS in production systems
Deep expertise in Infrastructure as Code tools such as Terraform and/or CloudFormation
Proven experience designing and operating CI/CD pipelines with strong governance, automation, and quality controls
Experience implementing GitOps workflows using tools such as Argo CD or Flux
Hands-on experience operating high-scale systems including Kafka and distributed data infrastructure
Strong software engineering and automation skills using Python, Bash, or similar languages
Experience with observability tooling such as Prometheus, Grafana, PagerDuty, and related monitoring stacks
Practical experience with incident management, on-call rotations, and reliability engineering best practices
Demonstrated experience integrating AI tools or agentic workflows into DevOps or SRE processes
Strong communication skills with the ability to influence, mentor, and collaborate across engineering teams

Benefits

Competitive base salary with performance-based annual bonus
Equity opportunities for eligible roles
Fully remote work within Canada
Comprehensive health, dental, and vision coverage
Generous paid time off and flexible work arrangements
Learning and development support, including courses and training programs
Parental leave and family support benefits
Opportunity to work on high-impact systems in a fast-scaling engineering environment
Strong culture of ownership, autonomy, and technical excellence

How Jobgether works: We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

AI DevOps & Reliability Engineer
