Jobgether
AI DevOps & Reliability Engineer
Job description
Accountabilities
In this role, you will own the end-to-end delivery and reliability ecosystem, building platforms and practices that enable fast, safe, and scalable software delivery across engineering teams.
- Design, build, and evolve CI/CD pipelines, deployment automation, and release frameworks that enable continuous and on-demand production delivery
- Define and enforce engineering standards for progressive delivery, rollback strategies, quality gates, and deployment safety mechanisms
- Build and manage self-service environments (dev, staging, and ephemeral) that replicate production and accelerate development cycles
- Drive AI-augmented DevOps practices, including automated runbooks, intelligent alerting, and AI-assisted incident response workflows
- Champion Infrastructure as Code and GitOps practices to ensure scalable, repeatable, and secure infrastructure and deployments
- Own operational reliability practices including observability, incident response, SLO/SLI definition, and on-call readiness
- Partner directly with engineering teams in an embedded model to improve delivery maturity and operational excellence
- Track and improve engineering performance using DORA metrics and other reliability indicators
Requirements
The ideal candidate brings deep DevOps and platform engineering expertise, combined with strong hands-on experience in modern infrastructure and AI-enabled operations.
- 7+ years of experience in DevOps, platform engineering, SRE, or infrastructure-focused roles in high-scale environments
- Strong hands-on experience with Kubernetes and AWS in production systems
- Deep expertise in Infrastructure as Code tools such as Terraform and/or CloudFormation
- Proven experience designing and operating CI/CD pipelines with strong governance, automation, and quality controls
- Experience implementing GitOps workflows using tools such as Argo CD or Flux
- Hands-on experience operating high-scale systems including Kafka and distributed data infrastructure
- Strong software engineering and automation skills using Python, Bash, or similar languages
- Experience with observability tooling such as Prometheus, Grafana, PagerDuty, and related monitoring stacks
- Practical experience with incident management, on-call rotations, and reliability engineering best practices
- Demonstrated experience integrating AI tools or agentic workflows into DevOps or SRE processes
- Strong communication skills with the ability to influence, mentor, and collaborate across engineering teams
Benefits
- Competitive base salary with performance-based annual bonus
- Equity opportunities for eligible roles
- Fully remote work within Canada
- Comprehensive health, dental, and vision coverage
- Generous paid time off and flexible work arrangements
- Learning and development support, including courses and training programs
- Parental leave and family support benefits
- Opportunity to work on high-impact systems in a fast-scaling engineering environment
- Strong culture of ownership, autonomy, and technical excellence
How Jobgether works: We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.