Job Summary

Synechron is seeking a Site Reliability Engineer (SRE) to improve the reliability, scalability, and performance of cloud-native systems. This role supports production operations through AWS infrastructure management, containerized workload operations, CI/CD enablement, observability, and incident response. The position contributes to business goals by improving availability, reducing operational risk, and supporting cost-efficient system performance.

Software Requirements

Required

AWS: strong hands-on experience with EC2, ECS/EKS, IAM, VPC, ALB/NLB, Route 53, S3, CloudWatch
Docker
Container orchestration using EKS/Kubernetes or ECS
CI/CD using GitHub Actions, Jenkins, or Azure DevOps
IaC using Terraform or CloudFormation
Observability tools: CloudWatch, Prometheus/Grafana, ELK/OpenSearch, X-Ray
Automation using Python and/or Bash
Linux system administration and troubleshooting
Networking knowledge covering DNS, TCP/IP, TLS, security groups, NACLs

Preferred

Experience with CloudFront, RDS, ElastiCache, Auto Scaling Groups (ASG)
Blue/green and canary deployment strategies
Artifact management and release approval workflows
Vulnerability scanning and secrets management tools

Overall Responsibilities

Define and maintain SLOs, SLIs, SLAs, and error budgets
Build and manage AWS infrastructure for scalable, highly available systems
Operate containerized services using Docker and ECS/EKS/Kubernetes
Implement and optimize CI/CD pipelines and deployment strategies
Establish observability through metrics, logs, and traces
Automate infrastructure and operations using IaC and scripting
Manage incident response, runbooks, root-cause analysis, and remediation
Drive performance tuning, capacity planning, and cost optimization
Implement security best practices across infrastructure and deployments
Partner with development teams to improve reliability by design

Technical Skills (By Category)

Programming Languages

Essential: Python, Bash
Preferred: Scripting for operational automation and diagnostics

Databases / Data Management

Essential: Operational familiarity with RDS and ElastiCache in production environments
Preferred: Performance tuning and availability planning for managed data services

Cloud Technologies

Essential: AWS including EC2, ECS/EKS, IAM, VPC, ALB/NLB, Route 53, S3, CloudWatch
Preferred: CloudFront, Auto Scaling Groups, advanced cost optimization practices

Frameworks and Libraries

Essential: Docker, Kubernetes/EKS or ECS
Preferred: Reliability patterns such as circuit breakers, retries, backoff, health checks

Development Tools and Methodologies

Essential: CI/CD, Terraform or CloudFormation, monitoring and alerting, incident response, Linux troubleshooting
Preferred: Blue/green and canary deployments, release engineering improvements

Security Protocols

Essential: Least-privilege IAM, SSL/TLS, secrets handling, vulnerability awareness
Preferred: Automated scanning, policy enforcement, and remediation workflows

Experience Requirements

7+ years of experience in SRE, DevOps, or Cloud Operations
Experience owning production infrastructure and reliability outcomes
Strong experience with AWS, Docker, orchestration, CI/CD, IaC, and incident response
Experience improving MTTR, availability, and operational efficiency
Equivalent experience in related production engineering roles will also be considered

Day-to-Day Activities

Maintain AWS environments and containerized services
Monitor system health, alerts, logs, and traces
Improve deployment pipelines and release reliability
Participate in incident response, troubleshooting, and postmortems
Update runbooks, dashboards, and automation scripts
Work with Dev, QA, and Security teams on resilience and operational readiness
Join standups, planning sessions, reviews, and reliability discussions

Qualifications

Required

Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field, or equivalent practical experience

Preferred

AWS, Kubernetes, Terraform, or cloud operations certifications
Ongoing learning in reliability engineering, security, and performance optimization

Professional Competencies

Strong analytical and problem-solving skills
Clear communication and effective documentation
Collaboration across engineering, QA, and security teams
Ability to prioritize operational work and planned improvements
Adaptability in production and incident-driven environments
Focus on reliability, efficiency, and continuous improvement

SYNECHRON’S DIVERSITY & INCLUSION STATEMENT

Diversity and inclusion are fundamental to our culture, and Synechron is proud to be an equal opportunity workplace and an affirmative action employer. Our Diversity, Equity, and Inclusion (DEI) initiative, ‘Same Difference’, is committed to fostering an inclusive culture that promotes equality, diversity, and an environment that is respectful to all. As a global company, we strongly believe that a diverse workforce helps build stronger, more successful businesses. We encourage applicants of every background, race, ethnicity, religion, age, marital status, gender, and sexual orientation, as well as applicants with disabilities, to apply. We empower our global workforce by offering flexible workplace arrangements, mentoring, internal mobility, learning and development programs, and more.

All employment decisions at Synechron are based on business needs, job requirements, and individual qualifications, without regard to the applicant’s gender, gender identity, sexual orientation, race, ethnicity, disability or veteran status, or any other characteristic protected by law.

Site Reliability Engineer (SRE) – AWS + Docker
