What you will do

Reliability & Observability

Design and maintain monitoring, alerting, and dashboarding systems across cloud and edge environments.
Build visibility into system health through metrics, logs, traces, and performance analytics.
Define and manage SLIs, SLOs, and service reliability targets.
Develop proactive monitoring and anomaly detection capabilities to identify issues before they impact users.

Cloud Infrastructure & Platform Operations

Security & Access Management

Implement secure access controls and audit mechanisms across infrastructure environments.
Monitor for cybersecurity threats, unauthorized access attempts, and service disruptions.
Develop alerting and response procedures for security-related incidents.
Contribute to operational security best practices and governance initiatives.

Automation & Engineering Excellence

Automate repetitive operational tasks to reduce manual effort and improve reliability.
Build tooling and scripts to streamline infrastructure operations.
Support CI/CD workflows and deployment automation.
Promote documentation, operational standards, and continuous improvement.

Incident Response & Reliability Engineering

Cross-Functional Collaboration

Work closely with software, AI, machine learning, hardware, and product teams.
Ensure new services are production-ready with appropriate monitoring, security, and reliability measures.
Support the operational needs of both cloud-based and distributed edge computing environments.

What you will need

3+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or Production Operations.
Hands-on experience with AWS or other major cloud platforms.
Strong understanding of observability and monitoring tools such as Grafana, Prometheus, or similar platforms.
Solid Linux administration and troubleshooting skills.
Experience with Docker, Kubernetes, and containerized workloads.
Experience with Infrastructure as Code tools such as Terraform.
Proficiency in at least one scripting or programming language (Python, Bash, etc.).
Understanding of networking fundamentals and infrastructure security concepts.
Experience supporting production systems and participating in incident response.
Strong automation mindset and commitment to operational excellence.

Nice-to-haves

Experience operating large-scale edge computing or IoT deployments.
Familiarity with zero-trust access management platforms.
Experience in security operations, threat detection, or infrastructure security.
Exposure to AI infrastructure, LLM-based applications, or workflow automation platforms.
Knowledge of AIOps, anomaly detection, or intelligent monitoring solutions.
Familiarity with compliance and security frameworks such as ISO 27001.

Site Reliability Engineer (SRE)
