We’re looking for a Site Reliability Engineer to join our DevOps team in Tallinn and take ownership of keeping the Reconeyez platform healthy and available. You’ll monitor our infrastructure, respond to incidents during on-call shifts, diagnose issues across the stack, and continuously improve our operational posture, while building and automating the tools and processes that keep our systems running reliably.

This is a hands-on role. You’ll spend your time in dashboards, terminals, and log files. When something breaks, you’re the one who finds out why and makes sure it doesn’t happen again.

What You’ll Do

Platform Reliability & Incident Response

Keep production running: monitor system health, respond to alerts, and resolve incidents before they impact customers
Participate in the on-call rotation with the DevOps team, taking responsibility for incident response and resolution during your shifts
Document runbooks and incident postmortems so the team learns from every outage
Collaborate with development teams to improve reliability, flag recurring issues, and advocate for operational improvements

Infrastructure & Automation

Build and set up new development tools and infrastructure; deploy updates and fixes
Work on ways to automate and improve development and release processes using Git-based workflows and PR-based operations
Manage containerized services running on Docker/Podman: deployments, restarts, resource management, and health checks
Configure and maintain network services including firewalls, load balancers, and VPNs
Contribute to infrastructure-as-code practices to make infrastructure changes auditable and repeatable

Observability & Monitoring

Manage and improve monitoring using Zabbix, Grafana, Prometheus, and Alertmanager: build dashboards, tune alerts, and reduce noise
Adopt and extend OpenTelemetry instrumentation across services for unified tracing, metrics, and logging
Analyze logs to identify root causes, spot patterns, and catch problems early
Monitor AI/ML inference endpoints and model-serving infrastructure: track latency, throughput, and model health alongside traditional service metrics
Monitor and flag infrastructure cost anomalies to support cloud spend awareness across the team

Databases & Security

Install, monitor, and maintain PostgreSQL: backups, recovery, performance tuning, and query troubleshooting
Ensure systems are safe and secure against cybersecurity threats, including container image scanning and supply chain security practices

Platform Engineering

Reduce cognitive load for development teams through tooling, automation, and self-service capabilities
Build internal tools and processes that help developers move faster without sacrificing reliability

Must have

Solid experience with Linux systems administration: comfortable in a terminal and able to navigate a production system under pressure
Hands-on experience with Docker and/or Podman for managing containerised services
Working knowledge of Grafana, Zabbix, and/or Prometheus for monitoring and alerting
Familiarity with OpenTelemetry as a modern observability standard
Experience with log analysis and troubleshooting: reading logs, correlating events, and tracing issues across services
Knowledge of systems and platform security, including secrets management and access control
Comfortable with Git-based workflows and PR-based infrastructure changes
Willingness to be on-call: you understand the responsibility and can respond effectively during off-hours
Calm under pressure: incidents happen; you stay focused, communicate clearly, and fix things methodically
Independent problem solver: when an alert fires at 2 AM, you can diagnose and act without someone guiding you
Strong communicator and team player, able to work closely with colleagues in Tallinn
Fluency in English and Estonian

Nice to Have

Experience with Elasticsearch or similar log aggregation and search platforms
Experience administering PostgreSQL: backups, performance tuning, and query troubleshooting
Familiarity with infrastructure-as-code tools such as Ansible, Salt, or Terraform
Networking fundamentals: DNS, firewalls, load balancers, and VPNs
Exposure to AI/ML infrastructure: model serving and inference endpoints
Experience with supply chain security practices: container image scanning and dependency auditing
Experience with incident management processes and tooling
Degree in Computer Science, Engineering, or a related field, or equivalent practical experience

Level

Mid-level (2–5+ years in operations, DevOps, or systems administration). We value reliability instincts and troubleshooting depth over breadth across cloud platforms.

Other Details

Reports to the DevOps team manager in Tallinn
New role with immediate start
Career progression opportunities as the company grows
Competitive salary, discussed and agreed based on qualifications and experience

Site Reliability Engineer - Defendec/Reconeyez