XTB is a global company in the financial industry, focusing on online trading of financial instruments. We are the largest FinTech in Poland and a leader in Central and Eastern Europe, and our operations span several countries, including markets in Asia and South America. At XTB, we focus on the development of our employees, giving them opportunities to gain knowledge and skills in various fields and offering a number of training and development programs. If you are looking for challenges and want to gain valuable experience in an international business environment, XTB is the right place for you.

We are a certified Great Place to Work company.

We are seeking a Senior Site Reliability Engineer to define and drive the reliability of XTB systems at the scale of millions of clients. In this role, you will strengthen SRE practices and shape the resilience of our entire technology stack through high-impact observability, ensuring our systems remain robust and scalable.

Responsibilities

Observability Platform Engineering: Develop a standardized observability ecosystem. Implement a deliberate telemetry model, focused on structured events, distributed tracing, and intelligent sampling strategies, that provides deep, actionable insights into system behavior.
Reliability Enablement: Act as a strategic partner to product engineering teams, providing the platform, standards, and data they need to own service reliability. Use error budgets and alerting as the primary language for balancing feature velocity with stability.
Proactive Resilience & Protection: Enhance detection capabilities to identify issues before they impact the customer. Leverage early-warning systems and AI/ML for automated anomaly detection and intelligent data analysis to continuously verify and strengthen system resilience.
Operations & Tooling: Build internal automation and tooling that streamlines SRE workflows, automates routine operational tasks, and enhances efficiency across the technology stack.
Incident Management & On-Call Rotation: Participate in an on-call rotation to provide incident management, ensuring rapid incident resolution, effective communication, and post-incident analysis to drive continuous improvement.

Requirements

Professional Background: At least 5 years of professional experience in SRE, Infrastructure, or DevOps roles managing high-scale, distributed environments.
Technical Engineering: Advanced programming skills in Python, with a strong focus on building scalable automation, internal tooling, and robust scripts.
Cloud & Orchestration: Hands-on expertise in managing production-grade Kubernetes environments, using configuration management tools like Ansible, and designing resilient infrastructure architectures across Azure Kubernetes Service and on-prem environments.
Observability Engineering: Deep proficiency in building standardized telemetry ecosystems. You have mastered self-hosted open-source tools for observability data collection, storage, and visualization, such as Prometheus, Grafana, ELK Stack, Tempo, Thanos, Jaeger, and similar.
Operational & Soft Skills: Ability to drive incident management, conduct thorough post-incident analysis, and foster a culture of reliability and shared ownership.
AI & Automation: Ability to leverage AI/ML techniques for SRE tasks, such as AIOps, automated anomaly detection, log analysis, and optimizing reliability workflows.
Bonus Tech: While we prioritize open-source standards, experience with commercial observability and APM solutions (e.g., Datadog, Splunk, New Relic) or chaos engineering frameworks is highly valued.

What we offer

Real influence on the development of the company and the product.
Work in an experienced team that is happy to share its knowledge.
A clear vision of development thanks to regular feedback and clear career paths.
Regular team-building meetings.

Benefits

A training budget for courses and conferences that interest you.
An extra day off on your birthday.
An extra day off for parents.
Equipment tailored to your needs.
Private medical care and group insurance.
Access to an e-learning platform for learning English and a benefits platform.
Access to a wellbeing platform and the opportunity to take advantage of workshops and private therapy sessions.
The option to work remotely, from the office in Warsaw, or from a coworking space in your city.

