Technical Operations Lead
Company: Karsun
Job type: Full-time
Job description
Overview
Summary
This individual will lead technical operations for a cloud-native (AWS) data and AI platform supporting a federal program and will own reliability, observability, incident response, platform engineering, and data-product operationalization.
Responsibilities
What You'll Be Doing
- Serve as primary technical owner for platform availability, reliability, and operational runbook development for data pipelines, feature stores, model serving, and supporting infrastructure.
- Work closely with the SRE Lead to design and operationalize SRE practices (SLIs/SLOs/SLAs, error budgets, toil reduction) to transition teams from DevOps to SRE.
- In collaboration with the SRE Lead, build and maintain monitoring, alerting, and observability across data and AI stacks (ETL/ELT, data lakes/warehouses, model training & serving), including metrics, distributed tracing, and centralized logging.
- Lead incident management: on-call rotations, incident response, RCA, remediation tracking, and continuous improvement.
- In collaboration with the SRE Lead, automate operational workflows (deployments, scaling, recovery) using IaC (Terraform/CloudFormation) and CI/CD pipelines; reduce manual operational toil.
- Define and enforce runbooks, backup/restore, RTO/RPO, and disaster recovery for data and ML systems.
- Partner with data product owners, ML engineers, security, and compliance to ensure production readiness, appropriate access controls, and adherence to federal compliance requirements.
- Manage capacity planning, cost optimization, and performance tuning of AWS resources for data and ML workloads.
- Mentor and lead an ops/SRE team; set technical priorities and coordinate cross-functional platform changes.
- Maintain vendor and third-party integrations and coordinate upgrades/patching under federal change-control processes.
- Track and report reliability metrics and operational maturity improvements to stakeholders.
Qualifications and Education
Required Qualifications
- 10+ years of directly relevant IT work experience.
- 7+ years technical operations / platform / SRE experience supporting data-intensive systems; 3+ years in AWS production environments.
- Deep understanding of data products and product ownership: data lineage, stewardship, SLAs, and consumer contracts.
- Proven experience operating data platforms: Databricks, Airflow, S3, Kafka/Kinesis.
- Strong SRE practice knowledge: SLI/SLO design, incident response, runbooks, chaos/failure-mode testing.
- Hands-on experience with observability tooling (Prometheus, Datadog, OpenTelemetry) and log/tracing systems.
- Familiarity with IaC (Terraform or CloudFormation), CI/CD (GitHub Actions/Jenkins/ArgoCD), container orchestration (EKS/Kubernetes), and scripting (Python, Bash).
- Solid security and compliance experience for federal environments (RBAC, encryption, secrets management).
- Excellent written and verbal communication; ability to produce clear runbooks and RCA reports and to brief leadership.
Compensation
The proposed salary range for this role is $****** to $******* USD. The salary range provided is a good faith estimate representative of all experience levels. Karsun considers several factors when extending an offer, including, but not limited to, the role, function, and associated responsibilities, as well as a candidate’s work experience, location, education/training, and key skills.


