Alembic

Senior Network & Site Reliability Engineer

Company

Alembic

Role

Senior Network & Site Reliability Engineer

Job type

Full-time

Found on Mokaru

2 weeks ago

Salary

$210k - $240k/year

Job description

ABOUT US

Alembic is the pioneering Causal AI platform. We help the world's largest enterprises move past correlation to prove what actually drives business outcomes — the question marketing and growth teams have never been able to answer with confidence. Fortune 100 companies including Nvidia, Delta Air Lines, and Mars use Alembic to make multimillion-dollar decisions on trusted, causal evidence.

We're backed by a $145M Series B from WndrCo (founded by Jeffrey Katzenberg), Jensen Huang, Joe Montana, Prysm Capital, and Accenture. Our models run on our own NVIDIA DGX SuperPOD built on Grace Blackwell infrastructure — one of the fastest private supercomputers in the world. (We've melted GPUs getting here.)

ABOUT THE ROLE

We're building infrastructure that has to perform under real-world scale, reliability, and security demands — and we're looking for an engineer who wants to own the foundation it runs on. This isn't a traditional "keep the lights on" role.

You'll design and operate the global network and reliability layer behind one of the world's fastest private supercomputers — the fabric powering distributed compute, ML workloads, real-time analytics, and mission-critical enterprise systems. You'll work across networking, systems, automation, observability, and reliability engineering to scale a platform where performance genuinely matters, with real influence over architecture decisions.

It's a strong fit if you like solving deep infrastructure problems, building resilient systems, automating everything repetitive, and owning architecture rather than just maintaining it.

WHAT YOU'LL DO

  • Design and operate scalable, secure network architecture that meets high-security requirements and supports large-scale machine learning workloads.
  • Own network device configuration management end to end, ensuring consistency and reliability across the fleet.
  • Improve system and network reliability and performance through automation, observability, and proactive capacity planning.
  • Implement and manage complex network protocols and connectivity, including BGP, VPNs, WAN circuits, and external peering.
  • Build and maintain comprehensive monitoring, alerting, and incident response — SLOs, runbooks, and on-call rotations — and drive post-incident analysis and continuous improvement.
  • Ensure security, compliance, and operational readiness across our network and cloud infrastructure.
  • Partner across engineering and data science to drive a culture of performance and reliability.

WHAT WILL HELP YOU SUCCEED

  • 8+ years in network or infrastructure engineering, including 5+ years in datacenter operations and/or systems and network administration.
  • A strong background in network security, architecture, design, and operations.
  • Extensive hands-on experience with network devices (firewalls, switches, load balancers) and large-scale architectures and protocols — BGP, QoS, MPLS, and IPsec VPNs.
  • Experience designing and operating modern datacenter network fabrics (spine-leaf, EVPN/VXLAN, ECMP).
  • Network automation and IaC tooling (Ansible, Terraform, Nornir, or similar), plus IPAM/DCIM platforms (NetBox, Infoblox, or similar).
  • WAN engineering — carrier circuit provisioning and external network peering.
  • Familiarity with Kubernetes networking (CNI plugins, ingress, service networking, network policy) and strong operational experience with Linux-based production infrastructure.
  • Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
  • Solid scripting (Python, Bash) to debug complex network and system issues and automate solutions, plus excellent cross-functional communication.

ALSO HELPFUL

  • NVIDIA networking technologies — Cumulus Linux, InfiniBand, Spectrum-X, and BlueField DPUs (this is the fabric behind our SuperPOD).
  • Familiarity with data-intensive platforms (Spark, Airflow, Kafka) and storage network protocols (NFS, LustreFS, iSCSI).
  • Security practices for applications and infrastructure, and experience in high-compliance or SOC 2 environments.

THE ROLE IS RIGHT FOR YOU IF

  • You want to own mission-critical network and infrastructure end to end — from architecture to incident management — not just keep it running.
  • You'd rather build and automate than direct from a distance, and you want meaningful influence over how a high-performance platform scales.

WHY YOU MIGHT BE EXCITED ABOUT ALEMBIC

  • Hard problems with real impact: You'll own the network and reliability layer behind systems that influence multimillion-dollar decisions at Fortune 100 companies.
  • Cutting-edge technology: Operate our own NVIDIA DGX SuperPOD on Grace Blackwell — one of the fastest private supercomputers in the world — and run a fabric (InfiniBand, Spectrum-X, BlueField) almost no company has in-house.
  • Technical autonomy: Ownership over architecture decisions and the freedom to solve hard infrastructure problems your way.
  • Elite team: Join top engineers who thrive on hard problems and high-impact work.
  • Series B momentum, real ownership: Meaningful equity at a Series B company that's raised $145M, with proven product-market fit and Fortune 100 traction.

WHY YOU MIGHT NOT BE EXCITED

  • You only want to tell people what to build instead of building and automating alongside them.
  • You prefer companies with a fully built-out process for every detail.
  • You prefer static over dynamic — projects and priorities adapt as we grow. We have real paying customers and a playbook, and we still move at startup speed at Series B scale.