rivian
Rivian Factory SRE Intern - AI, Cloud & Observability Tooling
Company
Role
Rivian Factory SRE Intern - AI, Cloud & Observability Tooling
Location
Job type
Full-time
Posted
9 hours ago
Salary
Job description
About Rivian
Rivian is on a mission to keep the world adventurous forever. This goes for the emissions-free Electric Adventure Vehicles we build, and the curious, courageous souls we seek to attract. As a company, we constantly challenge what’s possible, never simply accepting what has always been done. We reframe old problems, seek new solutions and operate comfortably in areas that are unknown. Our backgrounds are diverse, but our team shares a love of the outdoors and a desire to protect it for future generations.

Role Summary
Rivian’s Factory SRE team owns the reliability, scalability, and observability of critical manufacturing systems at our plants. We operate across hybrid/on‑prem and cloud platforms, applying modern SRE practices to keep production lines healthy and highly available. The team builds internal tooling, observability platforms, and automation that support factory applications and infrastructure, similar to the broader SRE and Observability teams that manage telemetry, LGTM/Grafana stacks, and reliability tooling for production environments.

As a Factory SRE Intern, you will help design and build AI‑assisted reliability tools, improve cloud‑native infrastructure, and enhance observability for factory systems. You’ll work with senior SREs who own telemetry platforms, SLOs, and internal tooling for metrics, logs, and traces across AWS/EKS and hybrid environments. This internship is hands‑on: you’ll write code, deploy to real environments, and help improve the developer and operations experience for teams that run our manufacturing systems.

Responsibilities
• Contribute to observability and platform tooling that ingests, stores, and visualizes telemetry (logs, metrics, traces) for factory systems, leveraging platforms similar to LGTM/Grafana and SRE observability stacks used elsewhere at Rivian.
• Help design and build AI‑ or ML‑assisted workflows for SRE use cases, such as:
  • Triage assistants that summarize incidents using telemetry.
  • Anomaly detection or pattern detection on logs and metrics.
  • Recommendation or “next‑step” helpers for on‑call playbooks.
• Implement and improve cloud‑native infrastructure components for SRE tooling (e.g., services running on Kubernetes/EKS, AWS‑backed data stores, and supporting CI/CD or GitOps workflows) modeled on the patterns from the Platform and Observability SRE teams.
• Build internal tools and CLIs that make it easier for engineers to:
  • Discover and debug services using standardized dashboards and alerts.
  • Self‑serve access to logs, traces, and metrics.
  • Safely roll out and validate changes against observability and reliability guardrails.
• Support incident analysis and tooling:
  • Use logs/metrics/traces to help root‑cause issues in lower‑risk environments.
  • Contribute to post‑incident automation (e.g., scripts, dashboards, bots) that prevent repeat issues or speed remediation.
• Partner with other platform and manufacturing engineering teams to integrate new services into our observability and tooling ecosystem, influenced by how SRE Observability and Platform teams collaborate company‑wide.

Qualifications
• Currently pursuing a Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Data Science, or a related technical field.
• Experience with at least one general‑purpose programming language (such as Python, Go, or TypeScript/JavaScript) and comfort writing production‑quality scripts or small services, aligning with languages commonly used for observability and SRE tooling at Rivian.
• Familiarity with Linux and basic shell tools.
• Exposure to cloud concepts (e.g., AWS, GCP, or Azure) and containerization (Docker or Kubernetes), even from coursework or personal projects, in line with how SRE and Platform teams operate cloud‑native stacks.
• Interest in observability practices (metrics, logs, traces, dashboards, alerting) and how they enable reliability, echoing the focus of Rivian’s Observability and Platform SRE groups.
• Curiosity about AI/ML applications in operations: using models or AI tooling to assist debugging, automation, or decision‑making.
• Strong written and verbal communication skills; ability to collaborate with engineers from different teams and disciplines, as emphasized in SRE roles across the organization.

Preferred Qualifications
• Hands‑on exposure to:
  • Any observability tools (e.g., Grafana, Prometheus, Loki, OpenTelemetry, Datadog, Splunk, or similar) as used in Rivian’s observability and platform stacks.
  • Building small web services or APIs (Flask/FastAPI, Node.js, Go, etc.).
  • Deploying apps to cloud or containerized environments (Kubernetes, Docker Compose).
• Coursework or project experience in machine learning or AI, especially:
  • Working with embeddings, vector search, anomaly detection, or applying LLMs to structured/unstructured data.
  • Building simple pipelines that process and analyze logs, metrics, or event streams, reflecting interest in telemetry and observability.
• Familiarity with Git‑based workflows and CI/CD, similar to the GitOps and automation patterns used by existing SRE and Platform teams.

Behavioral Expectations
• Demonstrate ownership, curiosity, and a learning mindset - ask questions, proactively seek feedback, and iterate.
• Collaborate effectively with Factory SRE, Platform, and Observability engineers; help build a culture of reliability, clear communication, and continuous improvement, aligned with broader SRE values at Rivian.
• Uphold Rivian’s Compass Values in daily work, especially around sustainability, safety, and supporting our manufacturing teams so they can build vehicles reliably and safely at scale.

What You’ll Learn
During this internship you’ll gain exposure to:
• How SRE is practiced in a factory and industrial systems context, not just web applications, similar in spirit to existing Factory Infrastructure & Systems SRE roles.
• Designing observability at scale, including:
  • Log/metric/trace pipelines and ingestion patterns.
  • Dashboards and SLO‑based alerting.
  • Telemetry cost and cardinality trade‑offs.
• Cloud‑native infrastructure concepts (Kubernetes/EKS, containers, CI/CD, Terraform‑style IaC patterns) as used by Rivian’s Platform and Observability SRE teams.
• Practical applications of AI in reliability engineering, such as:
  • Using LLMs on top of observability data.
  • Automating repetitive SRE workflows.

Pay Disclosure
We offer a competitive compensation package, with details to be discussed during the interview process.

Equal Opportunity
Rivian is an equal opportunity employer and complies with all applicable federal, state, and local fair employment practices laws. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, sex, sexual orientation, gender, gender expression, gender identity, genetic information or characteristics, physical or mental disability, marital/domestic partner status, age, military/veteran status, medical condition, or any other characteristic protected by law.

Rivian is committed to ensuring that our hiring process is accessible for persons with disabilities. If you have a disability or limitation, such as those covered by the Americans with Disabilities Act, that requires accommodations to assist you in the search and application process, please email us at candidateaccommodations@rivian.com.

Candidate Data Privacy
Rivian may collect, use and disclose your personal information or personal data (within the meaning of the applicable data protection laws) when you apply for employment and/or participate in our recruitment processes (“Candidate Personal Data”). This data includes contact, demographic, communications, educational, professional, employment, social media/website, network/device, recruiting system usage/interaction, security and preference information.
Rivian may use your Candidate Personal Data for the purposes of (i) tracking interactions with our recruiting system; (ii) carrying out, analyzing and improving our application and recruitment process, including assessing you and your application and conducting employment, background and reference checks; (iii) establishing an employment relationship or entering into an employment contract with you; (iv) complying with our legal, regulatory and corporate governance obligations; (v) recordkeeping; (vi) ensuring network and information security and preventing fraud; and (vii) as otherwise required or permitted by applicable law. Rivian may share your Candidate Personal Data with (i) internal personnel who have a need to know such information in order to perform their duties, including individuals on our People Team, Finance, Legal, and the team(s) with the position(s) for which you are applying; (ii) Rivian affiliates; and (iii) Rivian’s service providers, including providers of background checks, staffing services, and cloud services. Rivian may transfer or store internationally your Candidate Personal Data, including to or in the United States, Canada, the United Kingdom, and the European Union and in the cloud, and this data may be subject to the laws and accessible to the courts, law enforcement and national security authorities of such jurisdictions.

Please note that we are currently not accepting applications from third party application services.