Rime
Machine Learning Scientist
Job description
MACHINE LEARNING SCIENTIST
Rime builds voice AI for enterprises running customer experiences at scale. Our text-to-speech models are purpose-built for high-volume conversational deployments, engineered for the pronunciation accuracy, latency, and deployment flexibility that production environments actually demand.
We started from a different premise than the rest of the field: voice AI isn't bottlenecked by model architecture. It's bottlenecked by data. So before we trained a single model, we built our own corpus: full-duplex, studio-quality conversational speech, recorded and annotated by PhD linguists. That's our moat. It's also why enterprises pick Rime when pilots need to convert into production.
We're backed by top-tier investors including Unusual Ventures, and we've built a team at the intersection of product, research, and craft. Building voice models is an art. We intend to master it.
ROLE OVERVIEW
We're hiring a Machine Learning Scientist to push the frontier of speech synthesis and speech understanding at Rime.
What You'll Own
- Design, train, and evaluate speech synthesis models, autoregressive and non-autoregressive.
- Drive research on full-duplex and half-duplex multi-modal architectures, including unified speech-to-speech (S2S) systems.
- Choose and iterate on speech representations: neural codecs, semantic tokens, mel features, continuous latents.
- Build rigorous evaluation, objective and perceptual. Hold the bar on quality and prosodic control.
- Collaborate with our linguists on TTS frontend behavior so modeling and frontend choices reinforce each other.
WHAT WE'RE LOOKING FOR
- Deep familiarity with the speech synthesis literature, contemporary and historical — Tacotron, FastSpeech, VITS, VALL-E, the codec-LM lineage. Opinions on what worked and why.
- Hands-on training with neural codecs (EnCodec, DAC, Mimi, etc.) and multiple representation choices.
- Experience with full- or half-duplex multi-modal modeling (Moshi, LLaMA-Omni, streaming S2S).
- Strong attention to detail on data quality. You notice when an annotation pipeline is silently degrading or when an eval set has leakage.
- Willing to roll up your sleeves on unglamorous data and training work — paired with the agency to build pipelines so the team isn't stuck doing it by hand.
- Working knowledge of TTS frontend (G2P, normalization, prosody) and experience working with linguists.
- Strong PyTorch fundamentals. Comfortable with training loops, distributed training, model internals.
- PhD or equivalent research experience in speech, audio, ML, or computational linguistics, or a track record that makes the credential irrelevant.
Nice to Have
- Multilingual TTS experience.
- Background in prosody or paralinguistics.
- Published work in speech, audio, or core ML venues.
- Experience taking research models to production: quantization, distillation, streaming inference.
WHY JOIN RIME
- Category-defining voice AI infrastructure, not incremental research deltas.
- Direct collaboration with founders, including a CEO with a Stanford computational linguistics PhD.
- Real impact on company trajectory.
- Meaningful equity upside.
- High ownership, high standards, low bureaucracy.
What We Offer
- Competitive base + meaningful early-stage equity
- Remote-friendly
- Visa sponsorship available
- Access to a proprietary, full-duplex, studio-quality conversational speech corpus
- Compute and tooling to do the work
- Direct influence on the future of voice AI
At Rime, we...
- Are outliers
- Cut through the hype to focus on the craft
- Move fast with agency and freedom
- Maintain a growth mindset, finding joy in the struggle
- Do the right thing, knowing that it will lead to making money
If that sounds like you too, you'll be a great fit for Rime!


