ABOUT THE COMPANY

Pilots don't train with real passengers. Actors don't rehearse with real audiences. Yet, the most consequential decisions in society are often pushed straight to production.

Simile is changing that. We have built the first AI simulation of society, populated by generative agents based on real humans. Our research pioneered the field of AI-based simulation, proving it is possible to model human behavior with high accuracy. Today, we are developing a Foundation Model to predict human behavior in any situation, at any scale.

We are backed by $100M in funding led by Index Ventures, with participation from Hanabi, A*, Bain Capital Ventures, and AI visionaries including Andrej Karpathy, Fei-Fei Li, Adam D'Angelo, and Guillermo Rauch.

ABOUT THE ROLE

As a Member of Technical Staff, Model Evaluations at Simile, you will build the measurement systems that determine whether our simulations of human behavior are accurate, trustworthy, and useful enough to guide real-world decisions. You will help shape what Simile measures, the quality bars we defend, and how evaluation evidence guides model, product, and customer decisions.

Evaluation at Simile brings together model evals, statistics, behavioral science, research methodology, product quality, and human judgment. Our models simulate people, populations, markets, and groups, which means our evals must reason about distributions, noisy human ground truth, uncertainty, qualitative outputs, behavioral data, and customer decision-making. You will work with unusually rich data about human behavior, including surveys, long-form interviews, customer studies, qualitative research, and behavioral signals such as transactions, product interactions, and other real-world traces.

We are hiring across several forms of expertise. Some candidates may be deep in LLM evaluation, model training, and research engineering. Others may bring exceptional strength in statistics, behavioral science, survey methodology, human data, product evaluation, or experimentation. Across backgrounds, we are looking for people who can reason clearly, build quickly, use agentic coding tools fluently, and take hands-on ownership of ambiguous evaluation problems.

The core question for this role is simple: How do we know when a simulation of human behavior is good enough to trust?

IN THIS ROLE, YOU WILL

Build the measurement layer for behavioral simulation: Design evals, metrics, rubrics, datasets, dashboards, and workflows that measure whether Simile’s models are accurately predicting human behavior across customer use cases, populations, question types, and decision contexts.
Partner with modeling to improve models: Evaluate new model versions, diagnose regressions, identify priority areas for model-improvement cycles, and maintain stable eval suites that represent capabilities customers actually care about.
Contribute to product and applied evals: Build evals for qualitative responses, retrieval, survey generation, AI-generated research reports, customer-facing outputs, and other product surfaces where model quality directly shapes customer trust. Turn subjective quality concerns into concrete rubrics, labeled data, automated graders, release criteria, and model-improvement signals.
Make ground truth and uncertainty legible: Develop rigorous ways to compare simulated responses against human data, customer studies, Simile-collected ground truth, and behavioral datasets. Help the company reason about sampling error, uncertainty, calibration, margin of error, representativeness, and what “ground truth” means when human behavior is inherently noisy.
Automate evaluation workflows: Use modern agentic coding tools to rapidly build internal tools, inspect model outputs, create labeling workflows, validate evals, and turn fuzzy evaluation questions into working systems. We value people who can compress long, ambiguous projects into fast, useful prototypes without losing sight of rigor or reliability.
Help define the future of behavioral simulation evals: Prototype ways to evaluate behavioral predictions using diverse sources of data, including transaction or purchase behavior, product interactions, intervention response, first-party experiments, and eventually multi-agent group settings.

REQUIREMENTS

MUST HAVES

Evaluation Taste: You have strong intuition for what makes an eval meaningful, robust, and decision-relevant. You can explain what an eval measures, what it does not measure, how it can be gamed, and why it should or should not affect a model or product decision.
LLM and Model Fluency: You understand the basics of modern LLM training, post-training, model evaluation, and hill-climbing. You do not need to be a modeling specialist, but you can read model outputs, understand modeling team needs, and reason about whether a model change actually improved the thing we care about.
Statistical Judgment: You are comfortable reasoning about noisy data, uncertainty, sampling, distributions, calibration, confidence intervals, measurement validity, bias, variance, and the difference between an observed result and the underlying population quantity it estimates.
Technical and Agentic Execution: You can build internal tools, scripts, dashboards, labeling workflows, analyses, or automated eval pipelines quickly. You are comfortable working with data and automation tools such as Python, SQL, R, notebooks, LLM APIs, and agentic coding tools such as Codex, Claude Code, Cursor, or equivalent systems. You know how to move quickly while still validating outputs, catching errors, and planning for the long term.
Hands-On Ownership: You can independently drive a workstream while still doing the work yourself. You are willing to build the first version, inspect the data, debug the workflow, write the rubric, revise the metric, and keep going until the evaluation system is useful.

NICE TO HAVES

We do not expect one person to have all of these. We are hiring a team with complementary strengths.

Modeling / Model-Quality Dashboards: Experience building model evaluation dashboards, regression suites, release gates, benchmark sets, model comparison workflows, or systems that help ML teams decide where to focus and when to ship.
LLM-as-Judge and Human Data: Experience designing rubrics, automated graders, pairwise comparisons, expert review workflows, labeling interfaces, grader calibration, or human/model hybrid evaluation systems.
Survey Methodology and Statistics: Experience with sampling, weighting, margin of error, power analysis, uncertainty quantification, Bayesian modeling, causal inference, psychometrics, polling, or measurement theory.
Behavioral Simulation: Experience evaluating behavioral predictions beyond self-reported survey responses, such as transaction data, purchase behavior, mobility data, product interactions, or other passively collected behavioral signals.
Behavioral Economics / Experimentation: Experience designing RCTs, A/B tests, survey experiments, vignette studies, field experiments, behavioral games, or intervention studies.
Multi-Agent or Group Behavior: Interest or experience in modeling group conversation, deliberation, focus groups, juries, committees, polarization, collective decision-making, or social influence.

You might be a great fit if you have worked in LLM evals, applied ML research, data science, research engineering, human data, market research, UXR, polling, behavioral science, computational social science, or behavioral economics. You might also be a recent graduate or self-directed builder with unusually strong taste in evaluation, statistics, and AI tools.

You do not need to match every bullet. If you do not perfectly see yourself in this JD but believe you would be exceptional at building the measurement layer for behavioral simulation, we would love to hear from you.

COMPENSATION & BENEFITS

At Simile, we provide competitive compensation packages that include base salary, equity, and comprehensive benefits.

Salary Range: $200,000 – $400,000 USD
Note: Final offers are based on experience, specialized skills, interview performance, and relevant training.
Equity: Grants are available for eligible roles, subject to board approval.
Health & Wellness: Comprehensive medical, dental, and vision coverage.
Time Off: Flexible time off policies to support work-life balance.

OUR PROCESS

We prioritize thoughtful conversations and clear examples of past work. Our hiring journey is designed to help both sides align on fit, working style, and expectations.

Reapplication Policy: To ensure a fair and thorough evaluation for all applicants, Simile observes a 90-day waiting period before reconsidering candidates for the same role.

COMMITMENT TO DIVERSITY & INCLUSION

Equal Opportunity: Simile is an equal opportunity workplace. We welcome applicants of all backgrounds and identities, valuing an environment where everyone can contribute authentically.

Accommodations: If you require support or reasonable accommodations during the application process due to a disability, please let us know. We are happy to assist.
