At Canva, our mission is to empower the world to design. To get cutting-edge research into the hands of millions of users faster, we're looking for a Machine Learning Engineer focused on research enablement and performance, turning promising experiments into stable, scalable, user-facing capabilities while making training and inference faster, cheaper, and more reliable.

About the role:

You'll be the bridge between research and production. Partnering closely with researchers, you'll ensure experimental code is production-ready, integrate models into our monorepo, build shared libraries and services, and create the tooling and processes that let multiple model variants ship safely and quickly. You'll also work across the training stack, profiling and tuning PyTorch workloads, improving GPU utilisation, and shaping how we use distributed training and storage to get the most out of our compute. Your work shortens the research-to-user loop, reduces duplication, and ensures our ML features are reliable, observable, and easy for other teams to adopt.

At the moment, this role is focused on:

Research-to-Production Pipeline: Hardening experimental models (containerisation, tests, CI/CD), making them deployable for real users.
Training Performance and GPU Efficiency: Profiling PyTorch training jobs, improving GPU utilisation, and applying techniques like mixed precision, efficient data loading, and distributed training strategies (FSDP, DDP, DeepSpeed) to reduce time and cost per experiment.
Library Development: Converting experiments into well-factored libraries with clear APIs, dependency hygiene, and versioning, so teams can import rather than copy-paste.
Developer Experience & Documentation: Creating templates, examples, and guidance; offering supportive, high-signal communication so others can adopt libraries confidently.
Reliability, Observability & Cost: Instrumenting services with metrics/logging/tracing, setting SLIs/SLOs, and optimising training and inference performance and spend.

Primary Responsibilities:

Productionise research models: refactor, test, containerise, and integrate them into the monorepo for scalable reuse.
Profile and optimise PyTorch training jobs, working with researchers to identify bottlenecks across compute, memory, I/O, and networking.
Improve distributed training setups (multi-GPU, multi-node) and help teams pick the right parallelism strategy for their workload.
Build and maintain inference services, SDKs, and shared libraries that standardise pre/post-processing and interfaces across variants.
Create multi-variant runners and rollout frameworks (feature flags, canaries, A/B testing, automated rollbacks).
Establish CI/CD workflows, artifact management, and reproducible builds for ML services and model assets.
Add robust observability (dashboards, alerts) and reliability practices (load tests, chaos/resiliency checks) across training and inference workloads.
Optimise inference (batching, caching, quantisation/compilation, hardware utilisation) to reduce latency and cost.
Work across the broader training stack, including Kubernetes orchestration, storage (e.g. Weka, Vast, Lustre), and data pipelines, to remove friction for research teams.
Partner with researchers and product engineers via code reviews, pair sessions, and clear documentation to accelerate adoption.
Drive good engineering hygiene in the research codebase: testing strategy, dependency management, and de-duplication across multiple model variants.

You're probably a match if you:

Have strong software engineering fundamentals and excellent Python skills; you're comfortable turning notebooks and prototypes into production-grade services.
Have shipped ML systems in production (containers, APIs, CI/CD), ideally within a monorepo environment.
Have hands-on experience optimising PyTorch training or inference, profiling workloads, and reasoning about GPU memory, compute, and throughput.
Are comfortable in containerised environments and understand Kubernetes concepts well enough to debug and improve ML workloads running on it.
Can read research code and refactor it into clean abstractions with stable, well-documented interfaces.
Understand service reliability and observability (metrics, tracing, logging) and how they apply to ML systems.
Think holistically about the stack, from storage and networking through to model code, and can hold a credible conversation with researchers, DevOps, and platform engineers alike.
Communicate clearly and empathetically, especially when guiding others to adopt libraries and best practices and mentoring engineers earlier in their ML journey.
Bring cloud experience (AWS a plus) without needing to be a deep specialist.

Nice to Have:

Familiarity with model-serving/optimisation tooling (e.g., ONNX, TorchScript, Triton, quantisation).
Experience writing or optimising CUDA kernels, or using compilation frameworks (torch.compile, Triton, TensorRT) to speed up models.
Experience with distributed training frameworks (FSDP, DDP, DeepSpeed, Megatron) at meaningful scale.
Familiarity with high-performance storage systems (Weka, Vast, Lustre) and the data loading patterns that make or break training throughput.
Experience with experimentation platforms (feature flags, A/B testing) and safe rollout strategies.
Background with multimodal/image generation stacks or LLM-adjacent tooling (not the core focus, but helpful).
Knowledge of MLOps practices (model registries, artifact stores, dependency/version management).

Impact you'll have:

You'll dramatically reduce the time it takes to move from a successful experiment to a reliable, observable feature in production. You'll eliminate copy-paste, unify interfaces, enable parallel variants, and build the shared foundations that let Canva ship ML innovation at scale. You'll also help our research teams get more out of every GPU hour, making training faster and inference cheaper as we scale up the work CORE is doing.

What's in it for you?

Achieving our crazy big goals motivates us to work hard - and we do - but you'll experience lots of moments of magic, connectivity and fun woven throughout life at Canva, too. We also offer a range of benefits to set you up for every success in and outside of work.

Here's a taste of what's on offer:

Equity packages - we want our success to be yours too
Inclusive parental leave policy that supports all parents & carers
An annual Vibe & Thrive allowance to support your wellbeing, social connection, office setup & more
Flexible leave options that empower you to be a force for good, take time to recharge, and support you personally

Check out lifeatcanva.com for more info.

Other stuff to know

We make hiring decisions based on your experience, skills and passion, as well as how you can enhance Canva and our culture. We see AI as a powerful amplifier of creativity and technology at Canva. We’re evolving how we assess AI skills in our Technology hiring experience - you’ll tackle interactive, real-time challenges that reflect the kind of work we do. In some interviews, you may also be asked to solve a problem using an AI tool to show how you approach challenges with tech by your side.

Please note that interviews are conducted virtually.

Senior Machine Learning Engineer - Research Optimisation
