Manager, Software Engineering - NCCL
Company: NVIDIA
Location: China
Job type: Full time
Posted: 1 hour ago
Job description
We are the GPU Communications Libraries and Networking team at NVIDIA. We deliver communication libraries such as NCCL and NVSHMEM for Deep Learning (DL) and HPC. DL and HPC applications already have huge compute demands and run at scales of up to tens of thousands of GPUs. The GPUs are connected by high-speed interconnects (e.g., NVLink, PCIe) within a node and by high-speed networking (e.g., InfiniBand, Ethernet) across nodes. Communication performance between the GPUs has a direct impact on end-to-end application performance, and the stakes are even higher at huge scales! We are looking for a dynamic, technical leader for our China NCCL team. This is an outstanding opportunity to push the limits of the state of the art and deliver platforms the world has never seen before. Are you ready to contribute to the development of innovative technologies and help realize NVIDIA's vision?
What you will be doing:
Lead, mentor, and grow our China engineering team. Own end-to-end execution, spanning planning, prioritization, quality control, and performance.
Interact with customers and researchers to understand their use cases and requirements. Collaborate with engineering, program and product management, and partners to define the product roadmap.
Contribute to feature design and implementation.
Continuously review and identify improvement opportunities in established processes, infrastructure, and practices to ensure the teams are accomplishing work in the most efficient and transparent manner.
What we need to see:
10+ years of overall experience in the software industry, with 4+ years of management experience.
Bachelor's, Master's, or Ph.D. in CS, CE, EE, or a related technical field, or equivalent experience.
Specialization in systems software, communication runtimes, or high performance networking. Proven success in managing several complex initiatives or products through the full product life cycle.
Strong understanding of computer systems architecture, networking technologies (RDMA, RoCE, Ethernet, EFA, InfiniBand) and topologies, operating system principles (i.e., systems software fundamentals), HW-SW interactions, and performance analysis and optimization.
Hands-on C/C++ programming and debugging skills on Linux.
Experience balancing multiple projects with competing priorities. Flexibility to work and communicate effectively across different teams and time zones.
Ways to stand out from the crowd:
Active user or developer of NCCL!
Customer engagement experience in this space.
Experience with parallel programming models (MPI, SHMEM) and at least one communication runtime (MPI, NCCL, NVSHMEM, NIXL, OpenSHMEM, UCX, UCC).
Programming experience with CUDA, MPI, OpenMP, OpenACC, or pthreads.
Knowledge of HPC and ML/DL fundamentals. Experience with Deep Learning frameworks and inference engines such as PyTorch, TensorFlow, vLLM, SGLang, and TRT-LLM.


