About the Company - Join an AI research lab developing foundation models across multiple domains, including large language models (LLMs), image and video generation, and robotics. This hands-on environment is built for engineers eager to create next-generation machine learning infrastructure at scale.
About the Role - We are seeking a Senior Machine Learning Infrastructure Engineer to play a crucial role in designing and building distributed training systems for advanced AI models. This technical position demands deep expertise in ML infrastructure, distributed computing, and the intricacies of large-scale model training.
Responsibilities - Design and refine distributed training infrastructure for massive-scale AI models.
- Establish and manage multi-node, GPU-based training clusters (12+ nodes, 100+ GPUs).
- Troubleshoot and enhance ML training performance through NCCL, CUDA, and PyTorch optimizations.
- Implement and refine data and model parallelism strategies (FSDP, DDP, DeepSpeed).
- Develop efficient infrastructure for data sharding, sampling, and pipeline execution.
- Create and oversee cluster performance metrics and diagnostic tools for troubleshooting (GKE/Kubernetes, logging, and debugging tools).
- Collaborate closely with research teams to ensure the infrastructure keeps pace with frontier AI model development.
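For context on the data-sharding responsibility above, the core logic resembles the following minimal sketch. The helper `shard_indices` is hypothetical (not from any codebase mentioned here), illustrating the round-robin index assignment that samplers such as PyTorch's DistributedSampler perform so each worker trains on a disjoint slice of the dataset:

```python
def shard_indices(num_samples: int, world_size: int, rank: int) -> list[int]:
    """Round-robin shard: return the dataset indices owned by `rank`
    out of `world_size` distributed workers. Hypothetical helper for
    illustration; real samplers also handle shuffling and padding."""
    return list(range(rank, num_samples, world_size))

# Each rank sees a disjoint slice; together the slices cover the dataset.
shards = [shard_indices(10, world_size=4, rank=r) for r in range(4)]
print(shards)  # rank 0 of 4 gets [0, 4, 8]
```

In production this is typically handled by framework samplers rather than hand-rolled code, but the same invariant (disjoint, complete coverage across ranks) is what cluster-level sharding infrastructure must preserve.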
Required Skills - Experience: A minimum of 5 years in ML infrastructure, ML systems engineering, or AI platform engineering.
- Background: Demonstrated experience in leading AI research labs or organizations focusing on large-scale AI models (e.g., OpenAI, DeepMind, Meta AI, NVIDIA, Anthropic, etc.).
Preferred Skills - Expertise in Distributed Training: Experience with multi-node training clusters and maximizing GPU utilization.
- Familiarity with ML Frameworks: Proficient in PyTorch and PyTorch Lightning.
- Cluster Management Skills: Experience with GKE/Kubernetes and cloud-based ML training setups.
- Knowledge of Parallelism Techniques: Skilled in data/model parallelism, FSDP, DDP, DeepSpeed.
- Debugging & Optimization Expertise: Proficient in NCCL, CUDA, and network-level tuning to maintain training stability.
- Cultural Fit: A dynamic and mission-focused individual who thrives in a high-energy startup atmosphere and is eager to build ML infrastructure from the ground up.
Why Join? - Engage in cutting-edge AI research, building foundation models end to end.
- Become part of a small, expert team tackling challenging ML infrastructure problems.
- Make a tangible impact on scalable AI projects while working alongside top researchers and engineers.
- Enjoy competitive compensation and meaningful equity at a fast-growing AI lab.
This is an urgent hire, and we are reviewing candidates immediately. If you are an expert in ML Infrastructure and are eager to contribute to groundbreaking AI research, apply now.