About the Company
- We are partnering with a cutting-edge AI research lab that is building foundation models from the ground up, across large language models (LLMs), image/video generation, and robotics. This is a high-intensity, hands-on environment for top-tier engineers who want to build state-of-the-art machine learning infrastructure at scale.
About the Role
- We are seeking a Senior ML Infrastructure Engineer to design and build distributed training systems for large-scale AI models. This role is highly technical and requires deep expertise in ML infrastructure, distributed computing, and large-scale model training.
Responsibilities
- Architect and optimize distributed training infrastructure for massive-scale AI models.
- Set up and maintain multi-node, GPU-based training clusters (12+ nodes, 100+ GPUs).
- Debug and optimize ML training performance (NCCL, CUDA, PyTorch pipeline optimization).
- Implement and optimize data and model parallelism techniques (FSDP, DDP, DeepSpeed).
- Develop infrastructure for efficient data sharding, sampling, and pipeline execution.
- Build cluster performance monitoring and failure diagnostics (GKE/K8s, logging, and debugging tools).
- Work closely with research teams to ensure infrastructure meets the needs of frontier AI model development.
Required Skills
- Experience: 5+ years in ML infrastructure, ML systems engineering, or AI platform engineering.
- Background: Proven experience at top AI research labs or companies working on large-scale AI models (e.g., OpenAI, DeepMind, Meta AI, NVIDIA, Anthropic).
Preferred Skills
- Distributed Training: Multi-node training clusters, GPU compute optimization.
- ML Frameworks: PyTorch, PyTorch Lightning.
- Cluster Management: GKE/Kubernetes, cloud-based ML training setups.
- Parallelism Techniques: Data/model parallelism, FSDP, DDP, DeepSpeed.
- Debugging & Optimization: NCCL, CUDA, network optimizations for training stability.
- Mindset & Culture Fit: Highly driven and mission-focused; thrives in a high-intensity startup environment and is excited to build ML infrastructure for training models from scratch (not just fine-tuning existing ones).
Why Join?
- Work on cutting-edge AI research, building foundation models from scratch.
- Join a small, elite team solving some of the hardest ML infrastructure challenges.
- Have a direct impact on AI at scale, working alongside top researchers and engineers.
- Competitive compensation and meaningful equity in a fast-growing AI lab.
This is an urgent hire, and we are reviewing candidates immediately. If you are an ML infrastructure expert looking to work on groundbreaking AI research, apply now.