ML Infrastructure Engineer

Sunnyvale, California

Evolve Group
Job Expired

About the Company

  • We are partnering with a cutting-edge AI research lab that is building foundation models from the ground up, across large language models (LLMs), image/video generation, and robotics. This is a high-intensity, hands-on environment for top-tier engineers who want to build state-of-the-art machine learning infrastructure at scale.

About the Role

  • We are seeking a Senior ML Infrastructure Engineer to design and build distributed training systems for large-scale AI models. This role is highly technical and requires deep expertise in ML infrastructure, distributed computing, and large-scale model training.

Responsibilities

  • Architect and optimize distributed training infrastructure for massive-scale AI models.
  • Set up and maintain multi-node, GPU-based training clusters (12+ nodes, 100+ GPUs).
  • Debug and optimize ML training performance (NCCL, CUDA, PyTorch pipeline optimization).
  • Implement and optimize data and model parallelism techniques (FSDP, DDP, DeepSpeed).
  • Develop infrastructure for efficient data sharding, sampling, and pipeline execution.
  • Monitor cluster performance and build failure diagnostics (GKE/K8s, logging, and debugging tools).
  • Work closely with research teams to ensure infrastructure meets the needs of frontier AI model development.

Required Skills

  • Experience: 5+ years in ML infrastructure, ML systems engineering, or AI platform engineering.
  • Background: Proven experience at top AI research labs or companies working on large-scale AI models (e.g., OpenAI, DeepMind, Meta AI, NVIDIA, Anthropic).

Preferred Skills

  • Distributed Training: Multi-node training clusters, GPU compute optimization.
  • ML Frameworks: PyTorch, PyTorch Lightning.
  • Cluster Management: GKE/Kubernetes, cloud-based ML training setups.
  • Parallelism Techniques: Data/model parallelism, FSDP, DDP, DeepSpeed.
  • Debugging & Optimization: NCCL, CUDA, network optimizations for training stability.
  • Mindset & Culture Fit: Highly driven, mission-focused, and able to thrive in a high-intensity startup environment. Excited to build ML infrastructure for training models from scratch (not just fine-tuning existing ones).

Why Join?

  • Work on cutting-edge AI research, building foundation models from scratch.
  • Join a small, elite team solving some of the hardest ML infrastructure challenges.
  • Have a direct impact on AI at scale, working alongside top researchers and engineers.
  • Competitive compensation and meaningful equity in a fast-growing AI lab.

This is an urgent hire, and we are reviewing candidates immediately. If you are an ML infrastructure expert looking to work on groundbreaking AI research, apply now.

Date Posted: 29 April 2025