Senior Machine Learning Infrastructure Architect

Fremont, California

Evolve Group, Inc.
About the Company - We are collaborating with an innovative AI research lab dedicated to building foundational models from the ground up, including large language models (LLMs), image/video generation, and robotics. This is a dynamic, hands-on environment ideal for top engineers eager to create state-of-the-art machine learning infrastructure at scale.

About the Role - We are looking for a Senior Machine Learning Infrastructure Architect to design and construct distributed training systems for large-scale AI models. This critical role demands extensive expertise in ML infrastructure, distributed computing, and the training of large-scale models.

Responsibilities
  • Architect and optimize distributed training infrastructure for massive-scale AI models.
  • Set up and maintain multi-node, GPU-based training clusters (12+ nodes, 100+ GPUs).
  • Debug and enhance ML training performance (NCCL, CUDA, PyTorch pipeline optimization).
  • Implement and improve data and model parallelism techniques (FSDP, DDP, DeepSpeed).
  • Develop infrastructure for efficient data sharding, sampling, and pipeline execution.
  • Build monitoring for cluster performance and diagnostics for failures (GKE/K8s, logging, and debugging tools).
  • Collaborate closely with research teams to ensure infrastructure aligns with the needs of innovative AI model development.
Required Skills
  • Experience: 5+ years in ML infrastructure, ML systems engineering, or AI platform engineering.
  • Background: Proven experience at leading AI research labs or companies working on large-scale AI models (e.g., OpenAI, DeepMind, Meta AI, NVIDIA, Anthropic).
Preferred Skills
  • Distributed Training: Expertise in multi-node training clusters and GPU compute optimization.
  • ML Frameworks: Proficiency with PyTorch and PyTorch Lightning.
  • Cluster Management: Experience with GKE/Kubernetes and cloud-based ML training setups.
  • Parallelism Techniques: Knowledge of data/model parallelism, FSDP, DDP, and DeepSpeed.
  • Debugging & Optimization: Skills in NCCL, CUDA, and network optimizations for training stability.
  • Mindset & Culture Fit: Highly driven and mission-oriented; able to thrive in a high-intensity startup environment; excited to build foundational ML infrastructure for training models from the ground up (not merely fine-tuning existing ones).
Why Join?
  • Engage in pioneering AI research, creating foundational models from scratch.
  • Become part of a small, elite team addressing some of the most challenging ML infrastructure issues.
  • Have a direct influence on AI at scale, working alongside top-tier researchers and engineers.
  • Enjoy competitive compensation and meaningful equity in a rapidly growing AI lab.
This is an urgent hire, and we are reviewing candidates immediately. If you are an ML Infrastructure expert eager to work on groundbreaking AI research, we encourage you to apply now.

Date Posted: 02 May 2025