Job Summary:
We are seeking a seasoned MLOps Engineer with 7+ years of experience to join our team. The ideal candidate will have a strong background in cloud infrastructure (AWS/Azure), CI/CD pipeline development, and system monitoring using Datadog. This role requires close collaboration with AI/ML teams to support and optimize Large Language Models (LLMs) in production. The candidate must also demonstrate strong proficiency in Infrastructure as Code (IaC), container orchestration, and automation scripting.
Key Responsibilities:
Cloud Infrastructure Management
- Design, implement, and maintain scalable, secure, and high-performance infrastructure on AWS and Azure.
- Manage AWS services such as ECS, S3, Lambda, VPC, API Gateway, and CloudFront.
CI/CD Development
- Develop, manage, and optimize CI/CD pipelines using GitHub Actions for automated testing, integration, and deployment.
System Monitoring & Incident Management
- Set up and manage Datadog for end-to-end infrastructure and application monitoring.
- Create dashboards, configure alerts, and lead root cause analysis for incidents to improve system reliability.
Operational Support for LLMs
- Collaborate with AI/ML teams to manage model deployment pipelines and optimize runtime performance of LLMs.
- Support real-time and batch inference systems and tune the operational aspects of ML model serving.
Infrastructure as Code
- Automate infrastructure provisioning using Terraform or AWS CloudFormation.
- Maintain version-controlled infrastructure and promote reusable IaC modules.
Container Orchestration
- Deploy and manage containers using Kubernetes or AWS ECS.
- Design scalable, fault-tolerant containerized environments for ML workflows.
Automation & Scripting
- Develop automation scripts in Python or Bash for system maintenance, data handling, and deployment orchestration.
Security & Compliance
- Implement security best practices across cloud environments, including access management, encryption, and compliance enforcement.
Documentation & Best Practices
- Maintain clear documentation of infrastructure architecture, CI/CD flows, and monitoring protocols.
- Provide technical mentorship and contribute to continuous process improvement.
Technology Watch
- Stay current with MLOps trends, tools, and industry practices to help evolve the DevOps/MLOps strategy.
Must-Have Skills:
- Strong hands-on experience with AWS services: ECS, S3, Lambda, VPC, API Gateway, CloudFront
- Expertise in GitHub Actions, Git, and CI/CD pipeline automation
- Proficiency in Datadog (dashboards, alerting, incident response)
- Proficiency in Python and Bash for automation and scripting
- Strong command of Terraform and IaC best practices
- Deep understanding of container orchestration with Kubernetes or AWS ECS
- Operational experience supporting LLMs or ML model deployments in production
- Sound knowledge of security best practices in cloud and DevOps environments
- Excellent verbal and written communication skills
Preferred Qualifications:
- AWS Certification (Solutions Architect, DevOps Engineer, or equivalent)
- Experience with Azure ML or similar MLOps platforms
- Exposure to ML model versioning tools (e.g., MLflow, DVC)
- Experience with API management and serverless architectures