MLOps Engineer

Dallas, Texas

Ascentt

Job Summary:

We are seeking a seasoned MLOps Engineer with 7+ years of experience to join our team. The ideal candidate will have a strong background in cloud infrastructure (AWS/Azure), CI/CD pipeline development, and system monitoring using Datadog. This role requires close collaboration with AI/ML teams to support and optimize Large Language Models (LLMs) in production. The candidate must also demonstrate strong proficiency in Infrastructure as Code (IaC), container orchestration, and automation scripting.


Key Responsibilities:

Cloud Infrastructure Management

  • Design, implement, and maintain scalable, secure, and high-performance infrastructure on AWS and Azure.
  • Manage AWS services such as ECS, S3, Lambda, VPC, API Gateway, and CloudFront.

CI/CD Development

  • Develop, manage, and optimize CI/CD pipelines using GitHub Actions for automated testing, deployment, and integration.

System Monitoring & Incident Management

  • Set up and manage Datadog for end-to-end infrastructure and application monitoring.
  • Create dashboards, configure alerts, and lead root cause analysis for incidents to improve system reliability.

Operational Support for LLMs

  • Collaborate with AI/ML teams to manage model deployment pipelines and optimize runtime performance of LLMs.
  • Support real-time and batch inference systems and tune the operational aspects of ML model serving.

Infrastructure as Code

  • Automate infrastructure provisioning using Terraform or AWS CloudFormation.
  • Maintain version-controlled infrastructure and promote reusable IaC modules.

Container Orchestration

  • Deploy and manage containers using Kubernetes or AWS ECS.
  • Design scalable, fault-tolerant containerized environments for ML workflows.

Automation & Scripting

  • Develop automation scripts in Python or Bash for system maintenance, data handling, and deployment orchestration.

Security & Compliance

  • Implement security best practices across cloud environments, including access management, encryption, and compliance enforcement.

Documentation & Best Practices

  • Maintain clear documentation of infrastructure architecture, CI/CD flows, and monitoring protocols.
  • Provide technical mentorship and contribute to continuous process improvement.

Technology Watch

  • Stay current with MLOps trends, tools, and industry practices to help evolve the DevOps/MLOps strategy.

Must-Have Skills:

  • Strong hands-on experience with AWS services: ECS, S3, Lambda, VPC, API Gateway, CloudFront
  • Expertise in GitHub Actions, Git, and CI/CD pipeline automation
  • Proficiency in Datadog (dashboards, alerting, incident response)
  • Proficiency in Python and Bash for automation and scripting
  • Strong command of Terraform and IaC best practices
  • Deep understanding of container orchestration with Kubernetes or AWS ECS
  • Operational experience supporting LLMs or ML model deployments in production
  • Sound knowledge of security best practices in cloud and DevOps environments
  • Excellent verbal and written communication skills

Preferred Qualifications:

  • AWS Certification (Solutions Architect, DevOps Engineer, or equivalent)
  • Experience with Azure ML or similar MLOps platforms
  • Exposure to ML model versioning tools (e.g., MLflow, DVC)
  • Experience with API management and serverless architectures

Date Posted: 02 May 2025