Infrastructure Engineer

Kozhikode, Kerala

HybridAI
Job Expired - Click here to search for similar jobs

About HybridAI and its SaaS product:


HybridAI is a cutting-edge SaaS and on-premise platform designed to revolutionize how companies manage and optimize their AI and IT infrastructure resources throughout the datacenter lifecycle. From CPU and GPU to memory, storage, and network, HybridAI harmonizes these critical components to drive significant improvements in performance, power efficiency, and cost.


Our platform, designed by esteemed professors in AI/ML and industry CIOs and brought to market by experienced software industry executives, addresses the critical challenges CIOs face in navigating the rapidly evolving AI & IT landscape. Join us at HybridAI and be part of a team that is shaping the future of AI infrastructure with innovation, expertise, and a commitment to excellence.


Location: Remote

Experience: 4-5 years

Type: 3-month contract


Role Overview:

We are looking for a hands-on Infrastructure Engineer with deep expertise in VMware and OpenShift Kubernetes, combined with a proven track record of GPU and CPU optimization for high-performance workloads. The ideal candidate should have experience in deploying large language models (LLMs) on GPU-accelerated infrastructure, managing GPU allocation and tuning, and implementing Operators in OpenShift environments.


The candidate should have a strong understanding of infrastructure security, with practical knowledge of ISO 27001 compliance, and a passion for working in fast-paced startup environments. The right candidate is both a builder and an optimizer, comfortable getting deep into systems while aligning performance, security, and compliance goals. This role will report to the Director of AI and Software Engineering.


Key Responsibilities:

  • Create and manage development (Dev), UAT, and production (Prod) environments on bare metal and Red Hat Linux-based servers.
  • Harden Linux servers for security compliance, ensuring systems pass VAPT (Vulnerability Assessment & Penetration Testing).
  • Develop CI/CD pipelines from GitHub to Linux-based VMs running OpenShift Kubernetes clusters.
  • Ensure high availability, observability, and proactive alerting for the HybridAI SaaS platform.
  • Automate deployment of the HybridAI InfraMetrics Collector in customer on-prem environments.
  • Work with VMware vCenter and Kubernetes Cluster APIs to manage infrastructure resources, automate deployments, and provide guidance on VM optimization.
  • Enable build cycles with expertise in virtualization, container orchestration, and hybrid infrastructure.
  • Deploy LLMs on GPU infrastructure, ensuring optimal resource allocation and scaling for AI-driven applications.
  • Monitor infrastructure performance and implement proactive scaling solutions.
  • Collaborate with the Head of Software Engineering to enforce API security, access control, and compliance policies.
  • Implement secure and compliant infrastructure aligned with ISO 27001 standards.

Experience:

  • 4+ years of hands-on DevOps and infrastructure engineering experience managing enterprise-grade datacenter environments.
  • Strong experience with Red Hat Linux and bare metal infrastructure management.
  • Expertise in Linux security hardening (firewall configuration, SELinux, system patching).
  • Deep knowledge of OpenShift Kubernetes (OCP) and container orchestration.
  • Hands-on experience in CPU/GPU profiling, resource allocation, and performance tuning.
  • Experience with infrastructure as code (Terraform, Ansible).
  • Proficiency in CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, ArgoCD) for OpenShift & Linux-based deployments.
  • Hands-on experience with the VMware stack (ESXi, vCenter, vMotion).
  • Cloud and on-prem experience, with exposure to AWS, GCP, Azure, and private cloud platforms.
  • Scripting and automation expertise (Bash, Python, PowerShell).
  • Strong security background, including API security, authentication (OAuth, JWT, mTLS), and compliance with CIS benchmarks.
  • Experience with one or more observability and monitoring tools, such as NVIDIA DCGM, Prometheus & Grafana, ELK Stack, DataDog, Splunk, or AppDynamics.
  • Solid experience in ISO 27001 compliance, security best practices, and policy implementation.
  • Comfortable working in agile, very fast-paced startup environments with ownership of infra outcomes.

Nice-to-Have Skills:

  • Experience with service mesh architectures (Istio, Linkerd).
  • Familiarity with Zero Trust security models.
  • Exposure to air-gapped Kubernetes deployments for security-sensitive environments.
  • Experience with automated compliance enforcement tools (OpenSCAP, Falco, Aqua Security).
  • Knowledge of hybrid cloud networking (VPCs, VPNs, private links between on-prem and cloud).
  • Hands-on experience with HashiCorp Vault for secrets management.
  • Exposure to additional compliance frameworks such as SOC 2 or NIST.
  • Experience with AI/ML or HPC workloads beyond LLM applications.

Date Posted: 26 April 2025