About HybridAI Company and SaaS product:
HybridAI is a cutting-edge SaaS and on-premise platform designed to revolutionize how companies manage and optimize their AI and IT infrastructure resources throughout the datacenter lifecycle. From CPU and GPU to memory, storage, and network, HybridAI harmonizes these critical components to drive significant improvements in performance, power efficiency, and cost.
Our platform, designed by esteemed professors in AI/ML and industry CIOs and brought to market by experienced software industry executives, addresses the critical challenges CIOs face in navigating the rapidly evolving AI & IT landscape. Join us at HybridAI and be part of a team that is shaping the future of AI infrastructure with innovation, expertise, and a commitment to excellence.
Location: Remote
Experience: 4-5 years
Type: 3-month contract
Role Overview:
We are looking for a hands-on Infrastructure Engineer with deep expertise in VMware and OpenShift Kubernetes, combined with a proven track record of GPU and CPU optimization for high-performance workloads. The ideal candidate should have experience in deploying large language models (LLMs) on GPU-accelerated infrastructure, managing GPU allocation and tuning, and implementing Operators in OpenShift environments.
The candidate should have a strong understanding of infrastructure security, with practical knowledge of ISO 27001 compliance, and a passion for working in fast-paced startup environments. The right candidate is both a builder and an optimizer, comfortable getting deep into systems while aligning performance, security, and compliance goals. This role will report to the Director of AI and Software Engineering.
Key Responsibilities:
- Create and manage development (Dev), UAT, and production (Prod) environments on bare metal and Red Hat Linux-based servers.
- Harden Linux servers for security compliance, ensuring systems pass VAPT (Vulnerability Assessment & Penetration Testing).
- Develop CI/CD pipelines from GitHub to Linux-based VMs running OpenShift Kubernetes clusters.
- Ensure high availability, observability, and proactive alerting for the HybridAI SaaS platform.
- Automate deployment of the HybridAI InfraMetrics Collector in customer on-prem environments.
- Work with VMware vCenter and Kubernetes Cluster APIs to manage infrastructure resources and automate deployments, and provide guidance on VM optimization.
- Enable build cycles with expertise in virtualization, container orchestration, and hybrid infrastructure.
- Deploy LLMs on GPU infrastructure, ensuring optimal resource allocation and scaling for AI-driven applications.
- Monitor infrastructure performance and implement proactive scaling solutions.
- Collaborate with the Head of Software Engineering to enforce API security, access control, and compliance policies.
- Implement secure and compliant infrastructure aligned with ISO 27001 standards.
Experience:
- 4+ years of hands-on DevOps and infrastructure engineering experience managing enterprise-grade datacenter environments.
- Strong experience with Red Hat Linux and bare metal infrastructure management.
- Expertise in Linux security hardening (firewall configuration, SELinux, system patching).
- Deep knowledge of OpenShift Kubernetes (OCP) and container orchestration.
- Hands-on experience in CPU/GPU profiling, resource allocation, and performance tuning.
- Experience with infrastructure as code (Terraform, Ansible).
- Proficiency in CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, ArgoCD) for OpenShift & Linux-based deployments.
- Hands-on experience with the VMware stack (ESXi, vCenter, vMotion).
- Cloud and on-prem experience, with exposure to AWS, GCP, Azure, and private cloud platforms.
- Scripting and automation expertise (Bash, Python, PowerShell).
- Strong security background, including API security, authentication (OAuth, JWT, mTLS), and compliance with CIS benchmarks.
- Experience with any of the following observability and monitoring tools: NVIDIA DCGM, Prometheus and Grafana, ELK Stack, Datadog, Splunk, or AppDynamics.
- Solid experience in ISO 27001 compliance, security best practices, and policy implementation.
- Comfortable working in agile, very fast-paced startup environments with ownership of infra outcomes.
Nice-to-Have Skills:
- Experience with service mesh architectures (Istio, Linkerd).
- Familiarity with Zero Trust security models.
- Exposure to air-gapped Kubernetes deployments for security-sensitive environments.
- Experience with automated compliance enforcement tools (OpenSCAP, Falco, Aqua Security).
- Knowledge of hybrid cloud networking (VPCs, VPNs, private links between on-prem and cloud).
- Hands-on experience with HashiCorp Vault for secrets management.
- Exposure to additional compliance frameworks such as SOC 2 or NIST.
- Experience with AI/ML or HPC workloads beyond LLM applications.