Job Summary: We are seeking a highly skilled Site Reliability and Operations Engineer (SRE) with a robust background in Kubernetes-based distributed caching and compute grid systems. The ideal candidate will possess a solid blend of infrastructure engineering and software development skills. This role will focus on the design, implementation, and maintenance of high-performance distributed platforms to ensure high availability, scalability, and system observability.
Job Responsibilities:
Development & Implementation:
Design, build, and enhance distributed caching and compute grid solutions on Kubernetes/OpenShift platforms.
Use technologies such as IBM Spectrum Symphony, TIBCO GridServer, or similar for high-throughput compute grids.
Use container tooling (Docker, Helm) to package and deploy microservices and container workloads.
Apply parallel compute strategies and optimize load balancing for application performance.
Site Reliability Engineering (SRE):
Ensure platform reliability, scalability, and minimal downtime by maintaining robust distributed systems.
Implement and maintain observability and monitoring using Prometheus, Grafana, ELK, or OpenTelemetry.
Automate infrastructure provisioning and deployments using Ansible, Helm Charts, and similar tools.
Troubleshoot complex system and infrastructure issues in Kubernetes environments.
Support CI/CD processes using tools such as Jenkins, Argo CD, and GitHub Actions.
Required Skills & Qualifications:
- Strong experience with Kubernetes, including OpenShift, across both on-prem and cloud environments.
- Proficiency in at least one programming language: Java, Go, or Python.
- In-depth knowledge of container tooling such as Docker and Helm.
- Hands-on experience with CI/CD tools and pipeline integration.
- Expertise in observability and monitoring using Prometheus, Grafana, Loki, and Jaeger.
- Knowledge of service meshes like Istio or Linkerd.
- Experience in multi-cluster and hybrid cloud Kubernetes deployments.
- Solid understanding of networking, security practices, and performance optimization in distributed systems.
Preferred Qualifications:
- Experience with high-performance computing platforms or grid computing frameworks.
- Familiarity with distributed caching strategies and data sharding.
- Strong communication and documentation skills.
- Relevant certifications (e.g., CKAD, CKA, Red Hat Certified Specialist in OpenShift).
Education: Bachelor's degree