Site Reliability Engineer

Irving, Texas

Resource Informatics Group
Role: SRE (Observability) Engineer
Start Date: December 16, 2024

Note: Candidates must complete a coding test before the interview. This is an immediate-closure opportunity.

This position is remote. Candidates must pass a HackerEarth Assessment that evaluates their skills in automation (Chef, Ansible, Terraform), Python, and general SRE. Please stay on top of your submitted candidates, as we will interview those who qualify next week.

Description

We are seeking a highly skilled SRE (Observability) Engineer with a deep understanding of modern observability practices and tools. The ideal candidate will have hands-on experience with provisioning, configuring, and developing infrastructure solutions, along with a strong focus on automation, scalability, and reliability. This role involves a mix of development, system architecture, and troubleshooting responsibilities, providing opportunities to influence the evolution of our infrastructure.
Responsibilities
  • Design, implement, and manage observability solutions using tools like Dynatrace, Prometheus, Thanos, or Grafana.
  • Develop metrics, alerts, and silences for comprehensive system monitoring.
  • Automate infrastructure tasks using Chef (recipes, cookbooks), Ansible (tasks, playbooks), or Terraform, with a strong focus on correct syntax and GitLab CI/CD configuration.
  • Script solutions using Python, PowerShell, or Bash to enable automation across the infrastructure.
  • Propose and implement innovative ideas to reduce manual workload and improve operational efficiency through automation.
  • Provision and configure cloud resources via CLI or APIs on Azure, GCP, or AWS.
  • Troubleshoot and resolve system issues with an SRE (Site Reliability Engineering) mindset, focusing on root cause analysis and corrective actions.
  • Develop and enhance documentation, including application guides, runbooks, and system configurations, ensuring clarity in the "why" and "how" of operations.
  • Plan, design, and execute scalable and redundant system architecture to meet organizational goals.
Required Skills
  • Observability Tools: Hands-on experience with Dynatrace, Prometheus, Thanos, or Grafana.
  • Infrastructure Automation: Proficiency in Chef, Ansible, Terraform, and GitLab CI/CD.
  • Scripting Languages: Advanced skills in Python, PowerShell, or Bash.
  • Cloud Platforms: Proficiency in provisioning and configuring resources on Azure and GCP (AWS experience acceptable).
  • SRE Practices: Familiarity with troubleshooting using SRE principles, root cause analysis, and corrective action planning.
  • Documentation: Strong ability to write clear, concise, and detailed technical documentation and runbooks.
  • System Architecture: Solid understanding of scalability and redundancy principles.
Preferred Skills
  • Kubernetes: Basic understanding of container orchestration and CLI.
  • Linux Administration: Configuration, package management, and troubleshooting expertise.
  • Networking: Knowledge of VPCs, proxies, CDNs, and their integration into scalable systems.
  • Storage Systems: Familiarity with block and object storage configuration.
Date Posted: 23 April 2025