Senior Site Reliability Engineer

Dallas, Texas

Saxon Global
Job Summary:

We are looking for a Site Reliability Engineer (SRE) who will be responsible for ensuring the reliability, availability, and performance of our production systems. As an SRE, you will work closely with cross-functional development and engineering teams to design and implement tools and processes that automate deployment, observability, and troubleshooting of our applications and infrastructure. This work supports the deployment of new Android tablets to the stores.

Candidates must have professional experience with the core functions of Site Reliability Engineering, including deployments, observability, monitoring, telemetry, and automation.

In your resume, please call out your experience in these areas and how it matches the requirements below.

Responsibilities:

Ensure the reliability, availability, and performance of our production systems as we scale

Develop and maintain monitoring and alerting systems to detect and respond to incidents in a timely manner

Occasionally support planned deployment rollouts, which may require working off-hours while stores are closed (there is no on-call rotation)

Work with cross-functional teams to plan and execute scaling initiatives

Develop and maintain documentation of processes, procedures, and technical configurations

Requirements:

Strong written and verbal communication skills with peers, technical leads, project managers and product owners

Must be able to collaborate with customers and cross-functional teams to design, test, and validate deliverables that meet or exceed expectations

Highly motivated, well-organized self-starter

Bachelor's degree in Computer Science or related field

5+ years of experience as a Site Reliability Engineer

Strong experience with automation tools and with automation scripting in Python

Experience with containerization technologies such as Docker and Kubernetes

Experience with cloud platforms such as Azure or AWS

Experience with monitoring and logging tools such as Datadog, Prometheus, Grafana or Splunk

Strong understanding of networking, security, and systems administration

Excellent problem-solving skills and attention to detail

Must be available to work core hours in the Pacific time zone (PST)

Preferred qualifications:

Experience with distributed systems and supporting a large retail business

Experience with infrastructure as code tools such as Terraform or CloudFormation

Experience with CI/CD tools such as Jenkins

Experience with incident ticketing systems such as ServiceNow, and with Jira for tracking stories

Familiarity with Agile/Scrum methodologies and DevOps principles

If you are passionate about ensuring the reliability and availability of systems in our stores and enjoy collaborating with cross-functional teams to solve complex problems, we encourage you to apply for this exciting opportunity as an SRE.
Date Posted: 01 April 2025