Must Have Technical Skills:
- OpenShift or GKE (Google Kubernetes Engine)
- Expertise in SRE principles and the know-how to apply them to infrastructure (a bridge between infrastructure and development)
- SRE work is largely reactive: dealing with optimizations and issues once the applications are running
- Prometheus
Job Description
SUMMARY
The Site Reliability Engineer (SRE) is responsible for improving system reliability and
resilience. This role focuses on building automation to reduce manual effort and prevent
service-impacting incidents. The SRE combines software and systems engineering to
build and support large-scale, distributed, fault-tolerant systems. This role ensures that
critical platforms are available, reliable, and able to support a fast rate of improvement.
This role relies on monitoring platforms and continually takes a holistic view of system
health and performance. The SRE enhances and supports cloud-based
transformations and focuses on pushing capabilities forward, staying ahead of
customer needs and innovating for continuous improvement. The SRE provides
operational support and engineering for multiple large-scale distributed software
applications.
JOB DUTIES
• Gathers and analyzes metrics from monitoring platforms to assist in performance tuning
and fault tolerance.
• Partners with development teams to improve services through testing and release
procedures.
• Participates in system design, platform management and capacity planning.
• Balances feature development speed and reliability with service-level objectives.
• Works closely with the incident response team to restore service to normal operation.
• Debugs issues and applies troubleshooting skills.
• Investigates, blocks and rate-limits unwanted traffic.
• Utilizes monitoring systems and dashboards for proactive changes and alerting.
• Establishes continuous process improvement cycles where the process, performance,
and supporting technologies are reviewed and enhanced where applicable.
• Performs other duties as assigned.
EDUCATION & EXPERIENCE
Typically requires a bachelor's degree and five (5) to seven (7) years of experience in a
technology and/or software engineering role or an equivalent combination.
KNOWLEDGE, SKILLS, ABILITIES
• Understanding of Kubernetes, containers, clusters and elastic scalability.
• Expertise in SRE principles.
• Mindset of continually finding ways to drive scalability, stability, and performance.
• Cloud Services experience with Google Cloud Platform (GCP).
• Experience with API, service-based or microservice-based architecture.
• Proficiency in infrastructure, network, database, operating systems or security
troubleshooting and remediation.
• Architecture-level knowledge of Windows, Linux and infrastructure systems.
• Experience with production deployment, monitoring and operational support for enterprise-class applications (Dynatrace a plus).
• Experience working with Continuous Integration/Continuous Deployment (CI/CD) tools.
• Experience in performance diagnostics, capacity planning, performance architecture
design, performance tuning and performance monitoring.
• A strong mix of software engineering and operational support skills.
• Knowledge of web technologies (HTTP, proxies, Java, etc.).
• Experience with Azure DevOps (ADO), Dynatrace, Prometheus, Terraform and Grafana.