Site Reliability Engineer

Bellevue, Washington

Infosys

Required Qualifications:

Bachelor's degree or foreign equivalent required from an accredited institution. Will also consider three years of progressive experience in the specialty in lieu of every year of education.
At least 2 years of Information Technology experience.
SRE mindset in production support: proactive issue identification using observability tools.
Skilled in using a range of monitoring and observability tools to track system performance.
Incident commander: Ability to diagnose complex issues and actively drive incident calls working with technical, product SMEs, and Tier 2 SREs.
Experience with Splunk (including Splunk APM and Splunk O11y) and AppDynamics
Experience with databases, networking, Linux/Unix, and Kubernetes
Experience using APM tools, nmon, and Wireshark for analysis

Preferred Qualifications:

Knowledge of Grafana, RedMetrics, ThousandEyes
Knowledge of VMs, load balancers, firewalls, and API gateways
Knowledge of Containerization, Docker, AWS, PCF, GCP, ServiceNow (including AIOps, tools for Self-Heal and automated playbooks)
Experience in UEM and synthetic monitoring tools
System Administration: Strong knowledge of infrastructure, including command-line tools and system internals (Kubernetes triage, Linux administration).
Networking: Understanding of network protocols, configurations, and troubleshooting. (nmon, Wireshark)
Cloud Computing: Understanding of cloud platforms, including cloud architecture (on-prem and public) and services (AWS and Azure).
Application Management: Familiarity with continuous integration and continuous deployment processes and tools.
Advanced programming knowledge: Experience triaging issues in application code (Java, Python).
DB troubleshooting: Familiarity with troubleshooting issues in traditional and NoSQL databases (e.g., Oracle, SQL Server, MySQL, MongoDB, Cassandra).
Monitoring and Observability: Skills in using monitoring tools to track system performance and detect issues across all back-end systems, databases, and APIs (Splunk, AppDynamics, Splunk O11y, OpenTelemetry).
Ability to diagnose and resolve complex issues quickly and efficiently
Collaboration: Strong communication skills to work effectively with cross-functional teams
Adaptability: Flexibility to handle changing priorities and technologies
Attention to Detail: Precision in managing configurations and deployments to avoid errors
Communication: Excellent communicator who can interact with Directors, Sr. Directors, and above.
Production support activities including proactive identification of issues leveraging observability tools with the aim of reducing MTTD and MTTR
Coordinate all activities required to lead incident triage in compliance with SLAs and OLAs, correlating inputs from various dashboards and tools to drive resolution.
Flexibility to work in rotation (as and when needed)

Date Posted: 21 May 2025
