JustinBradley's client, a leading source of mortgage financing, is seeking an AWS Incident Management Specialist to manage IT production incidents to resolution in a 24/7/365 environment using the client's incident management processes. You will guide incident triage calls from a technical perspective, use monitoring tools and dashboards to aid troubleshooting, share technical insights, outline resolution activities, and drive improvements in incident management processes. You will also provide regular status updates to stakeholders, assist with postmortem activities, and support operational enhancements and application maintenance in production.
Key Responsibilities:
- Incident Management: Lead and manage IT production incidents to resolution using incident management processes. Communicate the incident status, impact, and resolution actions effectively to stakeholders. Participate in triage calls and manage incident response in a timely and accurate manner.
- AWS Expertise: Apply hands-on experience managing and monitoring AWS-based applications. Troubleshoot and resolve incidents related to AWS cloud infrastructure (EC2, ELB, RDS, Redshift, DynamoDB, Aurora, Route 53, ECS, Lambda, S3, CloudWatch, WAF, etc.) in real time.
- Performance Engineering: Conduct performance engineering for AWS cloud applications. Utilize tools like Dynatrace and Splunk for transaction-level monitoring and troubleshooting. Leverage AWS tools and resources to analyze and resolve incidents promptly.
- Monitoring Tools Management: Manage and monitor AWS cloud applications and underlying infrastructure using monitoring tools like ExtraHop, SolarWinds, Netcool, Catchpoint, Moogsoft, and others. Analyze dashboards and monitoring data to identify trends and patterns in application performance and health.
- Incident Triage & Resolution: Lead and guide technical incident triage calls, analyze various components of the infrastructure (AWS, UNIX, DNS, LDAP, SSL, etc.), and perform detailed root-cause analysis using wire data analytics, event correlation, and performance management tools.
- Documentation & Postmortems: Assist with the creation of Root Cause Analysis (RCA) and Correction of Error (COE) documentation. Participate in postmortem activities and recommend improvements to prevent future incidents. Ensure effective follow-up on items that could negatively impact production operations.
- Process Improvement: Recommend and implement improvements to incident management processes. Provide recommendations on process changes, create reports, and respond to ad-hoc requests from senior management.
- On-Call Support: Participate in an on-call rotation, working nights, weekends, and holidays as required to provide continuous support for incident management and resolution.
- Stakeholder Communication: Report incident details and metrics to senior leadership. Effectively communicate complex technical issues to non-technical stakeholders.
Education & Experience:
- Education: Bachelor's degree or equivalent required.
- Experience: Minimum of 6 years of relevant experience managing IT incidents and troubleshooting in a cloud environment, particularly AWS.
Specialized Knowledge & Skills:
- Extensive experience managing AWS cloud environments, including services like EC2, RDS, Lambda, DynamoDB, CloudWatch, and more.
- Hands-on experience troubleshooting infrastructure and application incidents on AWS.
- Experience with transaction-level monitoring using tools like Dynatrace and Splunk.
- Expertise in analyzing various components of the application and infrastructure, including AWS, UNIX, LDAP, DNS, SSL, and databases (Oracle/MS SQL).
- Proven ability to manage complex incidents and lead triage calls with cross-functional technical teams.
- Strong communication skills, including the ability to convey technical details to non-technical stakeholders.
- Ability to multitask and perform well under pressure.
- Familiarity with monitoring and observability tools such as SolarWinds, ExtraHop, Moogsoft, and Catchpoint.
- AWS certifications (e.g., AWS Certified Solutions Architect - Associate or higher) preferred.
Preferred Qualifications:
- Familiarity with tools like CloudFormation or Terraform.
- Experience troubleshooting middleware products in UNIX/Linux environments and knowledge of Service-Oriented Architecture (SOA), Java, etc.
- Exposure to other cloud platforms like Azure or Google Cloud.
- Experience with OpenTelemetry and monitoring dashboards for incident detection and alerting.
Work Environment:
- 24/7/365 operational support environment.
- Ability to work various shifts, including nights, weekends, and holidays, as required.
JustinBradley is an EO employer - Veterans/Disabled and other protected employees.