Sr. Site Reliability Engineer - Incident Response
Company: Cox Automotive
Location: Smyrna, Georgia
Posted on: January 1, 2026
Job Description:
The Site Reliability Engineer - Incident Response is a critical
enterprise-level role responsible for accelerating incident
resolution and enhancing the overall incident management process.
This individual partners with engineering teams during active
incidents to troubleshoot issues using monitoring and logging
tools, and post-incident, delivers executive-level summaries that
clearly communicate impact, root cause, and resolution. The SRE -
Incident Response also plays a key role in analyzing incident
response effectiveness and identifying opportunities for systemic
improvements.

Core Competencies and Qualifications:
Bachelor's degree in a related discipline and 4 years' experience
in a related field. The right candidate could also have a different
combination, such as a master's degree and 2 years' experience; a
Ph.D. and up to 1 year of experience; or 16 years' experience in a
related field.

Applicants must currently be authorized to work in the United
States for any employer without current or future sponsorship. No
OPT, CPT, STEM/OPT, or visa sponsorship now or in the future.

Engineering/Tooling: Demonstrates the ability to design,
build, and maintain engineering solutions and tools that enhance
reliability, automate incident response, and reduce operational
toil.

Incident Troubleshooting: Skilled in interpreting logs, metrics,
and traces to assist in identifying root causes during live
incidents.

Monitoring & Observability: Proficient in tools such as Datadog,
Splunk, New Relic, or similar platforms.

Strong programming background in Python, Java, or C#, with
experience building, maintaining, and troubleshooting
production-grade services and automation tools.

Proven ability to design and implement reliable, scalable, and
highly available systems, leveraging software engineering best
practices to improve system resilience and operational efficiency.

Experience developing automation and tooling to reduce toil,
improve incident response, and support continuous improvement
across monitoring, deployment, and recovery processes.

Ability to collaborate closely with software engineering teams to
influence architecture and operational readiness, ensuring
reliability is built into the system from design through
production.

AI-Centric Engineering: Effectively leverages artificial
intelligence (AI) and machine learning (ML) tools to automate,
optimize, and enhance daily engineering and incident response
tasks.

Analytical Rigor: Strong attention to detail in validating incident
data and identifying trends or gaps in response.

DevOps & Architecture Knowledge: Understanding of full-stack
systems, CI/CD pipelines, caching, scaling, and cloud-native
infrastructure.

Metrics & Reporting: Capable of calculating and interpreting key
metrics like MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to
Resolve).

Here are the responsibilities of this role when not tied to active
on-call:

Post-Incident Review Development:
Draft and deliver executive summaries post-incident.
Develop and coach teams on blameless postmortems.
Create templates, train facilitators, and help guide root cause
analysis (e.g., 5 Whys, fishbone diagrams).
Maintain a central library of learnings and cross-cutting themes.

Incident Process Improvement:
Actively support engineering teams during incidents by helping
diagnose and resolve issues quickly.
Navigate and analyze data from observability platforms to make
informed inferences about root causes.
Analyze the effectiveness of incident response to identify systemic
reliability gaps.
Standardize incident response workflows (incident roles, comms,
escalation paths).
Create or refine runbooks, incident command frameworks, and
severity classification guides.

Metrics and Insights:
Build dashboards around incident frequency, MTTR, MTTA, and
recurrence rates.
Use incident data to drive reliability OKRs or engineering
investments.

Tooling & AI Solutions:
Partner with engineering teams to identify repetitive or
high-impact tasks suitable for automation.
Develop, implement, and continuously improve custom scripts, bots,
and AI-driven workflows for monitoring, alerting, and incident
triage.
Evaluate and integrate emerging AI/ML technologies to optimize
detection, root cause analysis, and reporting.
Ensure all tools and automations are secure, maintainable, and
aligned with organizational standards and SRE best practices.
Document and socialize new tools and AI solutions, enabling
adoption and knowledge sharing across teams.

Cross-Team Collaboration:
Collaborate with Engineering Managers and Incident Commanders to
gather and validate incident data.
Partner with product teams, infra, and leadership to socialize
reliability best practices.
Act as a reliability "consultant" to squads that have impactful
incidents.
Recommend enhancements to monitoring, alerting, and response
processes to reduce future incident impact.

USD 99,000.00 - 165,000.00 per year

Compensation:
Compensation includes a base salary of $99,000.00 - $165,000.00.
The base salary may vary within the anticipated base pay range
based on factors such as the ultimate location of the position and
the selected candidate's knowledge, skills, and abilities. Position
may be eligible for additional compensation that may include an
incentive program.

Benefits:
The Company offers eligible employees the flexibility to take as
much vacation with pay as they deem consistent with their duties,
the company's needs, and its obligations; seven paid holidays
throughout the calendar year; and up to 160 hours of paid wellness
annually for their own wellness or that of family members.
Employees are also eligible for additional paid time off in the
form of bereavement leave, time off to vote, jury duty leave,
volunteer time off, military leave, and parental leave.

Keywords: Cox Automotive, Marietta, Sr. Site Reliability Engineer - Incident Response, IT / Software / Systems, Smyrna, Georgia