Core SRE concept questions

"Explain SLI, SLO, and SLA." An SLI (service level indicator) is the actual measurement: request success rate, p99 latency, availability. An SLO (service level objective) is your internal target for an SLI, for example a 99.9% success rate. An SLA (service level agreement) is a customer contract specifying consequences, such as service credits, if the agreed service level is missed. SLOs should be ambitious but achievable; SLAs should be looser than your internal SLO so that you have a buffer before contractual penalties apply.

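To make the SLI/SLO/SLA relationship concrete in an interview, it can help to sketch it in a few lines. The example below is a minimal illustration, not a production implementation: the request counts, the 99.5% SLA threshold, and the function names are all assumptions made up for the example. Only the 99.9% SLO comes from the answer above.

```python
def success_rate_sli(good_requests: int, total_requests: int) -> float:
    """Return the measured fraction of successful requests (the SLI)."""
    if total_requests == 0:
        # Assumption: with no traffic, treat the service as meeting its target.
        return 1.0
    return good_requests / total_requests


def meets_target(sli: float, target: float) -> bool:
    """True when the measured SLI is at or above the given target."""
    return sli >= target


SLO_TARGET = 0.999  # internal objective (from the answer above)
SLA_TARGET = 0.995  # hypothetical, looser contractual threshold

# Hypothetical traffic for one measurement window.
sli = success_rate_sli(good_requests=998_500, total_requests=1_000_000)

print(f"SLI: {sli:.2%}")
print("SLO met:", meets_target(sli, SLO_TARGET))
print("SLA met:", meets_target(sli, SLA_TARGET))
```

With these made-up numbers the service misses its internal SLO but still honours the SLA, which is the buffer the looser SLA is meant to provide.
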
"What is an error budget?" The allowable downtime implied by your SLO. At 99.9% uptime, your monthly error budget is 43 minutes. If you have consumed 90% of your budget, slow feature releases and prioritise reliability work. If your budget is healthy, move fast. Error budgets make the reliability trade-off explicit and remove the adversarial dynamic between product and SRE.

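The error budget arithmetic is worth being able to reproduce on a whiteboard. The sketch below assumes a 30-day month and time-based availability; the 39 minutes of downtime and the function names are hypothetical, and the 90% threshold is the one mentioned in the answer above.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_consumed(downtime_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the error budget already used (1.0 means fully spent)."""
    return downtime_minutes / error_budget_minutes(slo, window_days)


def release_policy(consumed: float, slow_down_at: float = 0.9) -> str:
    """Map budget consumption to a simple release stance (illustrative policy)."""
    if consumed >= slow_down_at:
        return "slow feature releases, prioritise reliability work"
    return "budget healthy, keep shipping"


budget = error_budget_minutes(0.999)       # 0.1% of 43,200 minutes
consumed = budget_consumed(39.0, 0.999)    # hypothetical 39 minutes of downtime

print(f"Monthly budget: {budget:.1f} minutes")
print(f"Consumed: {consumed:.1%}")
print("Policy:", release_policy(consumed))
```

The same calculation works for request-based SLOs by replacing minutes with request counts.
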
Incident management questions

"Walk me through how you would respond to a P1 incident where 30% of users cannot log in." Strong answer: declare the incident (set up a channel, assign an incident commander); triage (check recent deploys, examine error logs, check metrics); mitigate first (roll back if a recent deploy correlates; do not wait for root cause); communicate proactively (first status update within 15 minutes); identify the root cause after mitigation; and write a blameless postmortem within 48 hours with specific action items.

"What makes a good postmortem?" A good postmortem is blameless (it focuses on system failures, not individuals), has a clear timeline, identifies contributing factors at multiple levels, and produces specific, actionable items with owners and due dates. The goal is to prevent the class of incident from recurring, not just to document what happened.

Technical questions

Common SRE technical topics: Kubernetes resource requests and limits, Prometheus alerting rules and label cardinality, distributed tracing (OpenTelemetry), chaos engineering (Chaos Monkey, GameDays), Linux performance tools (top, vmstat, strace, perf), and reading flame graphs for CPU profiling. SRE interview loops at companies such as Google, Meta, and Stripe typically include LeetCode-style coding rounds alongside the operational topics, so prepare algorithm questions in addition to systems knowledge.

Toil reduction questions

"What is toil and how do you decide what to automate first?" Toil is repetitive, manual, automatable work with no enduring value that scales with service growth. To prioritise, measure the time cost per week, estimate the automation effort, and calculate the payback period. Focus on high-frequency, high-time-cost toil with low automation complexity. The target is to keep toil below 50% of the team's time. The Google SRE Book (free online) defines and elaborates this concept. Read it before any SRE interview, as interviewers draw directly from it.

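The payback-period prioritisation described above can be sketched as a small ranking script. This is one possible way to frame it, not a standard formula from the SRE Book: the task names, hour estimates, and the `ToilTask` class are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToilTask:
    name: str
    hours_per_week: float    # measured manual time spent on the task
    automation_hours: float  # estimated one-off effort to automate it

    @property
    def payback_weeks(self) -> float:
        """Weeks until the automation effort is repaid by the time saved."""
        if self.hours_per_week <= 0:
            return float("inf")  # nothing to save, so it never pays back
        return self.automation_hours / self.hours_per_week


# Hypothetical toil inventory for a team.
tasks = [
    ToilTask("manual certificate rotation", hours_per_week=2.0, automation_hours=16.0),
    ToilTask("disk cleanup tickets", hours_per_week=5.0, automation_hours=10.0),
    ToilTask("capacity report by hand", hours_per_week=0.5, automation_hours=40.0),
]

# Shortest payback first: frequent, costly toil that is cheap to automate.
ranked = sorted(tasks, key=lambda t: t.payback_weeks)

for task in ranked:
    print(f"{task.name}: pays back in {task.payback_weeks:.0f} weeks")
```

Ranking purely by payback period ignores risk and error rates of the manual process, so in an interview it is worth mentioning those as additional factors.
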
Frequently asked questions

What is the difference between an SRE and a DevOps engineer?
SRE is a specific Google-originated implementation of DevOps principles. SREs typically have stronger software engineering backgrounds and spend significant time building automation and tooling. DevOps engineers tend to focus more on CI/CD pipelines and infrastructure as code. In practice the roles overlap significantly and the distinction is more about company culture than technical content.

Do SRE roles require on-call duty?
Almost always. On-call is core to the SRE role. Most mature SRE organisations maintain rotations with defined escalation paths and compensation for on-call shifts. Excessive on-call load (more than 25% of work time on incidents) signals the system is not reliable enough or the team is under-resourced.