Core SRE concepts questions
"Explain SLI, SLO, and SLA." An SLI is the actual measurement: request success rate, p99 latency, availability. An SLO is your internal target: 99.9% success rate. An SLA is a customer contract specifying consequences if the SLO is missed. SLOs should be aspirational but achievable; SLAs should be set below your actual SLO to maintain a buffer before contractual penalties.
"What is an error budget?" The allowable downtime implied by your SLO. At 99.9% uptime, your monthly error budget is 43 minutes. If you have consumed 90% of your budget, slow feature releases and prioritise reliability work. If your budget is healthy, move fast. Error budgets make the reliability trade-off explicit and remove the adversarial dynamic between product and SRE.
Incident management questions
"Walk me through how you would respond to a P1 incident where 30% of users cannot log in." Strong answer: Declare the incident (set up channel, assign commander), triage (check recent deploys, examine error logs, check metrics), mitigate first (rollback if a recent deploy correlates; do not wait for root cause), communicate proactively (status update within 15 minutes), identify root cause after mitigation, write a blameless postmortem within 48 hours with specific action items.
"What makes a good postmortem?" Blameless (focuses on system failures, not individuals), has a clear timeline, identifies contributing factors at multiple levels, and produces specific actionable items with owners and due dates. The goal is to prevent the class of incident from recurring, not just to document what happened.
Technical questions
Common SRE technical topics: Kubernetes resource limits and requests, Prometheus alerting rules and label cardinality, distributed tracing (OpenTelemetry), chaos engineering (Chaos Monkey, GameDays), Linux performance tools (top, vmstat, strace, perf), and reading flame graphs for CPU profiling. SRE roles at Google, Meta, and Stripe include LeetCode-style coding rounds alongside the operational topics, so prepare algorithm questions in addition to systems knowledge.
Toil reduction questions
"What is toil and how do you decide what to automate first?" Toil is repetitive, manual, automatable work that scales with service growth. Prioritise by: measure time cost per week, estimate automation effort, calculate payback period. Focus on high-frequency, high-time-cost toil with low automation complexity. Target is keeping toil below 50% of the team's time. Google SRE Book (free online) defines and elaborates this concept — read it before any SRE interview, as interviewers draw directly from it.