DevOps interviews combine deep technical questions with cultural and process questions about how you approach reliability, deployment velocity, and incident response. Many candidates over-prepare on tools and under-prepare on the "why" behind their technical decisions. Interviewers in mature engineering organisations want to understand how you think, not just which commands you know.
How DevOps interviews work
Typical structure for a DevOps engineer interview process:
- Technical screening (often a take-home task: write a CI/CD pipeline, containerise an app, write a Terraform module)
- Systems design round: design a deployment pipeline, a monitoring stack, or a disaster recovery plan
- Technical deep-dive: tools, architecture decisions, incident experience
- Behavioural round: on-call experience, team collaboration, handling production incidents
CI/CD questions
"Explain your ideal CI/CD pipeline." A strong answer walks through stages: source control trigger, build step, unit testing, integration testing, security scanning (SAST/dependency scanning), artifact creation, staging deployment, smoke tests, production deployment (ideally with a progressive rollout or feature flags), and post-deploy monitoring. Adapt to your actual experience — don't describe a pipeline you've never built.
"How do you handle a deployment that causes a production incident?" Rollback mechanism is the first line: feature flag toggle or revert the deployment. Immediate investigation starts in parallel. Post-incident review (blameless postmortem) to find the root cause and prevent recurrence. Key: having rollback capability in the pipeline design, not improvising under pressure.
"What's the difference between blue-green and canary deployments?" Blue-green: two identical environments, traffic switched all at once to the new version. Canary: a percentage of traffic goes to the new version first, rolled out progressively. Blue-green has near-instant rollback; canary allows catching issues before full exposure. Which to choose depends on risk tolerance and infrastructure cost.
Containers and orchestration questions
"What's the difference between a Docker image and a container?" An image is a read-only template — a snapshot of a filesystem and application. A container is a running instance of that image, with its own isolated process space and writable layer. Multiple containers can run from the same image.
"How does Kubernetes achieve high availability?" Replication controllers and replica sets ensure a specified number of pod replicas are always running. The scheduler redistributes pods if a node fails. Health checks (liveness and readiness probes) restart unhealthy containers. Combined with pod disruption budgets and horizontal pod autoscaling, the cluster can absorb node failures and load spikes.
"When would you use a sidecar container pattern?" When you need a secondary container to augment the main container without changing it — e.g., a log shipper collecting application logs, a service mesh proxy (like Envoy in Istio), or a secret-refresh agent. Avoids coupling infrastructure concerns into the application image.
Infrastructure as code questions
"What's the benefit of infrastructure as code over manual provisioning?" Reproducibility, version control, auditability, and consistency across environments. Infrastructure changes can be reviewed in pull requests, rolled back with git, and automatically validated. Manual provisioning is error-prone and creates undocumented state drift.
"How do you manage Terraform state?" Remote state stored in S3 (with DynamoDB for state locking to prevent concurrent modifications) or Terraform Cloud. Never commit state files to version control — they contain sensitive values. State locking is essential in team environments.
"What's the difference between Ansible and Terraform?" Terraform is primarily for provisioning cloud infrastructure (declarative, idempotent, state-aware). Ansible is primarily for configuration management and application deployment (procedural in execution order, agentless). They're often used together: Terraform provisions the infrastructure, Ansible configures what runs on it.
Monitoring and incident management
"What would you include in a production monitoring stack?" Metrics (Prometheus or CloudWatch), logs (ELK stack or Loki/Grafana), traces (OpenTelemetry or Datadog APM), alerting (PagerDuty or Opsgenie), and dashboards (Grafana). The four golden signals: latency, traffic, errors, saturation.
"How do you distinguish between a symptom and a cause during an incident?" Symptoms are what you observe in monitoring (latency spike, error rate increase, memory saturation). Root causes are what explain why. Drilling from symptom to cause requires structured investigation: correlate timing, check recent deployments, examine dependency health, trace specific requests. The symptom is what pages you; the cause is what you fix.
Behavioural questions
"Tell me about a significant production incident you were involved in." Use a specific example: timeline, your role in response, how the team coordinated, what the root cause was, and what changed afterwards. Show you contributed to the postmortem and the fix, not just the firefighting.
"How do you approach on-call rotations and managing personal sustainability?" Shows maturity and awareness of burnout risk. Strong answers include: setting up runbooks so alerts are actionable, actively improving alert signal-to-noise ratio, and advocating for reducing unnecessary pages rather than just enduring them.