How data engineer interviews work

Data engineer interviews test SQL proficiency, pipeline and ETL design, data modelling, cloud platform knowledge, and behavioral competencies. Most interview loops include a SQL round, a coding round (usually Python), a system design round focused on data architecture, and one or two behavioral interviews. The balance between SQL depth and Python complexity varies by company and team.

Companies that run large internal data platforms weight system design and distributed systems knowledge heavily. Companies building ML infrastructure put more weight on data pipeline reliability and feature engineering experience. Find out which environment you are joining and calibrate your preparation accordingly.

SQL questions

"Write a query to find the second highest salary in an employee table." Data engineer SQL rounds go beyond basic SELECT queries. Expect window functions (RANK, DENSE_RANK, ROW_NUMBER, LAG/LEAD), CTEs, self-joins, and aggregation with HAVING. Practice writing clean, readable SQL under time pressure. Explain your approach before you write rather than diving straight in.

"How would you identify duplicate records in a large table and handle them without deleting the original data?" This tests practical data engineering thinking. Use ROW_NUMBER() partitioned by the deduplication key to assign row numbers, then select only rows with row number equal to 1 into a new table or use a flag column. Discuss how you would handle this at scale with partitioned tables.

Pipeline and ETL design questions

"How would you design an ETL pipeline that ingests 10 million rows per day from multiple source systems?" Cover: ingestion pattern (batch vs streaming), data validation and quality checks, transformation logic, loading strategy (full refresh vs incremental), error handling and retry logic, and monitoring. The interviewer wants to see you think through reliability and observability from the start, not as an afterthought.

"What is the difference between a data lake and a data warehouse, and when would you use each?" A data lake stores raw, unprocessed data in its native format at low cost. A data warehouse stores transformed, structured data optimised for query performance and analytics. Many modern architectures use a lakehouse approach combining both. Discuss the tradeoffs in terms of query latency, storage cost, data governance, and use case fit.

Data modelling questions

"Explain the difference between a star schema and a snowflake schema." A star schema has a central fact table connected directly to dimension tables. A snowflake schema normalises dimensions further, splitting them into sub-dimensions. Star schemas are simpler and faster for queries; snowflake schemas save storage and reduce data redundancy but require more joins. Most analytical workloads prefer star schemas for read performance.

"How would you model a user activity table for a mobile app that needs to support both real-time and historical queries?" Discuss partitioning by date for historical queries, event streaming for real-time, and the separation of raw event storage from aggregated summary tables. Show that you think about query patterns when designing the model, not just storage efficiency.

Cloud and tooling questions

Data engineer interviews almost always include questions about cloud platforms and distributed processing tools. "What is your experience with Spark, and how does it differ from traditional SQL databases for large-scale transformations?" Spark distributes computation across a cluster, making it suited to transformations on data too large for a single machine. Understand the execution model (DAGs, stages, shuffles) and common performance issues such as data skew and excessive shuffles.

Know the major cloud data tools: AWS (Redshift, Glue, S3, Athena), GCP (BigQuery, Dataflow, Cloud Storage), Azure (Synapse, Data Factory). You do not need deep expertise in all three, but you should be able to discuss the architecture of one stack and compare tradeoffs between managed services and open-source alternatives like Airflow, dbt, and Spark.

Behavioral questions

"Tell me about a data pipeline you built that broke in production and how you handled it." Data engineers deal with pipeline failures regularly. Show that you had monitoring and alerting in place, that you diagnosed the root cause systematically, communicated the impact clearly to stakeholders, and implemented a fix along with safeguards to prevent recurrence. Failures handled well are a stronger story than things that always worked.

"How do you prioritise data quality work versus feature work when the team has limited capacity?" This tests business judgment alongside technical skill. Show that you understand data quality as a foundational requirement for downstream work, that you can quantify the cost of bad data in business terms, and that you know how to make a case for quality investment to a product manager or engineering lead who may be focused on shipping features.


Frequently asked questions

Is data engineering closer to software engineering or data science?
Data engineering is closer to software engineering in terms of day-to-day work. The job involves writing production code, designing reliable systems, and managing infrastructure. The domain knowledge is data-specific (pipelines, warehouses, schemas) rather than statistical or ML-focused. Strong data engineers think like software engineers applied to the data domain, prioritising reliability, observability, and maintainability.

Do I need to know machine learning for a data engineer role?
For most data engineering roles, you do not need ML expertise. You need to understand how ML models consume data so you can build appropriate feature pipelines, but you are not expected to train or evaluate models yourself. Some companies hire data engineers specifically to build ML infrastructure, in which case more ML knowledge is expected. Check the job description carefully.

What SQL level is expected in data engineer interviews?
Advanced SQL is expected. You should be comfortable with window functions, CTEs, recursive queries, partitioning, and performance optimisation (index selection, reading query plans). Basic SELECT and JOIN queries are not sufficient to pass a data engineering SQL round at most companies. Practice writing complex multi-step queries involving aggregation, ranking, and time-series analysis.