Data science interviews span four distinct areas: statistics, machine learning, coding (SQL and Python), and communication/business understanding. Few candidates are equally strong across all four. Knowing which areas a specific role weights most heavily, and preparing accordingly, is more effective than trying to be strong everywhere.
How data science interviews are structured
A typical data science interview process includes:
- A take-home assignment or coding screen (often before the full interview)
- A technical round covering statistics, ML fundamentals, and coding
- A product/business case round (especially in tech companies)
- A behavioural round covering past project work
- Possibly a system design or ML design round for senior roles
The balance shifts by company type: tech companies weight coding and ML more heavily; finance and insurance roles weight statistics; consulting roles weight business communication; healthcare analytics roles weight domain knowledge and data governance.
Statistics and probability questions
"What's the difference between a Type I and Type II error?" Type I (false positive): rejecting the null hypothesis when it's true. Type II (false negative): failing to reject the null hypothesis when it's false. In a medical test context: Type I means diagnosing a healthy person as sick; Type II means missing a sick patient.
"How do you choose between a parametric and non-parametric test?" Parametric tests assume specific distributions (usually normal) and are more powerful when those assumptions hold. Non-parametric tests make fewer assumptions and are appropriate for small samples, ordinal data, or distributions that violate normality assumptions.
"Explain p-values to a non-technical stakeholder." The p-value tells you the probability of getting a result at least as extreme as the one you observed, assuming the null hypothesis is true. A small p-value (commonly below 0.05) suggests the result is unlikely to have occurred by chance. Note: p-values are commonly misunderstood — they don't measure the probability that the null hypothesis is true.
"What's the difference between correlation and causation?" Correlation means two variables move together. Causation means one causes the other. Correlation can arise from coincidence, from a third variable affecting both, or from genuine causation. Establishing causation typically requires controlled experimentation (A/B test, RCT) or careful natural experiment design.
Machine learning questions
"Explain overfitting. How do you detect and address it?" Overfitting is when a model learns the training data too well, including noise, and generalises poorly to new data. Signs: large gap between training accuracy and validation accuracy. Solutions: more training data, regularisation (L1/L2), dropout, cross-validation, simplifying the model, or early stopping.
"When would you choose a random forest over logistic regression?" When the relationship between features and outcome is non-linear, when feature interactions matter, and when you have many features including some irrelevant ones. Logistic regression is preferred when interpretability is critical, when you have limited data, or when you need well-calibrated probabilities.
"How do you handle class imbalance?" Techniques: resampling (SMOTE oversampling, undersampling the majority class), using class weights in the model, choosing metrics that reflect imbalance (precision-recall, F1, AUC-ROC rather than raw accuracy), threshold tuning.
SQL and coding questions
Be comfortable with: window functions (RANK, ROW_NUMBER, LAG/LEAD), GROUP BY with aggregations, JOINs (especially LEFT JOIN and how NULLs behave), subqueries and CTEs, and date manipulation.
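Two of these behaviours are easy to check yourself with Python's built-in sqlite3 module. The tables and data below are made up for the example, and the window function assumes the bundled SQLite is version 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER,
                         order_date TEXT, amount REAL);
    INSERT INTO users VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES
        (10, 1, '2024-01-05', 20.0),
        (11, 1, '2024-01-12', 35.0),
        (12, 2, '2024-01-07', 50.0);
""")

# LEFT JOIN keeps users with no orders; their order columns come back as NULL.
# COUNT(o.order_id) skips NULLs, while COUNT(*) counts the NULL-padded row.
order_counts = conn.execute("""
    SELECT u.name, COUNT(o.order_id) AS n_orders, COUNT(*) AS n_rows
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.user_id
    GROUP BY u.user_id, u.name
    ORDER BY u.user_id
""").fetchall()
print(order_counts)  # [('Ana', 2, 2), ('Ben', 1, 1), ('Cy', 0, 1)]

# LAG looks back one row within each user's orders; the first order has no
# predecessor, so previous_order is NULL (None in Python).
with_previous = conn.execute("""
    SELECT user_id, order_date,
           LAG(order_date) OVER (PARTITION BY user_id
                                 ORDER BY order_date) AS previous_order
    FROM orders
    ORDER BY user_id, order_date
""").fetchall()
print(with_previous)
```

The COUNT(*) versus COUNT(column) difference after a LEFT JOIN is a frequent source of off-by-one errors in "users with zero orders" questions.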
Common question types: "find the nth highest salary," "calculate a 7-day rolling average," "identify users who performed action A before action B," "calculate retention rate." Practise these patterns specifically, since they recur across companies.
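Three of these patterns are sketched below, again using sqlite3 so the queries can be run locally. The table names, columns, and data are hypothetical, and each query is one possible approach rather than the only accepted answer. Window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ana', 90000), ('Ben', 120000), ('Cy', 120000), ('Dee', 75000);

    CREATE TABLE events (user_id INTEGER, action TEXT, ts TEXT);
    INSERT INTO events VALUES
        (1, 'A', '2024-01-01'), (1, 'B', '2024-01-03'),
        (2, 'B', '2024-01-02'), (2, 'A', '2024-01-04'),
        (3, 'A', '2024-01-05');

    CREATE TABLE daily_signups (day TEXT, signups INTEGER);
    INSERT INTO daily_signups VALUES
        ('2024-01-01', 7), ('2024-01-02', 7), ('2024-01-03', 7),
        ('2024-01-04', 7), ('2024-01-05', 7), ('2024-01-06', 7),
        ('2024-01-07', 7), ('2024-01-08', 14);
""")

# nth highest salary: DENSE_RANK gives tied salaries the same rank, so n = 2
# returns the second-highest distinct salary.
n = 2
nth_salary = conn.execute("""
    SELECT DISTINCT salary
    FROM (SELECT salary,
                 DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
          FROM employees)
    WHERE rnk = ?
""", (n,)).fetchall()
print(nth_salary)  # [(90000,)]

# Users whose first A happened before their first B.
a_before_b = conn.execute("""
    SELECT a.user_id
    FROM (SELECT user_id, MIN(ts) AS first_a
          FROM events WHERE action = 'A' GROUP BY user_id) a
    JOIN (SELECT user_id, MIN(ts) AS first_b
          FROM events WHERE action = 'B' GROUP BY user_id) b
      ON a.user_id = b.user_id
    WHERE a.first_a < b.first_b
    ORDER BY a.user_id
""").fetchall()
print(a_before_b)  # [(1,)]

# 7-day rolling average: the current row plus the 6 before it. This assumes
# exactly one row per calendar day; early rows average over fewer days.
rolling = conn.execute("""
    SELECT day,
           AVG(signups) OVER (ORDER BY day
                              ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
    FROM daily_signups
    ORDER BY day
""").fetchall()
print(rolling[-1])  # ('2024-01-08', 8.0)
```

Worth stating aloud in an interview: how ties should be handled (DENSE_RANK versus ROW_NUMBER), whether "A before B" means first occurrence or any occurrence, and what to do about missing days in the rolling window.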
For Python: be confident with pandas DataFrames (groupby, merge, pivot) and array operations, and be able to clean and wrangle a messy dataset. NumPy fundamentals, matplotlib for basic visualisation, and scikit-learn for modelling are standard.
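A compact warm-up covering the pandas operations named above might look like the following. It assumes pandas is installed; the data is invented, and filling the missing amount with 0 is an assumption made for the example (in a real task you would ask whether to fill, drop, or impute).

```python
import pandas as pd

# Hypothetical messy order data: inconsistent casing, stray whitespace,
# a duplicated row, and a missing amount.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "customer": [" ana", "Ben ", "Ben ", "ANA", "cy"],
    "month": ["Jan", "Jan", "Jan", "Feb", "Feb"],
    "amount": [20.0, 50.0, 50.0, None, 10.0],
})
regions = pd.DataFrame({
    "customer": ["Ana", "Ben", "Cy"],
    "region": ["North", "South", "North"],
})

# Clean: drop exact duplicate rows, normalise names, fill the missing amount.
clean = orders.drop_duplicates().copy()
clean["customer"] = clean["customer"].str.strip().str.title()
clean["amount"] = clean["amount"].fillna(0.0)

# merge: attach each customer's region (a left join keeps every order).
merged = clean.merge(regions, on="customer", how="left")

# groupby: total spend per region.
spend_by_region = merged.groupby("region")["amount"].sum()

# pivot: regions as rows, months as columns, summed amounts as values.
pivot = merged.pivot_table(index="region", columns="month",
                           values="amount", aggfunc="sum", fill_value=0)

print(spend_by_region)
print(pivot)
```

Cleaning before merging matters here: without the strip and title-case step, " ana" and "ANA" would fail to match "Ana" in the regions table and come back with a missing region.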
Behavioural and project questions
"Walk me through a data science project you're proud of." Structure: problem framing, data sources and quality, feature engineering decisions, model selection and why, evaluation approach, deployment or business outcome. Have specific metrics. "We improved X by Y%" is far stronger than "the model performed well."
"Tell me about a time your analysis led to a recommendation that wasn't implemented. How did you handle it?" Shows stakeholder awareness. The honest answer usually involves communication gaps, not technical failures. What did you learn about how to make analysis more actionable?
"How do you explain your methodology to non-technical stakeholders?" They're assessing communication skill, not just technical skill. Prepare examples of translating complex analysis into business language: visual summaries, clear recommendations rather than model outputs.