The foundation for ML interviews: bias-variance tradeoff, tree ensembles, evaluation at scale, class imbalance, feature engineering, and data leakage. Understand the why, not just the API calls.
The single most important framework for diagnosing model problems.
| Term | What it means | Symptom |
|---|---|---|
| Bias | Model too simple to capture the pattern | High training error and high validation error |
| Variance | Model too sensitive to training data noise | Low training error, high validation error (overfitting) |
| Irreducible error | Noise in the data itself — cannot be reduced | A floor on any model's error |
The tradeoff: Reducing bias typically increases variance and vice versa. The goal is to find the sweet spot.
Error
│
│   Total Error = Bias² + Variance + Irreducible
│
│  \  Bias²                            ↗ Variance
│   \          ← sweet spot →         /
│    \_______________________________/
│
└─────────────────────────────────────── Model Complexity
   (underfit)                   (overfit)
In interviews: When asked "your model performs badly," always diagnose:
- Both train/val bad → bias problem → more features, bigger model, less regularization
- Train good/val bad → variance problem → more data, regularization, simpler model
Techniques to prevent overfitting by penalizing model complexity.
| Technique | How it works | Effect | Use case |
|---|---|---|---|
| L2 (Ridge) | Add λ·Σw² to loss | Shrinks all weights toward zero, never exactly to zero | Default choice; handles correlated features well |
| L1 (Lasso) | Add λ·Σ\|w\| to loss | Drives many weights to exactly zero (sparse) | Feature selection; when you suspect many irrelevant features |
| Elastic Net | L1 + L2 combined | Sparse + grouped selection | High-dimensional data with correlated features |
| Dropout | Randomly zero out neurons during training | Ensemble effect, prevents co-adaptation | Neural networks |
| Early stopping | Stop when validation loss stops improving | Implicit regularization | Any iterative training |
Key insight: L1 is sparsity-inducing because the gradient of |w| is ±1 regardless of w's magnitude — it applies the same force whether w = 0.001 or w = 10. L2's gradient 2λw goes to zero as w approaches zero, so it never fully eliminates weights.
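A minimal sketch of this difference on synthetic data (the feature counts and α values are arbitrary): Lasso zeroes out most of the noise features, Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 100 features, only 10 actually informative
X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks weights, keeps all of them
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: drives irrelevant weights to exactly 0

print("Ridge zero weights:", np.sum(ridge.coef_ == 0))   # typically 0
print("Lasso zero weights:", np.sum(lasso.coef_ == 0))   # most of the 90 noise features
```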
Choosing the wrong metric is as bad as choosing the wrong model.
| Metric | Formula | When to use |
|---|---|---|
| Accuracy | (TP+TN)/(Total) | Balanced classes only; misleading otherwise |
| Precision | TP/(TP+FP) | When false positives are costly (spam filter — don't flag legit emails) |
| Recall | TP/(TP+FN) | When false negatives are costly (cancer detection — don't miss disease) |
| F1-Score | 2·(P·R)/(P+R) | Balance precision and recall; imbalanced classes |
| ROC-AUC | Area under ROC curve | Model ranking ability across all thresholds; good for imbalanced |
| PR-AUC | Area under Precision-Recall curve | Better than ROC-AUC for severely imbalanced datasets |
| Log Loss | -Σ y·log(p) | When calibrated probabilities matter (e.g., click-through rate) |
                  Predicted
              Positive | Negative
Actual Pos |     TP    |    FN      ← recall = TP/(TP+FN)
Actual Neg |     FP    |    TN
                  ↑
          precision = TP/(TP+FP)
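A quick sketch of computing these metrics with scikit-learn — the labels, predictions, and scores below are made up purely for illustration:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]      # hypothetical labels
y_pred  = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]      # hard predictions at threshold 0.5
y_score = [0.1, 0.2, 0.2, 0.3, 0.6, 0.4, 0.9, 0.8, 0.45, 0.7]  # predicted probabilities

print(confusion_matrix(y_true, y_pred))                  # rows = actual, cols = predicted
print("precision:", precision_score(y_true, y_pred))     # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("f1:       ", f1_score(y_true, y_pred))
print("roc_auc:  ", roc_auc_score(y_true, y_score))      # uses scores, not hard labels
```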
| Metric | Formula | Notes |
|---|---|---|
| MAE | mean(\|y - ŷ\|) | Robust to outliers; same units as target |
| RMSE | sqrt(mean((y-ŷ)²)) | Penalizes large errors more; sensitive to outliers |
| R² | 1 - SS_res/SS_tot | Fraction of variance explained; 1.0 is perfect |
| MAPE | mean(\|y - ŷ\| / y)·100 | Percentage error; fails when y = 0 |
Interview tip: Always ask "what's the cost of each error type?" before picking a metric.
Trees split data by the feature/threshold that maximizes information gain (classification) or minimizes MSE (regression).
- Splitting criteria:
  - Gini impurity: 1 - Σpᵢ² — computationally cheaper
  - Information Gain / Entropy: -Σpᵢ·log(pᵢ) — slightly better splits
- Overfitting control: `max_depth`, `min_samples_leaf`, `min_samples_split` (see the sketch below)
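A small sketch tying this together: the two impurity measures computed by hand, plus the depth and leaf-size controls on a scikit-learn tree (all values are arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini(p):
    """Gini impurity: 1 - sum(p_i^2)."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy: -sum(p_i * log2(p_i)), skipping zero-probability classes."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # 0.5, 1.0 — maximally impure node
print(gini([0.9, 0.1]), entropy([0.9, 0.1]))   # much lower — nearly pure node

# Overfitting control via hyperparameters
tree = DecisionTreeClassifier(criterion="gini", max_depth=4,
                              min_samples_leaf=20, min_samples_split=50)
```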
Understanding the distinction between these two ensemble strategies is one of the most common ML interview topics.
| Dimension | Bagging | Boosting |
|---|---|---|
| Build order | Parallel — independent models | Sequential — each model learns from previous errors |
| Goal | Reduce variance | Reduce bias |
| Data sampling | Bootstrap (random with replacement) | Weighted — misclassified samples get higher weight |
| Combining | Uniform average / majority vote | Weighted vote (better models count more) |
| Overfitting risk | Low — averaging smooths noise | Higher — can overfit noisy data |
| Best for | High-variance models (deep trees) | High-bias models (shallow trees / stumps) |
| Main algorithm | Random Forest | AdaBoost, Gradient Boosting, XGBoost |
Intuition: Bagging is like asking 100 independent experts and taking a majority vote — outlier opinions cancel out. Boosting is like having each new expert focus specifically on the cases the previous experts got wrong.
Bagging:
Data → [Sample 1] → Tree 1 ─┐
→ [Sample 2] → Tree 2 ─┤→ Average/Vote → Final prediction
→ [Sample 3] → Tree 3 ─┘
Boosting:
Data → Tree 1 → Find errors → reweight
→ Tree 2 (focuses on errors of Tree 1) → Find errors → reweight
→ Tree 3 (focuses on errors of Trees 1+2) → ...
→ Weighted combination → Final prediction
AdaBoost in brief: After each weak learner, increase weights of misclassified samples exponentially. Each learner's vote is weighted by 0.5 · ln((1-error)/error) — better learners get louder voices.
Why does boosting reduce bias? Each sequential model explicitly targets the residual error of the current ensemble. The final model can capture complex patterns that a single high-bias model cannot.
Bagging (Bootstrap Aggregating) + random feature selection = Random Forest.
Training data
│ Bootstrap sample (with replacement) × n_estimators
▼
[Tree 1] [Tree 2] ... [Tree N] ← each trained on ~63% of data
│ │ │ each split: random subset of features
└────────┴───────────────┘
│
Majority vote (classification) / Average (regression)
- Key hyperparameters: `n_estimators`, `max_features` (sqrt for classification; log2 or all for regression), `max_depth` — see the sketch below
- Out-of-bag (OOB) error: ~37% of the data is not used by each tree → free validation signal
- Feature importance: Mean decrease in impurity across all trees
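A minimal sketch of these knobs in scikit-learn, including the free OOB validation signal (the dataset and parameter values are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,      # more trees = lower variance, slower training
    max_features="sqrt",   # random feature subset at each split
    max_depth=None,        # deep trees are fine — bagging absorbs the variance
    oob_score=True,        # evaluate each tree on its ~37% held-out samples
    random_state=0,
).fit(X, y)

print("OOB accuracy:", rf.oob_score_)                     # free validation estimate
print("Top importances:", rf.feature_importances_[:5])    # mean decrease in impurity
```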
Boosting builds trees sequentially, each one correcting the errors of the previous ensemble.
F₀(x) = base prediction (e.g., mean)
F₁(x) = F₀(x) + η·h₁(x) ← h₁ fits residuals of F₀
F₂(x) = F₁(x) + η·h₂(x) ← h₂ fits residuals of F₁
...
Where η = learning rate (shrinkage).
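To make the update rule concrete, here is a toy gradient-boosting loop for squared-error regression — each shallow tree fits the residuals of the current ensemble. Purely illustrative, not how you'd implement it in practice:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_rounds=100, eta=0.1, max_depth=2):
    """Toy gradient boosting for squared error: each tree fits the current residuals."""
    F = np.full(len(y), y.mean())          # F0: base prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - F                  # negative gradient of squared error
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F = F + eta * h.predict(X)         # F_m = F_{m-1} + eta * h_m
        trees.append(h)
    return y.mean(), trees

def predict_gbm(base, trees, X, eta=0.1):
    return base + eta * sum(t.predict(X) for t in trees)
```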
| | Random Forest | Gradient Boosting |
|---|---|---|
| Training | Parallel | Sequential |
| Speed | Faster to train | Slower; XGBoost/LightGBM are highly optimized |
| Performance | Good baseline | Usually wins on tabular data |
| Overfitting | Hard to overfit | Can overfit; needs careful tuning |
| Key params | n_estimators, max_features | n_estimators, learning_rate, max_depth, subsample |
XGBoost vs LightGBM:
| | XGBoost | LightGBM |
|---|---|---|
| Split strategy | Level-wise (breadth-first) | Leaf-wise (best-first) |
| Speed | Solid | Faster for large datasets |
| Memory | Higher | Lower (histogram binning) |
| Best for | General purpose | Large datasets, high cardinality categoricals |
A probabilistic classifier based on Bayes' theorem with the naive assumption that features are conditionally independent given the class.
P(y|x₁,...,xₙ) ∝ P(y) · ∏ P(xᵢ|y)
The independence assumption is almost never true in practice (e.g., word frequencies in text are correlated), yet Naive Bayes works surprisingly well because:
- Classification only needs the most probable class, not calibrated probabilities
- Errors in individual feature probabilities often cancel out
- With limited data, fewer parameters to estimate = lower variance
| Variant | Distribution assumed | Use case |
|---|---|---|
| Gaussian NB | Features are continuous, normally distributed | Real-valued features |
| Multinomial NB | Features are counts (word frequencies) | Text classification (bag-of-words, TF-IDF) |
| Bernoulli NB | Features are binary (word present/absent) | Short text, binary features |
| Complement NB | Corrects for imbalanced classes | Imbalanced text classification |
- Text classification (spam detection, sentiment) — the classic use case
- Very fast training and inference: O(n·d) where n = samples, d = features
- Works well with high-dimensional sparse data (thousands of features, few samples)
- Strong baseline that's hard to beat with small datasets
- Highly correlated features violate the independence assumption badly
- Cannot learn feature interactions (unlike trees or neural networks)
- Probability estimates are poorly calibrated (confidences are extreme)
Interview tip: Naive Bayes is the go-to answer for "What's a simple baseline for text classification?" It's also a great example of a high-bias, low-variance model — complements the bias-variance discussion.
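A minimal text-classification sketch with Multinomial NB on a made-up toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, purely for illustration
texts  = ["win free money now", "meeting at 3pm tomorrow",
          "free lottery prize claim", "project status update"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = ham

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))   # likely [1]
```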
A non-parametric, instance-based algorithm: store all training data, classify new points by majority vote of their k nearest neighbors.
1. Compute distance from query point to all training points
2. Select k nearest neighbors
3. Classification: majority vote | Regression: average of k values
| Metric | Formula | Use case |
|---|---|---|
| Euclidean | √(Σ(xᵢ - yᵢ)²) | Default; continuous features |
| Manhattan | Σ\|xᵢ - yᵢ\| | More robust to outliers than Euclidean |
| Cosine | 1 - (x·y)/(‖x‖·‖y‖) | Text/embeddings (direction matters, not magnitude) |
| Minkowski | (Σ\|xᵢ - yᵢ\|^p)^(1/p) | Generalizes Euclidean (p = 2) and Manhattan (p = 1) |
- No training phase — all computation at inference ("lazy learning")
- Feature scaling is critical — features with larger ranges dominate distance calculations
- Curse of dimensionality — in high dimensions, all points become equidistant; KNN breaks down
- Choosing k: Small k → low bias, high variance (noisy). Large k → high bias, low variance (smooth). Use cross-validation. Odd k avoids ties in binary classification.
- Small datasets where simplicity matters
- Non-linear decision boundaries
- As a baseline before trying complex models
- When interpretability via examples matters ("here are the 5 most similar cases")
- Large datasets (inference is O(n·d) per query)
- High-dimensional data (curse of dimensionality)
- When features are on different scales (unless you scale first)
Optimization: KD-Trees or Ball Trees reduce search from O(n) to O(log n) in low dimensions. For high-dimensional approximate search, use ANN algorithms (FAISS, HNSW).
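A small sketch of the two practical musts — scale features first and tune k by cross-validation (the dataset and grid are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),             # distances are meaningless without this
    ("knn", KNeighborsClassifier()),
])
grid = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7, 11, 21]}, cv=5)
grid.fit(X, y)
print("best k:", grid.best_params_, "CV accuracy:", grid.best_score_)
```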
SVMs find the maximum-margin hyperplane separating classes. The margin is the gap between the plane and the nearest data points (support vectors).
○ ○ ○
○ ←── Support vectors
───────────── ← optimal hyperplane (max margin)
● ←── Support vectors
● ● ●
Real data is rarely linearly separable. Kernels map data to higher dimensions without computing the transformation explicitly.
| Kernel | Formula | Use case |
|---|---|---|
| Linear | x·x' | Linearly separable data; text classification (high-dim sparse) |
| RBF (Gaussian) | exp(-γ‖x - x'‖²) | Most common; non-linear; sensitive to feature scaling |
| Polynomial | (γ·x·x' + r)^d | Image features, NLP |
Key hyperparameters:
- `C`: regularization strength. High C → hard margin (low bias, high variance). Low C → soft margin (high bias, low variance).
- `γ` (RBF): high γ → small influence radius → complex boundaries (overfit). Low γ → smoother boundaries.
When to use SVM: Small-to-medium datasets, high-dimensional data (text), when interpretability is not required.
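A minimal RBF-SVM sketch with the two hyperparameters above; the values shown are just starting points, not recommendations:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scaling matters a lot for the RBF kernel; C and gamma usually need tuning
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale"),
)
# svm.fit(X_train, y_train); svm.score(X_val, y_val)
```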
Despite the name, it's a classification algorithm. The output is a probability via the sigmoid function.
P(y=1|x) = σ(w·x + b) = 1 / (1 + exp(-(w·x + b)))
- Loss: binary cross-entropy = `-y·log(p) - (1-y)·log(1-p)`
- Multiclass: softmax extends this: `P(y=k|x) = exp(wₖ·x) / Σⱼ exp(wⱼ·x)`
Why use it?
- Highly interpretable: coefficient = log-odds change per unit increase in feature
- Fast, scales to millions of samples
- Calibrated probabilities out of the box
- Strong baseline for any classification problem
Assumptions: a linear relationship between features and the log-odds, no severe multicollinearity, features roughly scaled (important for regularized fits and convergence).
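A sketch of fitting and reading coefficients as log-odds, here on scikit-learn's built-in breast-cancer dataset (chosen only for convenience):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

coefs = clf[-1].coef_[0]
# Each coefficient is the change in log-odds per 1-std-dev increase of that feature;
# exp(coef) is the multiplicative change in the odds.
print("odds ratio of strongest feature:", np.exp(coefs[np.argmax(np.abs(coefs))]))
print("predicted P(y=1) for first sample:", clf.predict_proba(X[:1])[0, 1])
```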
Unsupervised algorithm that partitions n observations into k clusters by minimizing within-cluster variance.
Algorithm:
1. Initialize k centroids randomly (or K-means++ for better initialization)
2. Assign each point to nearest centroid
3. Recompute centroids as mean of assigned points
4. Repeat 2-3 until convergence
Choosing k — The Elbow Method: Plot inertia (within-cluster sum of squares) vs k. The "elbow" is where adding more clusters gives diminishing returns. Also use Silhouette Score (−1 to 1, higher = better separation).
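A sketch of scanning k with inertia and silhouette score (the blob data and range of k are arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    print(k, "inertia:", round(km.inertia_, 1),
          "silhouette:", round(silhouette_score(X, km.labels_), 3))
# Look for the inertia "elbow" and the highest silhouette — here both point to k = 4
```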
Limitations:
- Assumes spherical, equal-sized clusters
- Sensitive to outliers (consider K-Medoids instead)
- Must specify k upfront
- Fails on non-convex cluster shapes (use DBSCAN for those)
PCA finds the directions (principal components) of maximum variance in the data and projects data onto them.
Original features (high-dim) → PCA transformation → New axes (lower-dim)
How it works:
- Standardize features (zero mean, unit variance)
- Compute covariance matrix
- Eigen-decompose to get eigenvectors (principal components) and eigenvalues (variance explained)
- Keep top-k components
Key concepts:
- Explained variance ratio: How much total variance does each component capture? Choose k where cumulative variance ≥ 95%.
- Linear transformation only: PCA cannot capture non-linear structure (use t-SNE / UMAP for visualization)
When to use: Remove multicollinearity before linear models, speed up training, reduce storage. Not great for interpretability.
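A sketch of the standardize → fit → keep-95%-variance workflow, using the digits dataset only as a stand-in:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)            # 64 pixel features
X_std = StandardScaler().fit_transform(X)      # PCA is variance-based: standardize first

pca = PCA(n_components=0.95)                   # keep enough components for 95% variance
X_reduced = pca.fit_transform(X_std)

print("components kept:", pca.n_components_)
print("cumulative variance:", np.cumsum(pca.explained_variance_ratio_)[-1])
```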
Robust way to estimate generalization performance without wasting data.
| Method | How it works | Use case |
|---|---|---|
| k-Fold CV | Split into k folds; train on k-1, test on 1; repeat | Standard; k=5 or 10 |
| Stratified k-Fold | Preserve class distribution in each fold | Imbalanced classification |
| Leave-One-Out (LOO) | k = n; expensive but lowest bias | Very small datasets |
| Time-Series Split | Only use past data to predict future | Any temporal data — critical to avoid leakage |
Train / Validation / Test split philosophy:
- Train: model learns from this
- Validation: hyperparameter tuning, model selection
- Test: final unbiased evaluation — touch it once
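A sketch of this philosophy in scikit-learn — hold out a test set first, then use stratified k-fold on the rest (sizes and the model are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Hold out a final test set first — touch it once, at the very end
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2,
                                                stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # preserves class ratio
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X_dev, y_dev, cv=cv, scoring="f1")
print("CV F1 per fold:", scores.round(3), "mean:", scores.mean().round(3))
```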
Real-world datasets are often skewed (e.g., fraud detection: 0.1% positive).
| Strategy | How | When to use |
|---|---|---|
| Class weights | Penalize minority errors more in loss: `weight = n_majority / n_minority` | Always try first — zero cost |
| Oversampling (SMOTE) | Synthesize new minority samples by interpolating between existing ones | Moderate imbalance; avoid for high-dimensional data |
| Undersampling | Randomly remove majority samples | When majority class is huge; risks losing information |
| Threshold tuning | Move decision boundary from 0.5 to optimize precision/recall | Any probabilistic classifier |
| Focal Loss | Down-weight easy examples so training focuses on hard ones | Neural networks with severe imbalance (object detection, fraud) |
| Use PR-AUC / F1 | ROC-AUC hides imbalance problems | Always with imbalanced data |
SMOTE in brief: For each minority sample, find k nearest minority neighbors, generate synthetic sample along the connecting line segment.
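A sketch using the third-party imbalanced-learn package (a separate install from scikit-learn); the essential point is that SMOTE is applied only to the training split:

```python
from imblearn.over_sampling import SMOTE           # third-party: imbalanced-learn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.97, 0.03], random_state=0)

# Split FIRST, then oversample only the training fold — never the validation data
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_train, y_train)
print("positive rate before:", y_train.mean().round(3), "after:", y_res.mean().round(3))
```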
Focal Loss in depth: Standard cross-entropy treats all examples equally. With a 99:1 imbalance, the model learns to predict "majority" for everything and still achieves low loss — the minority signal drowns out.
Standard CE: L = -log(p)
Focal Loss: L = -α · (1 - p)^γ · log(p)
- `p` = predicted probability for the correct class
- `(1 - p)^γ` = modulating factor — when p is high (easy example, already well-classified), this term → 0, effectively ignoring it
- `γ` (focusing parameter, typically 2): higher γ = stronger down-weighting of easy examples
- `α` (class weight): balances the class-frequency difference, same role as `class_weight` in sklearn
Concrete effect: With γ = 2, an easy example with p = 0.9 gets loss weight (1-0.9)² = 0.01 — 100× less than a hard example with p = 0.1. The network spends its gradient budget learning the hard, rare cases.
Originally introduced in RetinaNet for object detection (background vastly outnumbers foreground objects). It is a drop-in replacement for cross-entropy in any neural network; standard tree libraries don't ship it, though it can be supplied as a custom objective.
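A minimal NumPy sketch of the binary focal-loss formula above, just to show the down-weighting effect; real training code would implement this on framework tensors:

```python
import numpy as np

def focal_loss(y_true, p_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: -alpha * (1 - p_t)^gamma * log(p_t)."""
    p_pred = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)   # probability of the correct class
    return -alpha * (1 - p_t) ** gamma * np.log(p_t)

# Easy example (p_t = 0.9) vs hard example (p_t = 0.1):
# orders of magnitude more loss — and gradient — for the hard, rare case
print(focal_loss(np.array([1]), np.array([0.9])))   # tiny
print(focal_loss(np.array([1]), np.array([0.1])))   # large
```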
Often more impactful than algorithm choice.
| Method | How | Use case |
|---|---|---|
| One-Hot Encoding | Binary column per category | Low cardinality (< 20 categories), linear models |
| Label Encoding | Map category to integer | Tree models only (they handle arbitrary orderings) |
| Target Encoding | Replace category with mean target value | High cardinality; use with regularization to avoid leakage |
| Hashing | Map to fixed-size vector via hash function | Very high cardinality, online learning |
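A sketch of leakage-aware target encoding with pandas — statistics come from the training split only and are smoothed toward the global mean. The column names and smoothing weight are hypothetical:

```python
import pandas as pd

def target_encode(train, val, col, target, smoothing=10.0):
    """Encode a categorical column with a smoothed target mean, fit on train only."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Shrink rare categories toward the global mean to reduce overfitting/leakage
    enc = (stats["mean"] * stats["count"] + global_mean * smoothing) / (stats["count"] + smoothing)
    train_enc = train[col].map(enc)
    val_enc = val[col].map(enc).fillna(global_mean)   # unseen categories → global mean
    return train_enc, val_enc

# Hypothetical usage:
# train["city_te"], val["city_te"] = target_encode(train, val, "city", "churned")
```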
The mechanism of missingness matters as much as the strategy.
Missing data mechanisms:
- MCAR (Missing Completely At Random): Missingness is unrelated to any variable — a sensor randomly fails. Safe to impute simply.
- MAR (Missing At Random): Missingness depends on observed variables — older people skip income questions, but given age, the missingness is random. Can impute using other features.
- MNAR (Missing Not At Random): Missingness depends on the missing value itself — rich people hide income. Simple imputation is biased; need domain knowledge or model the missingness.
| Strategy | How | When |
|---|---|---|
| Mean imputation | Replace with column mean | MCAR numeric; fast baseline; distorts distribution |
| Median imputation | Replace with column median | MCAR numeric with outliers — more robust than mean |
| Mode imputation | Replace with most frequent value | MCAR categorical |
| KNN imputation | Use k nearest neighbors' values | MAR; preserves local relationships; slow at scale |
| Iterative imputation (MICE) | Regress each missing column on others; iterate | MAR; best quality; expensive |
| Model imputation | Train a model to predict missing column | MNAR; complex but powerful |
| Add indicator column | Binary flag: `col_missing = 1 if NaN` | When missingness itself is predictive (often is) |
| Drop rows | Remove rows with missing values | Only if < 1-5% missing AND MCAR AND you have enough data |
| Tree-native handling | XGBoost/LightGBM learn optimal direction for NaN | Best for tree models — don't impute, just pass NaN |
Critical rule: Fit imputers on training data only, then apply to validation/test. Fitting on the full dataset leaks test distribution — a subtle form of data leakage.
```python
# Correct
imputer.fit(X_train)
X_train_imp = imputer.transform(X_train)
X_val_imp = imputer.transform(X_val)   # use train statistics only

# Wrong — data leakage
imputer.fit(X_all)                     # sees test data statistics
```
- StandardScaler: Zero mean, unit variance. Required for: SVM, logistic regression, PCA, KNN.
- MinMaxScaler: Scale to [0,1]. Use when data has bounded range or for neural networks.
- RobustScaler: Uses median and IQR. Use when outliers are present.
- Trees don't need scaling — splits are invariant to monotone transformations of a feature.
Data leakage occurs when information from outside the training window "leaks" into the model, producing optimistically biased evaluation metrics that don't hold in production. It's one of the most common and costly mistakes in applied ML.
1. Target Leakage (most common) A feature that is a direct consequence of the target — it's only available after the outcome is known.
Example: Predicting loan default
Leaked feature: "loan_was_restructured" (only happens after default)
Without it: model seems mediocre
With it: model gets 99% accuracy in training; 50% in production
2. Train-Test Contamination Test data influences the training pipeline in any way.
| Leaky practice | Why it's wrong | Fix |
|---|---|---|
| `StandardScaler.fit(X_all)` | Test mean/std leaks into train scaling | Fit on train only |
| `imputer.fit(X_all)` | Test distribution leaks into imputation | Fit on train only |
| Feature selection on full dataset | Model "knows" test feature variance | Select inside CV fold |
| Hyperparameter search on test set | Overfits test set | Use held-out validation |
| SMOTE on full dataset before split | Synthetic samples from test in training | SMOTE only on train fold |
3. Temporal Leakage Using future data to predict the past — extremely common with time-series.
Example: Predicting stock price tomorrow using a moving average
Leaky: 7-day moving average includes days after prediction date
Correct: Only use data strictly before prediction timestamp
4. Group Leakage Data from the same entity appears in both train and test.
Example: Medical image classification
Leaky: 5 images from same patient in both train and test
Correct: Split by patient_id, not by image
- Suspiciously high validation accuracy (> 99% on a hard problem)
- Feature importance dominated by a single feature you wouldn't have at prediction time
- Temporal patterns: feature timestamp is after event timestamp
- Drop feature → big accuracy drop: strong sign it's leaky
- Shuffle the target labels and retrain → still high accuracy: the model is exploiting row order/index, not the features
□ Split train/test BEFORE any preprocessing
□ Fit all transformers (scaler, imputer, encoder) on train fold only
□ For time-series: always split by time, not randomly
□ For groups (patients, users): split by group_id
□ Ask for every feature: "Is this available at prediction time?"
□ Wrap preprocessing in a Pipeline to enforce train-only fitting
□ SMOTE/oversampling only inside training fold
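A sketch that enforces several checklist items at once — preprocessing lives inside a Pipeline so it is re-fit on every training fold, and splitting is done by group. The group ids here are synthetic stand-ins for something like patient_id:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)
groups = np.random.default_rng(0).integers(0, 100, size=len(y))   # e.g. patient_id

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fit on the training fold only
    ("scale", StandardScaler()),                    # ditto — no test statistics leak in
    ("clf", LogisticRegression(max_iter=1000)),
])

# GroupKFold keeps all rows from the same entity in the same fold
scores = cross_val_score(pipe, X, y, cv=GroupKFold(n_splits=5), groups=groups)
print(scores.round(3))
```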
When your dataset has many features (p >> n, or p in the thousands), naive ML breaks down.
- Curse of dimensionality: In high-dim space, all points become equidistant → KNN, clustering, distance-based methods fail
- Overfitting: More features than samples → model memorizes training data
- Multicollinearity: Highly correlated features → unstable coefficient estimates for linear models
- Computational cost: Training and inference slow down
- Irrelevant features: Noise features can hurt performance more than they help
Remove features before any model is trained. Fast, model-agnostic.
| Method | How | Best for |
|---|---|---|
| Variance threshold | Remove near-zero variance features | Any — quick cleanup |
| Correlation filter | Remove one of any two features with \|corr\| > 0.95 | Linear models |
| Chi-squared test | For categorical features vs categorical target | Text, categoricals |
| Mutual Information | Non-linear dependency between feature and target | Any type |
| ANOVA F-test | Linear association between numeric feature and target | Numeric features |
Use a model's performance to select features. More accurate but expensive.
| Method | How |
|---|---|
| RFE (Recursive Feature Elimination) | Train model, remove lowest-importance feature, repeat |
| Forward selection | Start with 0 features, add one at a time (best gain) |
| Backward elimination | Start with all features, remove one at a time |
Feature selection happens as part of the model fitting itself.
| Method | How |
|---|---|
| L1 Regularization (Lasso) | Drives irrelevant feature weights to exactly 0 |
| Tree-based importance | Random Forest / XGBoost: use feature importance scores |
| ElasticNet | L1 + L2 — sparse but handles correlated features |
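A sketch of embedded selection via L1, using SelectFromModel; the dataset is synthetic and the cross-validated Lasso is just one reasonable choice:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1000 features, only 15 carry signal
X, y = make_regression(n_samples=300, n_features=1000, n_informative=15,
                       noise=10.0, random_state=0)

selector = make_pipeline(
    StandardScaler(),
    SelectFromModel(LassoCV(cv=5)),   # keep only features with non-negligible L1 weights
)
X_selected = selector.fit_transform(X, y)
print("kept", X_selected.shape[1], "of", X.shape[1], "features")
```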
Transform features into a smaller set of derived features.
| Method | Type | Use case |
|---|---|---|
| PCA | Linear | Remove multicollinearity; speed up linear models |
| t-SNE | Non-linear | Visualization only — not for preprocessing |
| UMAP | Non-linear | Visualization + preprocessing; faster than t-SNE |
| Autoencoders | Deep learning | Complex non-linear structure; image/text features |
| TruncatedSVD (LSA) | Linear | Sparse matrices (text/TF-IDF); memory-efficient |
p > 10,000 features?
├── Yes → Start with variance threshold + correlation filter (fast wins)
│ → Then L1 / RFE / tree importance
│ → PCA as last resort (loses interpretability)
└── No → Tree-based importance + SHAP for selection
→ Correlation filter if linear model
n < p (more features than samples)?
├── Use regularized models (Ridge, Lasso, ElasticNet)
├── PCA to reduce to n < p before any model
└── Never use unregularized logistic regression or OLS
| Method | How | Pros/Cons |
|---|---|---|
| Grid Search | Try every combination | Exhaustive; exponential in dimensionality |
| Random Search | Sample random combinations | Surprisingly effective; more efficient than grid for >3 params |
| Bayesian Optimization | Build surrogate model of objective, sample where improvement expected | Most sample-efficient; use Optuna or Hyperopt |
| Halving / Successive Elimination | Early-stop bad configs | Fast; built into scikit-learn |
Rule of thumb: Random search beats grid search in high-dimensional parameter spaces (Bergstra & Bengio, 2012). Use Bayesian optimization when evaluation is expensive.
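A sketch of random search over a gradient-boosting model; the distributions and budget are arbitrary:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(100, 500),
        "learning_rate": uniform(0.01, 0.3),   # samples from [0.01, 0.31)
        "max_depth": randint(2, 6),
        "subsample": uniform(0.6, 0.4),        # samples from [0.6, 1.0)
    },
    n_iter=30, cv=5, scoring="roc_auc", random_state=0, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```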
Understanding why a model makes a prediction is increasingly critical — for debugging, regulatory compliance, and stakeholder trust. This is one of the hottest ML interview topics.
| Method | Scope | How it works | Pros | Cons |
|---|---|---|---|---|
| Impurity-based importance | Global | Mean decrease in Gini/entropy across all splits | Fast; built into sklearn | Biased toward high-cardinality features |
| Permutation importance | Global | Shuffle feature, measure accuracy drop | Model-agnostic; unbiased | Misleading with correlated features |
| SHAP (SHapley Additive exPlanations) | Local + Global | Game-theory: contribution of each feature to each prediction | Mathematically grounded; consistent | Expensive for large models |
| LIME (Local Interpretable Model-Agnostic Explanations) | Local | Fit a simple model (linear/tree) on perturbed samples near the point | Fast; intuitive | Explanations can be unstable |
| Partial Dependence Plots (PDP) | Global | Show marginal effect of a feature on prediction | Easy to understand | Assumes feature independence |
| ICE (Individual Conditional Expectation) | Local | PDP for individual instances | Shows heterogeneity | Can be noisy with many lines |
SHAP values are based on Shapley values from cooperative game theory. For each prediction, each feature gets a value representing its contribution to the difference between the prediction and the average prediction.
f(x) = base_value + SHAP(feature₁) + SHAP(feature₂) + ... + SHAP(featureₙ)
Example (loan approval):
Base prediction: 65% approval
+ Income: +20% (high income helps)
+ Credit score: +10% (good score helps)
+ Debt ratio: -15% (high debt hurts)
= Final: 80% approval
Key properties:
- Additivity: SHAP values sum to the difference between prediction and mean
- Consistency: If a feature's contribution increases, its SHAP value never decreases
- Local accuracy: The explanation exactly matches the model output
TreeSHAP: Optimized O(TLD²) algorithm for tree models (vs exponential for exact Shapley). Built into XGBoost and LightGBM.
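A sketch using the third-party shap and xgboost packages (both assumed installed); exact plotting calls can vary between shap versions:

```python
import shap                        # third-party package
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = xgb.XGBClassifier(n_estimators=200, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)       # fast, exact Shapley values for trees
shap_values = explainer.shap_values(X)      # one value per (sample, feature)

shap.summary_plot(shap_values, X)           # global view: beeswarm of feature impacts
# shap.force_plot(explainer.expected_value, shap_values[0], X[0])  # single prediction
```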
1. Pick instance to explain
2. Generate perturbed samples around it
3. Get model predictions for perturbed samples
4. Fit a simple interpretable model (weighted by proximity)
5. The simple model's coefficients = explanation
When to use LIME over SHAP: When you need fast, approximate explanations for black-box models (neural nets, APIs) and SHAP is too expensive.
Interview tip: "For tree models, I'd use TreeSHAP (fast, exact). For neural networks or black-box APIs, I'd use LIME or KernelSHAP. For global understanding, I'd plot SHAP summary plots (beeswarm). For regulatory compliance, SHAP's mathematical guarantees (consistency, additivity) are stronger than LIME."
A model's predicted probability should match the true frequency. If a model predicts 80% confidence for 1000 samples, roughly 800 should actually be positive. Many models are not well-calibrated out of the box.
- Decision thresholds: If probabilities are miscalibrated, tuning the classification threshold produces unexpected results
- Risk scoring: In medical, fraud, or insurance applications, probabilities directly drive decisions
- Model combination: Ensembling requires calibrated scores to weight models properly
- Ranking is not enough: A model can rank well (high AUC) but have terrible probability estimates
| Model | Calibration quality | Why |
|---|---|---|
| Logistic Regression | Good (inherently calibrated) | Optimizes log loss directly |
| Random Forest | Poor — overconfident near 0/1 | Votes are discrete fractions |
| XGBoost / GBMs | Moderate — depends on objective | Log loss objective helps |
| Neural Networks | Often overconfident | Modern NNs tend to be miscalibrated |
| SVM | No probabilities by default | Needs Platt scaling to output probs |
| Naive Bayes | Very poor — extreme probabilities | Independence assumption distorts |
| Method | How it works | When to use |
|---|---|---|
| Platt Scaling | Fit a logistic regression on model outputs vs true labels | Binary classification; works well with sigmoid-shaped distortions |
| Isotonic Regression | Fit a non-decreasing step function | More flexible; needs more data (~1000+ samples) |
| Temperature Scaling | Divide logits by learned T before softmax | Neural networks; single parameter; multiclass |
- Reliability diagram: Plot predicted probability vs observed frequency in bins. Perfect calibration = diagonal line.
- Expected Calibration Error (ECE): Weighted average of |predicted_prob - observed_freq| across bins. Lower = better.
- Brier Score: `mean((predicted - actual)²)` — combines calibration and discrimination.
Reliability Diagram:

Observed freq
1.0 |           · /
    |        ·  /
0.5 |      ·  /      ← well-calibrated (near diagonal)
    |    · /
0.0 |· /
    └─────────────
    0.0         1.0
      Predicted probability
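A sketch of measuring and fixing calibration in scikit-learn — Brier score before and after Platt (sigmoid) scaling, plus the points for a reliability diagram. Model and sizes are placeholders:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("raw Brier:", round(brier_score_loss(y_test, rf.predict_proba(X_test)[:, 1]), 4))

# Wrap with sigmoid (Platt) calibration, fit via internal cross-validation
calibrated = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                                    method="sigmoid", cv=5).fit(X_train, y_train)
print("calibrated Brier:",
      round(brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1]), 4))

# Points for a reliability diagram: observed frequency vs mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_test, calibrated.predict_proba(X_test)[:, 1], n_bins=10)
```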
Interview tip: "I'd check calibration with a reliability diagram and ECE. If the model is overconfident (most are), I'd apply Platt scaling for binary or temperature scaling for multiclass. Calibration is essential when probabilities drive business decisions — 'is this user 90% likely to churn?' needs to actually mean 90%."
Beyond bagging and boosting, stacking (stacked generalization) is a powerful ensemble technique that uses a meta-learner to combine base model predictions.
Training:
Fold 1: Train Models A,B,C on folds 2-5 → predict on fold 1
Fold 2: Train Models A,B,C on folds 1,3-5 → predict on fold 2
...
Result: out-of-fold predictions from each base model
Meta-learner: Train on [pred_A, pred_B, pred_C] → final prediction
Inference:
Input → Model A prediction ─┐
→ Model B prediction ─┤→ Meta-learner → final prediction
→ Model C prediction ─┘
Key rules:
- Base models should be diverse (e.g., Random Forest + XGBoost + Logistic Regression + Neural Net)
- Use out-of-fold predictions for meta-learner training to avoid data leakage
- Meta-learner is typically simple (logistic regression, ridge regression) to avoid overfitting
- Can stack multiple levels (but diminishing returns after 2)
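A sketch with scikit-learn's StackingClassifier, which generates the out-of-fold predictions for the meta-learner internally; the base models are chosen arbitrarily:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),   # simple meta-learner to avoid overfitting
    cv=5,                                   # out-of-fold predictions for the meta-learner
)
print(cross_val_score(stack, X, y, cv=3).round(3))
```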
| Bagging | Boosting | Stacking | |
|---|---|---|---|
| Base models | Same type, different data | Same type, sequential | Different types |
| Training | Parallel | Sequential | Two-stage |
| Reduces | Variance | Bias | Both (via diversity) |
| Example | Random Forest | XGBoost | RF + XGB + LR → meta |
A lighter variant (often called blending): use a simple holdout set, instead of out-of-fold cross-validation predictions, to train the meta-learner. Faster, but wastes data.
Interview tip: Stacking typically wins Kaggle competitions but adds complexity in production. In interviews, mention it to show depth but note the operational trade-off.
Learning curves plot model performance vs training set size or training iterations. They're the primary diagnostic tool for bias-variance analysis.
Error
│
│ ─── Training error (rises as data grows — harder to memorize)
│
│ ─── Validation error (drops as data grows — better generalization)
│
│ Gap = variance
└─────────────────────────── Training set size
High Bias (underfitting):                 High Variance (overfitting):

Error                                     Error
│ val   ──────────────                    │ val  ──────────
│                                         │        \  ← large gap
│ train ──────────────                    │ train ──
│  ↑ both high, close gap                 │  ↑ small train err, big gap
└──────────────────                       └──────────────────
  Training set size                         Training set size
| Pattern | Diagnosis | Action |
|---|---|---|
| Both errors high, converging | High bias | More features, bigger model, less regularization |
| Train low, val high, not converging | High variance | More data, regularization, simpler model |
| Both errors low, converging | Good fit | Deploy |
| Train error increasing with data | Normal — harder to memorize more data | Not a problem |
| Val error not improving with more data | Model has learned all it can from this feature set | Better features, different algorithm |
```python
import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5
)
```
Interview tip: When asked "how would you debug a model that's performing poorly?", learning curves should be your first tool. They tell you whether you need more data (variance problem) or a better model (bias problem).
"Your model is overfitting, what do you do?" → More training data → add regularization (L2) → reduce model complexity → dropout / early stopping → feature selection to remove noise
"Your model has high bias, what do you do?" → More features / better features → bigger model → reduce regularization → add polynomial/interaction terms
"How do you handle imbalanced data?" → class_weight='balanced' first → tune threshold → SMOTE if needed → use F1/PR-AUC metrics
"Explain XGBoost vs Random Forest" → RF: parallel bagging, each tree independent, robust, slower to improve with more trees. XGBoost: sequential boosting, each tree fixes previous errors, usually higher accuracy on tabular data, more hyperparameters to tune.
"What feature importance methods exist?" → Tree-based (mean decrease impurity — fast but biased toward high cardinality), Permutation importance (model-agnostic, unbiased), SHAP values (local + global, most reliable).
"What is data leakage and how do you prevent it?" → Leakage = information from outside the training window contaminates the model. Types: target leakage (feature only available after the event), train-test contamination (scaler/imputer fit on all data), temporal leakage (using future data), group leakage (same user in train and test). Prevention: split first, then fit transformers on train only, wrap in Pipeline, always ask "is this feature available at prediction time?"
"You have 50,000 features. What do you do?" → First, remove near-zero variance features and high-correlation pairs (fast filter). Then L1 regularization to get sparse weights, or tree-based importance to rank and prune. Use PCA to reduce to manageable dimensions if linear model needed. Never use unregularized models when p >> n.
"Bagging vs Boosting — what does each reduce?" → Bagging reduces variance by averaging independent models (Random Forest). Boosting reduces bias by sequentially fixing errors (XGBoost). Bagging is parallelizable; Boosting is sequential. Use Random Forest as a fast robust baseline; use XGBoost when you need maximum accuracy on tabular data.
"When would you use Naive Bayes?" → Text classification (spam, sentiment) with bag-of-words/TF-IDF features. It's fast (O(n·d) training), works well with high-dimensional sparse data, and gives a strong baseline that's surprisingly hard to beat on small text datasets. Fails when features are highly correlated.
"Explain KNN and its limitations" → Store all training data. Classify by majority vote of k nearest neighbors. No training phase (lazy learning). Main limitations: O(n·d) inference, curse of dimensionality makes distances meaningless in high dimensions, requires feature scaling. Use KD-trees for speedup. Good for small datasets and when you want example-based explanations.
"How do you explain your model's predictions?" → For tree models: TreeSHAP (fast, exact Shapley values). For any model: permutation importance (global) or LIME (local). SHAP is the gold standard — each feature gets a value representing its contribution, values are additive and consistent. For regulatory compliance, SHAP's mathematical guarantees are strongest.
"Your model has good AUC but bad real-world performance" → Likely a calibration problem. Good AUC means the model ranks well, but the predicted probabilities may not be trustworthy. Check with a reliability diagram. Apply Platt scaling (binary) or temperature scaling (multiclass). Calibration matters when probabilities drive business decisions (loan risk, insurance pricing).