Classic Machine Learning

The foundation for ML interviews: bias-variance tradeoff, tree ensembles, evaluation at scale, class imbalance, feature engineering, and data leakage. Understand the why, not just the API calls.


1. Bias-Variance Tradeoff

The single most important framework for diagnosing model problems.

| Term | What it means | Symptom |
|---|---|---|
| Bias | Model too simple to capture the pattern | High training error and high validation error |
| Variance | Model too sensitive to training data noise | Low training error, high validation error (overfitting) |
| Irreducible error | Noise in the data itself — cannot be reduced | A floor on any model's error |

The tradeoff: Reducing bias typically increases variance and vice versa. The goal is to find the sweet spot.

Error
  │
  │ Total Error = Bias² + Variance + Irreducible
  │
  │      ↑ Bias²
  │       \        ← sweet spot →   ↗ Variance
  │        \___________________________/
  │
  └─────────────────────────────────────── Model Complexity
   (underfit)                            (overfit)

In interviews: When asked "your model performs badly," always diagnose:

  • Both train/val bad → bias problem → more features, bigger model, less regularization
  • Train good/val bad → variance problem → more data, regularization, simpler model
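
A quick way to run this diagnosis in code (a minimal sketch on synthetic data; the model choice is illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(f"train={model.score(X_tr, y_tr):.3f}  val={model.score(X_val, y_val):.3f}")
# Both low            -> bias problem
# Train high, val low -> variance problem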

2. Regularization

Techniques to prevent overfitting by penalizing model complexity.

| Technique | How it works | Effect | Use case |
|---|---|---|---|
| L2 (Ridge) | Add λ·Σw² to loss | Shrinks all weights toward zero, never to zero | Default choice; handles correlated features well |
| L1 (Lasso) | Add λ·Σ\|w\| to loss | Drives many weights to exactly zero (sparse) | Feature selection; when you suspect many irrelevant features |
| Elastic Net | L1 + L2 combined | Sparse + grouped selection | High-dimensional data with correlated features |
| Dropout | Randomly zero out neurons during training | Ensemble effect, prevents co-adaptation | Neural networks |
| Early stopping | Stop when validation loss stops improving | Implicit regularization | Any iterative training |

Key insight: L1 is sparsity-inducing because the gradient of |w| is ±1 regardless of w's magnitude — it applies the same force whether w = 0.001 or w = 10. L2's gradient 2λw goes to zero as w approaches zero, so it never fully eliminates weights.
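
You can see the difference directly (a minimal sketch on synthetic data; sklearn's alpha plays the role of λ):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=50, n_informative=5, random_state=0)

# L1 zeroes out most of the 45 irrelevant features; L2 only shrinks them
print("Lasso zero weights:", np.sum(Lasso(alpha=1.0).fit(X, y).coef_ == 0))
print("Ridge zero weights:", np.sum(Ridge(alpha=1.0).fit(X, y).coef_ == 0))  # typically 0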


3. Evaluation Metrics

Choosing the wrong metric is as bad as choosing the wrong model.

Classification Metrics

| Metric | Formula | When to use |
|---|---|---|
| Accuracy | (TP+TN)/Total | Balanced classes only; misleading otherwise |
| Precision | TP/(TP+FP) | When false positives are costly (spam filter — don't flag legit emails) |
| Recall | TP/(TP+FN) | When false negatives are costly (cancer detection — don't miss disease) |
| F1-Score | 2·(P·R)/(P+R) | Balance precision and recall; imbalanced classes |
| ROC-AUC | Area under ROC curve | Model ranking ability across all thresholds; good for imbalanced |
| PR-AUC | Area under Precision-Recall curve | Better than ROC-AUC for severely imbalanced datasets |
| Log Loss | -Σ y·log(p) | When calibrated probabilities matter (e.g., click-through rate) |

The Confusion Matrix

                  Predicted
              Positive | Negative
Actual Pos |    TP    |    FN    ← recall = TP/(TP+FN)
Actual Neg |    FP    |    TN
                ↑
          precision = TP/(TP+FP)

Regression Metrics

| Metric | Formula | Notes |
|---|---|---|
| MAE | mean(\|y - ŷ\|) | Robust to outliers; same units as target |
| RMSE | sqrt(mean((y - ŷ)²)) | Penalizes large errors more; sensitive to outliers |
| R² | 1 - SS_res/SS_tot | Fraction of variance explained; 1.0 is perfect |
| MAPE | mean(\|y - ŷ\|/y)·100 | Percentage error; fails when y = 0 |

Interview tip: Always ask "what's the cost of each error type?" before picking a metric.
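
Computing the classification metrics with sklearn (a minimal sketch; the labels and scores are made up for illustration):

from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]            # thresholded predictions
y_score = [.1, .2, .6, .3, .9, .8, .4, .2, .1, .7]  # predicted probabilities

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("roc-auc  :", roc_auc_score(y_true, y_score))   # needs scores, not labels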


4. Decision Trees & Ensembles

Decision Trees

Trees split data by the feature/threshold that maximizes information gain (classification) or minimizes MSE (regression).

  • Splitting criteria:
    • Gini impurity: 1 - Σp²ᵢ — computationally cheaper
    • Information Gain / Entropy: -Σpᵢ·log(pᵢ) — slightly better splits
  • Overfitting control: max_depth, min_samples_leaf, min_samples_split

Bagging vs Boosting — The Core Difference

Understanding the distinction between these two ensemble strategies is one of the most common ML interview topics.

| Dimension | Bagging | Boosting |
|---|---|---|
| Build order | Parallel — independent models | Sequential — each model learns from previous errors |
| Goal | Reduce variance | Reduce bias |
| Data sampling | Bootstrap (random with replacement) | Weighted — misclassified samples get higher weight |
| Combining | Uniform average / majority vote | Weighted vote (better models count more) |
| Overfitting risk | Low — averaging smooths noise | Higher — can overfit noisy data |
| Best for | High-variance models (deep trees) | High-bias models (shallow trees / stumps) |
| Main algorithms | Random Forest | AdaBoost, Gradient Boosting, XGBoost |

Intuition: Bagging is like asking 100 independent experts and taking a majority vote — outlier opinions cancel out. Boosting is like having each new expert focus specifically on the cases the previous experts got wrong.

Bagging:
Data → [Sample 1] → Tree 1 ─┐
     → [Sample 2] → Tree 2 ─┤→ Average/Vote → Final prediction
     → [Sample 3] → Tree 3 ─┘

Boosting:
Data → Tree 1 → Find errors → reweight
     → Tree 2 (focuses on errors of Tree 1) → Find errors → reweight
     → Tree 3 (focuses on errors of Trees 1+2) → ...
     → Weighted combination → Final prediction

AdaBoost in brief: After each weak learner, increase weights of misclassified samples exponentially. Each learner's vote is weighted by 0.5 · ln((1-error)/error) — better learners get louder voices.

Why does boosting reduce bias? Each sequential model explicitly targets the residual error of the current ensemble. The final model can capture complex patterns that a single high-bias model cannot.


Random Forests

Bagging (Bootstrap Aggregating) + random feature selection = Random Forest.

Training data
     │  Bootstrap sample (with replacement) × n_estimators
     ▼
[Tree 1]  [Tree 2]  ...  [Tree N]   ← each trained on ~63% of data
     │        │               │        each split: random subset of features
     └────────┴───────────────┘
                   │
              Majority vote (classification) / Average (regression)

  • Key hyperparameters: n_estimators, max_features (sqrt for classification, log2 or all for regression), max_depth
  • Out-of-bag (OOB) error: ~37% of data not used per tree → free validation signal
  • Feature importance: Mean decrease in impurity across all trees
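
The OOB signal comes for free in sklearn (a minimal sketch on synthetic data):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Each tree is scored on the ~37% of rows excluded from its bootstrap sample
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)   # validation estimate without a holdout set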

Gradient Boosting (XGBoost / LightGBM)

Boosting builds trees sequentially, each one correcting the errors of the previous ensemble.

F₀(x) = base prediction (e.g., mean)
F₁(x) = F₀(x) + η·h₁(x)    ← h₁ fits residuals of F₀
F₂(x) = F₁(x) + η·h₂(x)    ← h₂ fits residuals of F₁
...

Where η = learning rate (shrinkage).
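
The recurrence translates almost line-for-line into code (a from-scratch sketch for squared loss, where the residuals are the negative gradient):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=500)

eta = 0.1
pred = np.full_like(y, y.mean())              # F0: base prediction

for _ in range(100):
    residuals = y - pred                      # errors of the current ensemble
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += eta * h.predict(X)                # F_{m+1} = F_m + eta * h

print("train MSE:", np.mean((y - pred) ** 2))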

| | Random Forest | Gradient Boosting |
|---|---|---|
| Training | Parallel | Sequential |
| Speed | Faster to train | Slower; XGBoost/LightGBM are highly optimized |
| Performance | Good baseline | Usually wins on tabular data |
| Overfitting | Hard to overfit | Can overfit; needs careful tuning |
| Key params | n_estimators, max_features | n_estimators, learning_rate, max_depth, subsample |

XGBoost vs LightGBM:

| | XGBoost | LightGBM |
|---|---|---|
| Split strategy | Level-wise (breadth-first) | Leaf-wise (best-first) |
| Speed | Solid | Faster for large datasets |
| Memory | Higher | Lower (histogram binning) |
| Best for | General purpose | Large datasets, high-cardinality categoricals |

5. Naive Bayes

A probabilistic classifier based on Bayes' theorem with the naive assumption that features are conditionally independent given the class.

P(y|x₁,...,xₙ) ∝ P(y) · ∏ P(xᵢ|y)

Why "Naive"?

The independence assumption is almost never true in practice (e.g., word frequencies in text are correlated), yet Naive Bayes works surprisingly well because:

  • Classification only needs the most probable class, not calibrated probabilities
  • Errors in individual feature probabilities often cancel out
  • With limited data, fewer parameters to estimate = lower variance

Variants

| Variant | Distribution assumed | Use case |
|---|---|---|
| Gaussian NB | Features are continuous, normally distributed | Real-valued features |
| Multinomial NB | Features are counts (word frequencies) | Text classification (bag-of-words, TF-IDF) |
| Bernoulli NB | Features are binary (word present/absent) | Short text, binary features |
| Complement NB | Corrects for imbalanced classes | Imbalanced text classification |

When Naive Bayes Shines

  • Text classification (spam detection, sentiment) — the classic use case
  • Very fast training and inference: O(n·d) where n = samples, d = features
  • Works well with high-dimensional sparse data (thousands of features, few samples)
  • Strong baseline that's hard to beat with small datasets

When It Fails

  • Highly correlated features violate the independence assumption badly
  • Cannot learn feature interactions (unlike trees or neural networks)
  • Probability estimates are poorly calibrated (confidences are extreme)

Interview tip: Naive Bayes is the go-to answer for "What's a simple baseline for text classification?" It's also a great example of a high-bias, low-variance model — complements the bias-variance discussion.
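
The classic baseline in a few lines (a minimal sketch; the toy corpus is illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts  = ["win a free prize now", "meeting at noon tomorrow",
          "free cash click here", "lunch with the team"]
labels = [1, 0, 1, 0]   # 1 = spam

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["free prize inside"]))   # expect spam (1)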


6. K-Nearest Neighbors (KNN)

A non-parametric, instance-based algorithm: store all training data, classify new points by majority vote of their k nearest neighbors.

1. Compute distance from query point to all training points
2. Select k nearest neighbors
3. Classification: majority vote    |    Regression: average of k values

Distance Metrics

| Metric | Formula | Use case |
|---|---|---|
| Euclidean | √(Σ(xᵢ - yᵢ)²) | Default; continuous features |
| Manhattan | Σ\|xᵢ - yᵢ\| | High-dimensional data; less sensitive to outliers than Euclidean |
| Cosine | 1 - (x·y)/(‖x‖·‖y‖) | Text/embeddings (direction matters, not magnitude) |
| Minkowski | (Σ\|xᵢ - yᵢ\|ᵖ)^(1/p) | Generalizes both: p=1 is Manhattan, p=2 is Euclidean |

Key Properties

  • No training phase — all computation at inference ("lazy learning")
  • Feature scaling is critical — features with larger ranges dominate distance calculations
  • Curse of dimensionality — in high dimensions, all points become equidistant; KNN breaks down
  • Choosing k: Small k → low bias, high variance (noisy). Large k → high bias, low variance (smooth). Use cross-validation. Odd k avoids ties in binary classification.

When to Use KNN

  • Small datasets where simplicity matters
  • Non-linear decision boundaries
  • As a baseline before trying complex models
  • When interpretability via examples matters ("here are the 5 most similar cases")

When NOT to Use KNN

  • Large datasets (inference is O(n·d) per query)
  • High-dimensional data (curse of dimensionality)
  • When features are on different scales (unless you scale first)

Optimization: KD-Trees or Ball Trees reduce search from O(n) to O(log n) in low dimensions. For high-dimensional approximate search, use ANN algorithms (FAISS, HNSW).
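
Because scaling is critical, KNN belongs inside a Pipeline (a minimal sketch on a built-in dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is re-fit on each training fold, so no test statistics leak in
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print("scaled  :", cross_val_score(knn, X, y, cv=5).mean())
print("unscaled:", cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean())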


7. Support Vector Machines (SVM)

SVMs find the maximum-margin hyperplane separating classes. The margin is the gap between the plane and the nearest data points (support vectors).

         ○ ○ ○
          ○  ←── Support vectors
    ─────────────  ← optimal hyperplane (max margin)
          ●  ←── Support vectors
         ● ● ●

The Kernel Trick

Real data is rarely linearly separable. Kernels map data to higher dimensions without computing the transformation explicitly.

| Kernel | Formula | Use case |
|---|---|---|
| Linear | x·x' | Linearly separable data; text classification (high-dim sparse) |
| RBF (Gaussian) | exp(-γ‖x-x'‖²) | Most common; non-linear; sensitive to feature scaling |
| Polynomial | (γ·x·x' + r)^d | Image features, NLP |

Key hyperparameters:

  • C: Regularization. High C → hard margin (low bias, high variance). Low C → soft margin (high bias, low variance).
  • γ (RBF): High γ → small influence radius → complex boundaries (overfit). Low γ → smoother.

When to use SVM: Small-to-medium datasets, high-dimensional data (text), when interpretability is not required.
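
In practice (a minimal sketch; SVMs need scaled features, hence the Pipeline):

from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# C and gamma trade margin width against boundary complexity
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X, y)
print("train accuracy:", svm.score(X, y))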


8. Logistic Regression

Despite the name, it's a classification algorithm. The output is a probability via the sigmoid function.

P(y=1|x) = σ(w·x + b) = 1 / (1 + exp(-(w·x + b)))

  • Loss: Binary Cross-Entropy = -y·log(p) - (1-y)·log(1-p)
  • Multiclass: Softmax extends this: P(y=k|x) = exp(wₖ·x) / Σⱼ exp(wⱼ·x)

Why use it?

  • Highly interpretable: coefficient = log-odds change per unit increase in feature
  • Fast, scales to millions of samples
  • Calibrated probabilities out of the box
  • Strong baseline for any classification problem

Assumptions: Linearly separable classes (in feature space), no multicollinearity, features roughly scaled.
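
The interpretability claim in code (a minimal sketch; exponentiating a coefficient gives the odds ratio per unit increase in that feature):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scale so coefficients are comparable

clf = LogisticRegression(max_iter=1000).fit(X, y)

# exp(w_i): multiplicative change in odds per one-unit increase in feature i
print("odds ratios:", np.exp(clf.coef_[0])[:3])
print("predicted probabilities:", clf.predict_proba(X[:2])[:, 1])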


9. K-Means Clustering

Unsupervised algorithm that partitions n observations into k clusters by minimizing within-cluster variance.

Algorithm:

1. Initialize k centroids randomly (or K-means++ for better initialization)
2. Assign each point to nearest centroid
3. Recompute centroids as mean of assigned points
4. Repeat 2-3 until convergence

Choosing k — The Elbow Method: Plot inertia (within-cluster sum of squares) vs k. The "elbow" is where adding more clusters gives diminishing returns. Also use Silhouette Score (−1 to 1, higher = better separation).
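
Both diagnostics in one loop (a minimal sketch on synthetic blobs):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, "inertia:", round(km.inertia_, 1),
          "silhouette:", round(silhouette_score(X, km.labels_), 3))
# Look for the inertia elbow and the silhouette peak (here, around k=4)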

Limitations:

  • Assumes spherical, equal-sized clusters
  • Sensitive to outliers (consider K-Medoids instead)
  • Must specify k upfront
  • Fails on non-convex cluster shapes (use DBSCAN for those)

10. Dimensionality Reduction: PCA

PCA finds the directions (principal components) of maximum variance in the data and projects data onto them.

Original features (high-dim) → PCA transformation → New axes (lower-dim)

How it works:

  1. Standardize features (zero mean, unit variance)
  2. Compute covariance matrix
  3. Eigen-decompose to get eigenvectors (principal components) and eigenvalues (variance explained)
  4. Keep top-k components

Key concepts:

  • Explained variance ratio: How much total variance does each component capture? Choose k where cumulative variance ≥ 95%.
  • Linear transformation only: PCA cannot capture non-linear structure (use t-SNE / UMAP for visualization)

When to use: Remove multicollinearity before linear models, speed up training, reduce storage. Not great for interpretability.
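
Picking k by cumulative explained variance (a minimal sketch on a built-in dataset):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)     # step 1: standardize

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
k = int(np.argmax(cumvar >= 0.95)) + 1    # smallest k with >= 95% variance
print(f"{k} of {X.shape[1]} components capture 95% of the variance")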


11. Cross-Validation

Robust way to estimate generalization performance without wasting data.

| Method | How it works | Use case |
|---|---|---|
| k-Fold CV | Split into k folds; train on k-1, test on 1; repeat | Standard; k=5 or 10 |
| Stratified k-Fold | Preserve class distribution in each fold | Imbalanced classification |
| Leave-One-Out (LOO) | k = n; expensive but lowest bias | Very small datasets |
| Time-Series Split | Only use past data to predict future | Any temporal data — critical to avoid leakage |

Train / Validation / Test split philosophy:

  • Train: model learns from this
  • Validation: hyperparameter tuning, model selection
  • Test: final unbiased evaluation — touch it once
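
The two splitters you'll reach for most (a minimal sketch on synthetic data):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Stratification preserves the 90/10 class ratio in every fold
print(cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=StratifiedKFold(n_splits=5)).mean())

# Temporal splits: each fold trains strictly on the past
for tr, te in TimeSeriesSplit(n_splits=3).split(X):
    print("train up to", tr.max(), "-> test", te.min(), "..", te.max())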

12. Class Imbalance

Real-world datasets are often skewed (e.g., fraud detection: 0.1% positive).

| Strategy | How | When to use |
|---|---|---|
| Class weights | Penalize minority errors more in loss: weight = n_majority / n_minority | Always try first — zero cost |
| Oversampling (SMOTE) | Synthesize new minority samples by interpolating between existing ones | Moderate imbalance; avoid for high-dimensional data |
| Undersampling | Randomly remove majority samples | When majority class is huge; risks losing information |
| Threshold tuning | Move decision boundary from 0.5 to optimize precision/recall | Any probabilistic classifier |
| Focal Loss | Down-weight easy examples so training focuses on hard ones | Neural networks with severe imbalance (object detection, fraud) |
| Use PR-AUC / F1 | ROC-AUC hides imbalance problems | Always with imbalanced data |

SMOTE in brief: For each minority sample, find k nearest minority neighbors, generate synthetic sample along the connecting line segment.
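To apply SMOTE without leaking synthetic points into validation folds, use a sampler-aware pipeline (a minimal sketch assuming the imbalanced-learn package):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline   # sampler-aware Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# SMOTE runs inside each training fold only; test folds stay untouched
clf = make_pipeline(SMOTE(random_state=0), LogisticRegression(max_iter=1000))
print(cross_val_score(clf, X, y, cv=5, scoring="f1").mean())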

Focal Loss in depth: Standard cross-entropy treats all examples equally. With a 99:1 imbalance, the model learns to predict "majority" for everything and still achieves low loss — the minority signal is drowned out.

Standard CE:  L = -log(p)
Focal Loss:   L = -α · (1 - p)^γ · log(p)
  • p = predicted probability for the correct class
  • (1 - p)^γ = modulating factor — when p is high (easy example, already well-classified), this term → 0, effectively ignoring it
  • γ (focusing parameter, typically 2): higher γ = stronger down-weighting of easy examples
  • α (class weight): balances frequency difference, same as class_weight in sklearn

Concrete effect: With γ = 2, an easy example with p = 0.9 gets loss weight (1-0.9)² = 0.01 — 100× less than a hard example with p = 0.1. The network spends its gradient budget learning the hard, rare cases.

Originally introduced in RetinaNet for object detection (background vastly outnumbers foreground objects). Works for any neural network loss function — not available for tree models.
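
A minimal PyTorch sketch of binary focal loss, following the formulation above:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy, well-classified examples."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)           # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits  = torch.tensor([3.0, -3.0, 0.1])   # easy pos, easy neg, hard pos
targets = torch.tensor([1.0,  0.0, 1.0])
print(focal_loss(logits, targets))         # dominated by the hard example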


13. Feature Engineering

Often more impactful than algorithm choice.

Encoding Categorical Variables

| Method | How | Use case |
|---|---|---|
| One-Hot Encoding | Binary column per category | Low cardinality (< 20 categories), linear models |
| Label Encoding | Map category to integer | Tree models only (they handle arbitrary orderings) |
| Target Encoding | Replace category with mean target value | High cardinality; use with regularization to avoid leakage |
| Hashing | Map to fixed-size vector via hash function | Very high cardinality, online learning |

Handling Missing Values

The mechanism of missingness matters as much as the strategy.

Missing data mechanisms:

  • MCAR (Missing Completely At Random): Missingness is unrelated to any variable — a sensor randomly fails. Safe to impute simply.
  • MAR (Missing At Random): Missingness depends on observed variables — older people skip income questions, but given age, the missingness is random. Can impute using other features.
  • MNAR (Missing Not At Random): Missingness depends on the missing value itself — rich people hide income. Simple imputation is biased; need domain knowledge or model the missingness.

| Strategy | How | When |
|---|---|---|
| Mean imputation | Replace with column mean | MCAR numeric; fast baseline; distorts distribution |
| Median imputation | Replace with column median | MCAR numeric with outliers — more robust than mean |
| Mode imputation | Replace with most frequent value | MCAR categorical |
| KNN imputation | Use k nearest neighbors' values | MAR; preserves local relationships; slow at scale |
| Iterative imputation (MICE) | Regress each missing column on others; iterate | MAR; best quality; expensive |
| Model imputation | Train a model to predict missing column | MNAR; complex but powerful |
| Add indicator column | Binary flag: col_missing = 1 if NaN | When missingness itself is predictive (often is) |
| Drop rows | Remove rows with missing values | Only if < 1-5% missing AND MCAR AND you have enough data |
| Tree-native handling | XGBoost/LightGBM learn optimal direction for NaN | Best for tree models — don't impute, just pass NaN |

Critical rule: Fit imputers on training data only, then apply to validation/test. Fitting on the full dataset leaks test distribution — a subtle form of data leakage.

# Correct
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                     # learn statistics from train only
X_train_imp = imputer.transform(X_train)
X_val_imp   = imputer.transform(X_val)   # use train statistics only

# Wrong — data leakage
imputer.fit(X_all)   # sees test data statistics

Feature Scaling

  • StandardScaler: Zero mean, unit variance. Required for: SVM, logistic regression, PCA, KNN.
  • MinMaxScaler: Scale to [0,1]. Use when data has bounded range or for neural networks.
  • RobustScaler: Uses median and IQR. Use when outliers are present.
  • Trees don't need scaling — splits are monotone-invariant.

14. Data Leakage

Data leakage occurs when information from outside the training window "leaks" into the model, producing optimistically biased evaluation metrics that don't hold in production. It's one of the most common and costly mistakes in applied ML.

Types of Leakage

1. Target Leakage (most common)

A feature that is a direct consequence of the target — it's only available after the outcome is known.

Example: Predicting loan default
Leaked feature: "loan_was_restructured" (only happens after default)
Without it: model seems mediocre
With it: model gets 99% accuracy in training; 50% in production

2. Train-Test Contamination

Test data influences the training pipeline in any way.

| Leaky practice | Why it's wrong | Fix |
|---|---|---|
| StandardScaler.fit(X_all) | Test mean/std leaks into train scaling | Fit on train only |
| imputer.fit(X_all) | Test distribution leaks into imputation | Fit on train only |
| Feature selection on full dataset | Model "knows" test feature variance | Select inside CV fold |
| Hyperparameter search on test set | Overfits test set | Use held-out validation |
| SMOTE on full dataset before split | Synthetic samples from test in training | SMOTE only on train fold |

3. Temporal Leakage

Using future data to predict the past — extremely common with time-series.

Example: Predicting stock price tomorrow using a moving average
Leaky:    7-day moving average includes days after prediction date
Correct:  Only use data strictly before prediction timestamp

4. Group Leakage

Data from the same entity appears in both train and test.

Example: Medical image classification
Leaky:   5 images from same patient in both train and test
Correct: Split by patient_id, not by image
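
Group-aware splitting in sklearn (a minimal sketch; patient_id is a hypothetical group column):

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
patient_id = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

# GroupKFold guarantees no patient ever appears in both train and test
for tr, te in GroupKFold(n_splits=5).split(X, y, groups=patient_id):
    assert set(patient_id[tr]).isdisjoint(patient_id[te])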

How to Detect Leakage

  • Suspiciously high validation accuracy (> 99% on a hard problem)
  • Feature importance dominated by a single feature you wouldn't have at prediction time
  • Temporal patterns: feature timestamp is after event timestamp
  • Drop feature → big accuracy drop: strong sign it's leaky
  • Shuffle test labels → still high accuracy: model learned from index/order, not features

Prevention Checklist

□ Split train/test BEFORE any preprocessing
□ Fit all transformers (scaler, imputer, encoder) on train fold only
□ For time-series: always split by time, not randomly
□ For groups (patients, users): split by group_id
□ Ask for every feature: "Is this available at prediction time?"
□ Wrap preprocessing in a Pipeline to enforce train-only fitting
□ SMOTE/oversampling only inside training fold
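
The Pipeline point in practice (a minimal sketch; every transformer is re-fit inside each CV training fold, so nothing leaks):

from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)

pipe = make_pipeline(SimpleImputer(strategy="median"),   # fit on train fold only
                     StandardScaler(),                   # same
                     LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5).mean())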

15. High-Dimensional Features

When your dataset has many features (p >> n, or p in the thousands), naive ML breaks down.

Problems with High Dimensionality

  • Curse of dimensionality: In high-dim space, all points become equidistant → KNN, clustering, distance-based methods fail
  • Overfitting: More features than samples → model memorizes training data
  • Multicollinearity: Highly correlated features → unstable coefficient estimates for linear models
  • Computational cost: Training and inference slow down
  • Irrelevant features: Noise features can hurt performance more than they help

Strategy 1: Filter Methods (Before Training)

Remove features before any model is trained. Fast, model-agnostic.

| Method | How | Best for |
|---|---|---|
| Variance threshold | Remove near-zero variance features | Any — quick cleanup |
| Correlation filter | Remove one of any two features with \|corr\| > 0.95 | Linear models |
| Chi-squared test | For categorical features vs categorical target | Text, categoricals |
| Mutual Information | Non-linear dependency between feature and target | Any type |
| ANOVA F-test | Linear association between numeric feature and target | Numeric features |

Strategy 2: Wrapper Methods (Model-in-the-Loop)

Use a model's performance to select features. More accurate but expensive.

| Method | How |
|---|---|
| RFE (Recursive Feature Elimination) | Train model, remove lowest-importance feature, repeat |
| Forward selection | Start with 0 features, add one at a time (best gain) |
| Backward elimination | Start with all features, remove one at a time |

Strategy 3: Embedded Methods (During Training)

Feature selection is part of the model fitting.

| Method | How |
|---|---|
| L1 Regularization (Lasso) | Drives irrelevant feature weights to exactly 0 |
| Tree-based importance | Random Forest / XGBoost: use feature importance scores |
| ElasticNet | L1 + L2 — sparse but handles correlated features |
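
L1-based selection in one step (a minimal sketch; alpha controls how aggressively features are dropped):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       random_state=0)

# Keep only the features whose Lasso weights survive the L1 penalty
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
print("kept", selector.get_support().sum(), "of", X.shape[1], "features")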

Strategy 4: Dimensionality Reduction

Transform features into a smaller set of derived features.

| Method | Type | Use case |
|---|---|---|
| PCA | Linear | Remove multicollinearity; speed up linear models |
| t-SNE | Non-linear | Visualization only — not for preprocessing |
| UMAP | Non-linear | Visualization + preprocessing; faster than t-SNE |
| Autoencoders | Deep learning | Complex non-linear structure; image/text features |
| TruncatedSVD (LSA) | Linear | Sparse matrices (text/TF-IDF); memory-efficient |

Practical Decision Flowchart

p > 10,000 features?
    ├── Yes → Start with variance threshold + correlation filter (fast wins)
    │         → Then L1 / RFE / tree importance
    │         → PCA as last resort (loses interpretability)
    └── No  → Tree-based importance + SHAP for selection
              → Correlation filter if linear model

n < p (more features than samples)?
    ├── Use regularized models (Ridge, Lasso, ElasticNet)
    ├── PCA to reduce to n < p before any model
    └── Never use unregularized logistic regression or OLS

16. Hyperparameter Tuning

| Method | How | Pros/Cons |
|---|---|---|
| Grid Search | Try every combination | Exhaustive; exponential in dimensionality |
| Random Search | Sample random combinations | Surprisingly effective; more efficient than grid for >3 params |
| Bayesian Optimization | Build surrogate model of objective, sample where improvement expected | Most sample-efficient; use Optuna or Hyperopt |
| Halving / Successive Elimination | Early-stop bad configs | Fast; built into scikit-learn |

Rule of thumb: Random search beats grid search in high-dimensional parameter spaces (Bergstra & Bengio, 2012). Use Bayesian optimization when evaluation is expensive.
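
Random search with proper distributions (a minimal sketch; note the log-scale sampling for learning_rate):

from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "learning_rate": loguniform(1e-3, 0.3),
        "max_depth": randint(2, 8),
        "n_estimators": randint(50, 500),
    },
    n_iter=20, cv=5, random_state=0,
).fit(X, y)

print(search.best_params_, round(search.best_score_, 3))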


17. Model Interpretability: SHAP & LIME

Understanding why a model makes a prediction is increasingly critical — for debugging, regulatory compliance, and stakeholder trust. This is one of the hottest ML interview topics.

Feature Importance vs Explainability

| Method | Scope | How it works | Pros | Cons |
|---|---|---|---|---|
| Impurity-based importance | Global | Mean decrease in Gini/entropy across all splits | Fast; built into sklearn | Biased toward high-cardinality features |
| Permutation importance | Global | Shuffle feature, measure accuracy drop | Model-agnostic; unbiased | Misleading with correlated features |
| SHAP (SHapley Additive exPlanations) | Local + Global | Game theory: contribution of each feature to each prediction | Mathematically grounded; consistent | Expensive for large models |
| LIME (Local Interpretable Model-Agnostic Explanations) | Local | Fit a simple model (linear/tree) on perturbed samples near the point | Fast; intuitive | Explanations can be unstable |
| Partial Dependence Plots (PDP) | Global | Show marginal effect of a feature on prediction | Easy to understand | Assumes feature independence |
| ICE (Individual Conditional Expectation) | Local | PDP for individual instances | Shows heterogeneity | Can be noisy with many lines |

SHAP — The Gold Standard

SHAP values are based on Shapley values from cooperative game theory. For each prediction, each feature gets a value representing its contribution to the difference between the prediction and the average prediction.

f(x) = base_value + SHAP(feature₁) + SHAP(feature₂) + ... + SHAP(featureₙ)

Example (loan approval):
  Base prediction: 65% approval
  + Income:        +20% (high income helps)
  + Credit score:  +10% (good score helps)
  + Debt ratio:    -15% (high debt hurts)
  = Final:          80% approval

Key properties:

  • Additivity: SHAP values sum to the difference between prediction and mean
  • Consistency: If a feature's contribution increases, its SHAP value never decreases
  • Local accuracy: The explanation exactly matches the model output

TreeSHAP: Optimized O(TLD²) algorithm for tree models (vs exponential for exact Shapley). Built into XGBoost and LightGBM.
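
A minimal sketch assuming the shap and xgboost packages are installed:

import shap
import xgboost
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = xgboost.XGBClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)    # exact Shapley values for tree ensembles
shap_values = explainer.shap_values(X)   # one value per sample per feature

shap.summary_plot(shap_values, X)        # global beeswarm view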

LIME — Quick Local Explanations

1. Pick instance to explain
2. Generate perturbed samples around it
3. Get model predictions for perturbed samples
4. Fit a simple interpretable model (weighted by proximity)
5. The simple model's coefficients = explanation

When to use LIME over SHAP: When you need fast, approximate explanations for black-box models (neural nets, APIs) and SHAP is too expensive.

Interview tip: "For tree models, I'd use TreeSHAP (fast, exact). For neural networks or black-box APIs, I'd use LIME or KernelSHAP. For global understanding, I'd plot SHAP summary plots (beeswarm). For regulatory compliance, SHAP's mathematical guarantees (consistency, additivity) are stronger than LIME."


18. Probability Calibration

A model's predicted probability should match the true frequency. If a model predicts 80% confidence for 1000 samples, roughly 800 should actually be positive. Many models are not well-calibrated out of the box.

Why Calibration Matters

  • Decision thresholds: If probabilities are miscalibrated, tuning the classification threshold produces unexpected results
  • Risk scoring: In medical, fraud, or insurance applications, probabilities directly drive decisions
  • Model combination: Ensembling requires calibrated scores to weight models properly
  • Ranking is not enough: A model can rank well (high AUC) but have terrible probability estimates

Which Models Are Well-Calibrated?

| Model | Calibration quality | Why |
|---|---|---|
| Logistic Regression | Good (inherently calibrated) | Optimizes log loss directly |
| Random Forest | Poor — overconfident near 0/1 | Votes are discrete fractions |
| XGBoost / GBMs | Moderate — depends on objective | Log loss objective helps |
| Neural Networks | Often overconfident | Modern NNs tend to be miscalibrated |
| SVM | No probabilities by default | Needs Platt scaling to output probabilities |
| Naive Bayes | Very poor — extreme probabilities | Independence assumption distorts estimates |

Calibration Methods

| Method | How it works | When to use |
|---|---|---|
| Platt Scaling | Fit a logistic regression on model outputs vs true labels | Binary classification; works well with sigmoid-shaped distortions |
| Isotonic Regression | Fit a non-decreasing step function | More flexible; needs more data (~1000+ samples) |
| Temperature Scaling | Divide logits by learned T before softmax | Neural networks; single parameter; multiclass |

Measuring Calibration

  • Reliability diagram: Plot predicted probability vs observed frequency in bins. Perfect calibration = diagonal line.
  • Expected Calibration Error (ECE): Weighted average of |predicted_prob - observed_freq| across bins. Lower = better.
  • Brier Score: mean((predicted - actual)²) — combines calibration and discrimination.

Reliability Diagram:
Observed freq
     1.0 |          ·  /
         |        ·  /
     0.5 |    ·   /       ← well-calibrated (near diagonal)
         |  ·  /
     0.0 |·/
         └────────────
          0.0       1.0
         Predicted probability
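
Measuring and fixing calibration in sklearn (a minimal sketch using Platt scaling on a Random Forest):

from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Platt scaling: a sigmoid fit on out-of-fold predictions of the base model
clf = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                             method="sigmoid", cv=5).fit(X_tr, y_tr)

# Reliability diagram data: well calibrated => prob_pred tracks prob_true
prob_true, prob_pred = calibration_curve(y_te, clf.predict_proba(X_te)[:, 1],
                                         n_bins=10)
print(list(zip(prob_pred.round(2), prob_true.round(2))))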

Interview tip: "I'd check calibration with a reliability diagram and ECE. If the model is overconfident (most are), I'd apply Platt scaling for binary or temperature scaling for multiclass. Calibration is essential when probabilities drive business decisions — 'is this user 90% likely to churn?' needs to actually mean 90%."


19. Stacking & Advanced Ensembles

Beyond bagging and boosting, stacking (stacked generalization) is a powerful ensemble technique that uses a meta-learner to combine base model predictions.

How Stacking Works

Training:
  Fold 1: Train Models A,B,C on folds 2-5 → predict on fold 1
  Fold 2: Train Models A,B,C on folds 1,3-5 → predict on fold 2
  ...
  Result: out-of-fold predictions from each base model

  Meta-learner: Train on [pred_A, pred_B, pred_C] → final prediction

Inference:
  Input → Model A prediction ─┐
        → Model B prediction ─┤→ Meta-learner → final prediction
        → Model C prediction ─┘

Key rules:

  • Base models should be diverse (e.g., Random Forest + XGBoost + Logistic Regression + Neural Net)
  • Use out-of-fold predictions for meta-learner training to avoid data leakage
  • Meta-learner is typically simple (logistic regression, ridge regression) to avoid overfitting
  • Can stack multiple levels (but diminishing returns after 2)
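
sklearn implements exactly this out-of-fold scheme (a minimal sketch with deliberately diverse base models):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),   # simple meta-learner
    cv=5)                                   # out-of-fold predictions, no leakage

print(cross_val_score(stack, X, y, cv=5).mean())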

Stacking vs Bagging vs Boosting

| Dimension | Bagging | Boosting | Stacking |
|---|---|---|---|
| Base models | Same type, different data | Same type, sequential | Different types |
| Training | Parallel | Sequential | Two-stage |
| Reduces | Variance | Bias | Both (via diversity) |
| Example | Random Forest | XGBoost | RF + XGB + LR → meta |

Blending (Simplified Stacking)

Use a simple holdout set (instead of cross-validation) for meta-learner training. Faster but wastes data.

Interview tip: Stacking typically wins Kaggle competitions but adds complexity in production. In interviews, mention it to show depth but note the operational trade-off.


20. Learning Curves & Diagnosing Model Performance

Learning curves plot model performance vs training set size or training iterations. They're the primary diagnostic tool for bias-variance analysis.

Training Size Learning Curve

Error
  │
  │  ─── Training error (rises as data grows — harder to memorize)
  │  
  │  ─── Validation error (drops as data grows — better generalization)
  │
  │            Gap = variance
  └─────────────────────────── Training set size

High Bias (underfitting):         High Variance (overfitting):
Error                              Error
  │  val ──────────────             │  val ──────────
  │                                 │        \        ← large gap
  │  train ────────────             │           train ──
  │  ↑ both high, close gap         │  ↑ small train err, big gap
  └──────────────────               └──────────────────
   Training set size                 Training set size

What Learning Curves Tell You

| Pattern | Diagnosis | Action |
|---|---|---|
| Both errors high, converging | High bias | More features, bigger model, less regularization |
| Train low, val high, not converging | High variance | More data, regularization, simpler model |
| Both errors low, converging | Good fit | Deploy |
| Train error increasing with data | Normal — harder to memorize more data | Not a problem |
| Val error not improving with more data | Model has learned all it can from this feature set | Better features, different algorithm |

Practical Usage

import numpy as np
from sklearn.model_selection import learning_curve

# model, X, y: any estimator and dataset
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5
)

Interview tip: When asked "how would you debug a model that's performing poorly?", learning curves should be your first tool. They tell you whether you need more data (variance problem) or a better model (bias problem).


21. Interview Quick-Reference

"Your model is overfitting, what do you do?" → More training data → add regularization (L2) → reduce model complexity → dropout / early stopping → feature selection to remove noise

"Your model has high bias, what do you do?" → More features / better features → bigger model → reduce regularization → add polynomial/interaction terms

"How do you handle imbalanced data?" → class_weight='balanced' first → tune threshold → SMOTE if needed → use F1/PR-AUC metrics

"Explain XGBoost vs Random Forest" → RF: parallel bagging, each tree independent, robust, slower to improve with more trees. XGBoost: sequential boosting, each tree fixes previous errors, usually higher accuracy on tabular data, more hyperparameters to tune.

"What feature importance methods exist?" → Tree-based (mean decrease impurity — fast but biased toward high cardinality), Permutation importance (model-agnostic, unbiased), SHAP values (local + global, most reliable).

"What is data leakage and how do you prevent it?" → Leakage = information from outside the training window contaminates the model. Types: target leakage (feature only available after the event), train-test contamination (scaler/imputer fit on all data), temporal leakage (using future data), group leakage (same user in train and test). Prevention: split first, then fit transformers on train only, wrap in Pipeline, always ask "is this feature available at prediction time?"

"You have 50,000 features. What do you do?" → First, remove near-zero variance features and high-correlation pairs (fast filter). Then L1 regularization to get sparse weights, or tree-based importance to rank and prune. Use PCA to reduce to manageable dimensions if linear model needed. Never use unregularized models when p >> n.

"Bagging vs Boosting — what does each reduce?" → Bagging reduces variance by averaging independent models (Random Forest). Boosting reduces bias by sequentially fixing errors (XGBoost). Bagging is parallelizable; Boosting is sequential. Use Random Forest as a fast robust baseline; use XGBoost when you need maximum accuracy on tabular data.

"When would you use Naive Bayes?" → Text classification (spam, sentiment) with bag-of-words/TF-IDF features. It's fast (O(n·d) training), works well with high-dimensional sparse data, and gives a strong baseline that's surprisingly hard to beat on small text datasets. Fails when features are highly correlated.

"Explain KNN and its limitations" → Store all training data. Classify by majority vote of k nearest neighbors. No training phase (lazy learning). Main limitations: O(n·d) inference, curse of dimensionality makes distances meaningless in high dimensions, requires feature scaling. Use KD-trees for speedup. Good for small datasets and when you want example-based explanations.

"How do you explain your model's predictions?" → For tree models: TreeSHAP (fast, exact Shapley values). For any model: permutation importance (global) or LIME (local). SHAP is the gold standard — each feature gets a value representing its contribution, values are additive and consistent. For regulatory compliance, SHAP's mathematical guarantees are strongest.

"Your model has good AUC but bad real-world performance" → Likely a calibration problem. Good AUC means the model ranks well, but the predicted probabilities may not be trustworthy. Check with a reliability diagram. Apply Platt scaling (binary) or temperature scaling (multiclass). Calibration matters when probabilities drive business decisions (loan risk, insurance pricing).