Neural networks, optimization, CNNs, RNNs, attention, and training at scale. These are the topics that come up when a company wants to know you actually understand what's happening under the hood — not just `model.fit()`.
A neural network is a composition of linear transformations interleaved with non-linear activation functions.
Input x → [Linear: Wx + b] → [Activation σ] → [Linear: W'h + b'] → [Activation σ'] → Output
Why non-linearities? Without them, stacking linear layers collapses to a single linear transformation. Non-linearities are what let the network approximate arbitrary continuous functions (universal approximation).
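A minimal sketch of this composition in PyTorch (layer sizes are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn as nn

# Two linear layers with a non-linearity between them:
# x → [Wx + b] → [σ] → [W'h + b']
mlp = nn.Sequential(
    nn.Linear(16, 32),   # Wx + b
    nn.ReLU(),           # σ
    nn.Linear(32, 4),    # W'h + b'
)

x = torch.randn(8, 16)   # batch of 8 inputs
out = mlp(x)             # shape: (8, 4)
```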
| Function | Formula | Pros | Cons | Use case |
|---|---|---|---|---|
| Sigmoid | 1/(1+e^−x) | Output in [0,1]; probabilities | Vanishing gradient for large \|x\|; saturates | Output layer (binary classification) |
| Tanh | (e^x − e^−x)/(e^x + e^−x) | Zero-centered; [−1,1] | Still vanishes for large \|x\| | RNN hidden states |
| ReLU | max(0, x) | No vanishing gradient for x > 0; fast | Dying ReLU: dead neurons if always negative | Default for hidden layers |
| Leaky ReLU | max(αx, x), α≈0.01 | Fixes dying ReLU | Extra hyperparameter | When ReLU neurons die |
| GELU | x·Φ(x) | Smooth; empirically better than ReLU | Slower to compute | Transformers, BERT, GPT |
| Softmax | e^xᵢ / Σⱼ e^xⱼ | Outputs sum to 1; probabilities | Can saturate; numerically unstable (use log-softmax) | Output layer (multiclass) |
Dying ReLU problem: If a neuron's pre-activation is always negative, gradient is 0, weights never update. Fix: Leaky ReLU, ELU, careful weight init.
Poor initialization causes vanishing/exploding gradients before training even starts. The goal: keep the variance of activations and gradients roughly constant across layers.
If all weights start at zero: every neuron computes the same output, every gradient is the same, the network never breaks symmetry — it stays stuck. You must initialize with random values, but the scale matters enormously.
- Too large: activations explode → saturated neurons → vanishing gradients (for sigmoid/tanh)
- Too small: activations shrink to zero → gradients vanish immediately
| Method | Formula | Designed for | Key property |
|---|---|---|---|
| Random Normal | N(0, 0.01) | — (naïve) | Often too small for deep nets |
| Xavier / Glorot | Uniform(±√(6/(nᵢₙ+nₒᵤₜ))) | Sigmoid / Tanh | Keeps variance equal in/out |
| He (Kaiming) | N(0, σ²), σ = √(2/nᵢₙ) | ReLU | Corrects for ReLU killing half its inputs |
| Orthogonal | Random orthogonal matrix | Deep RNNs | Preserves gradient magnitude exactly |
| Pretrained weights | Copy from a pretrained model | Transfer learning | Best starting point when available |
Xavier derivation intuition: For a linear layer y = Wx, if each wᵢⱼ ~ N(0, σ²), then Var(y) = nᵢₙ · σ² · Var(x). To keep Var(y) = Var(x), we need σ² = 1/nᵢₙ. Xavier averages fan-in and fan-out: σ² = 2/(nᵢₙ + nₒᵤₜ).
He initialization: ReLU zeroes out half the values, reducing variance by 2×. He corrects by doubling: σ² = 2/nᵢₙ.
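A quick numerical check of the variance argument above (sizes are arbitrary): with He initialization the activation scale stays roughly constant through a deep ReLU stack, while a naive small init collapses toward zero.

```python
import torch

n_in, n_layers = 512, 20
x = torch.randn(1024, n_in)

h = x
for _ in range(n_layers):
    W = torch.randn(n_in, n_in) * (2.0 / n_in) ** 0.5   # He: σ = √(2/n_in)
    h = torch.relu(h @ W)
print(f"He init after {n_layers} layers: std = {h.std():.3f}")    # stays ~O(1)

h = x
for _ in range(n_layers):
    W = torch.randn(n_in, n_in) * 0.01                  # naive N(0, 0.01)
    h = torch.relu(h @ W)
print(f"Naive init after {n_layers} layers: std = {h.std():.2e}") # collapses to ~0
```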
```python
import torch
import torch.nn as nn

# PyTorch defaults (already correct for common cases):
#   nn.Linear    → Kaiming Uniform
#   nn.Conv2d    → Kaiming Uniform
#   nn.Embedding → Normal(0, 1)

layer = nn.Linear(512, 512)

# Manual init
torch.nn.init.xavier_uniform_(layer.weight)  # for tanh/sigmoid
torch.nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')  # for ReLU
```

BatchNorm normalizes activations at each layer, making the network much less sensitive to initialization. With BatchNorm, you can often get away with less careful initialization. But with LayerNorm (Transformers), initialization still matters.
Backpropagation is the algorithm that computes gradients by applying the chain rule backward through the computation graph.
Forward pass: compute loss L
Backward pass: ∂L/∂w = ∂L/∂z · ∂z/∂w (chain rule)
Key insight: Each layer's gradient depends on the next layer's gradient multiplied by the local Jacobian. This is why deep networks suffer from:
- Vanishing gradients: Products of many small values → gradient ≈ 0 → early layers stop learning. Solved by: ReLU, skip connections (ResNet), layer normalization.
- Exploding gradients: Products of many large values → gradient → ∞ → training diverges. Solved by: gradient clipping, careful initialization.
Gradient clipping: if ||g|| > threshold: g = g * threshold / ||g||
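PyTorch provides exactly this rescaling as a built-in; a self-contained sketch (the tiny model and data are placeholders for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescales g by threshold/||g|| whenever the total gradient norm exceeds 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```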
| Optimizer | Update rule | Key properties | When to use |
|---|---|---|---|
| SGD | w −= η·g | Simple; can generalize better | Image models with tuned LR schedule |
| SGD + Momentum | v = β·v − η·g; w += v | Accelerates in consistent directions; dampens oscillations | Most CV tasks |
| RMSprop | s = β·s + (1−β)·g²; w −= η·g/√(s+ε) | Adaptive per-parameter LR; handles non-stationary objectives | RNNs |
| Adam | Momentum + RMSprop; bias correction | Fast convergence; default choice | Transformers, NLP, general DL |
| AdamW | Adam + decoupled weight decay | Fixes L2 regularization in Adam | BERT, GPT, fine-tuning |
Adam update rule:
```
m = β₁·m + (1-β₁)·g    # 1st moment (momentum)
v = β₂·v + (1-β₂)·g²   # 2nd moment (variance)
m̂ = m/(1-β₁ᵗ)          # bias correction
v̂ = v/(1-β₂ᵗ)          # bias correction
w -= η · m̂/(√v̂ + ε)
```
Default: β₁=0.9, β₂=0.999, ε=1e-8
Adam's L2 problem: Adam scales gradient by 1/√v, so weight decay λw gets divided by √v too — it's not true L2 regularization. AdamW applies weight decay separately: w = w - η·(m̂/√v̂) - η·λ·w.
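The update rule above, transcribed directly into code — a minimal sketch minimizing a toy quadratic (the learning rate and target are arbitrary):

```python
import torch

def adam_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update, mirroring the rule above."""
    m = b1 * m + (1 - b1) * g           # 1st moment
    v = b2 * v + (1 - b2) * g**2        # 2nd moment
    m_hat = m / (1 - b1**t)             # bias correction
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (v_hat.sqrt() + eps)
    return w, m, v

target = torch.tensor([1.0, -2.0, 3.0])
w = torch.zeros(3)
m, v = torch.zeros(3), torch.zeros(3)
for t in range(1, 201):
    g = 2 * (w - target)                # gradient of ||w - target||²
    w, m, v = adam_step(w, g, m, v, t)
print(w)                                # moves toward [1, -2, 3]
```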
Learning rate is often the most important hyperparameter. Start high (explore), end low (refine).
| Schedule | How it works | Use case |
|---|---|---|
| Step Decay | Multiply by γ every N epochs | Simple; requires manual tuning |
| Cosine Annealing | LR = η_min + 0.5(η_max - η_min)(1 + cos(πt/T)) | Standard in modern DL; smooth decline |
| Warmup + Cosine | Linear warmup for K steps, then cosine decay | Transformers — avoids unstable early training |
| Cyclic LR | Oscillate between LR bounds | Escapes local minima; good with SGD |
| One Cycle | Quick rise then slow decline | Fast training; Leslie Smith's approach |
Warmup intuition: At the start, weights are random so gradients are noisy. A high LR would diverge. Warmup slowly increases LR until weights are stable enough to handle it.
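A minimal warmup + cosine schedule as a function of the step count (the peak LR and step counts below are arbitrary example values):

```python
import math

def warmup_cosine_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# e.g. peak LR 3e-4, 1k warmup steps, 100k total steps
for step in [0, 500, 1_000, 50_500, 100_000]:
    print(step, f"{warmup_cosine_lr(step, 3e-4, 1_000, 100_000):.2e}")
```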
Dropout: randomly zero out neuron activations with probability p during training.
```
# Training (inverted dropout)
mask = (random(shape) > p)
h = h * mask / (1 - p)   # scale to maintain expected value

# Inference
h = h                    # no dropout; expected output already correct
```
- Creates implicit ensemble of 2^n subnetworks
- p = 0.5 for hidden layers, 0.1-0.2 for inputs
- Not effective for convolutional layers (use DropBlock instead)
Batch Normalization: normalize each feature over the mini-batch to zero mean and unit variance, then apply a learnable scale (γ) and shift (β).
BN(x) = γ · (x - μ_batch) / √(σ²_batch + ε) + β
Why it helps:
- Reduces internal covariate shift (layer inputs stabilized)
- Acts as regularizer (noise from batch statistics)
- Allows higher learning rates
- Reduces sensitivity to initialization
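A minimal training-mode sketch of the BN formula above for a (batch, features) input; real BatchNorm additionally tracks running statistics for inference:

```python
import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(dim=0)                      # per-feature mean over the batch
    var = x.var(dim=0, unbiased=False)      # per-feature variance over the batch
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta             # learnable scale and shift

x = torch.randn(64, 10) * 5 + 3             # arbitrary shifted/scaled input
gamma, beta = torch.ones(10), torch.zeros(10)
y = batchnorm_forward(x, gamma, beta)
print(y.mean(dim=0).abs().max(), y.std(dim=0).mean())  # ≈ 0 and ≈ 1
```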
BN vs Layer Norm:
| | Batch Norm | Layer Norm |
|---|---|---|
| Normalize over | Batch dimension (per feature) | Feature dimension (per sample) |
| Good for | CNNs, large batches | Transformers, RNNs, small/variable batches |
| Inference | Uses running statistics | No dependency on other samples |
CNNs are designed for spatial data (images, audio spectrograms, 1D signals), exploiting translation invariance and locality.
Output[i,j] = Σ Σ Input[i+m, j+n] · Kernel[m,n]
- Local connectivity: Each neuron sees only a small receptive field
- Weight sharing: Same kernel applied everywhere → drastically fewer parameters
- Strides: Skip pixels to downsample (stride=2 halves spatial dims)
- Padding: `same` padding keeps spatial dimensions; `valid` shrinks them
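The output-size arithmetic in code — out = (in + 2·pad − kernel) // stride + 1; a sketch with arbitrary channel counts:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)        # (batch, channels, H, W)
kernel = torch.randn(8, 3, 3, 3)     # 8 output channels, 3×3 kernel

same = F.conv2d(x, kernel, stride=1, padding=1)
down = F.conv2d(x, kernel, stride=2, padding=1)
print(same.shape)  # torch.Size([1, 8, 32, 32]) — "same" padding preserves dims
print(down.shape)  # torch.Size([1, 8, 16, 16]) — stride 2 halves spatial dims
```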
| Layer | Purpose |
|---|---|
| Conv2D | Learn spatial features |
| MaxPool | Downsample, translation-invariant feature detection |
| GlobalAvgPool | Replace flattening; average entire feature map |
| BatchNorm | Normalize after convolution |
| 1×1 Conv | Change channel depth without spatial operation; "bottleneck" |
Receptive field: the region of input a neuron "sees." Deeper layers have larger receptive fields due to stacked convolutions.
3×3 conv after 3×3 conv → 5×5 effective receptive field (2 layers)
Dilated conv with dilation=2 → expands receptive field without more parameters
| Model | Key innovation | Concept |
|---|---|---|
| AlexNet | ReLU, dropout, GPU training | First modern deep CNN |
| VGG | Very deep (16-19 layers) with small 3×3 filters | Deep but slow |
| ResNet | Skip connections (residual blocks) | Solved vanishing gradients; can train 100+ layers |
| InceptionNet | Parallel branches with different filter sizes | Multi-scale feature extraction |
| EfficientNet | Compound scaling (width × depth × resolution together) | State-of-the-art ImageNet accuracy per parameter count |
ResNet skip connections:
y = F(x, W) + x ← instead of y = F(x, W)
If F = 0, layer becomes identity. Gradient flows directly through shortcut → no vanishing.
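A minimal residual block sketch (assumes the channel count is unchanged so the shapes of F(x) and x match):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x"""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # skip connection: gradient flows through `+ x`

x = torch.randn(2, 64, 16, 16)
print(ResidualBlock(64)(x).shape)    # torch.Size([2, 64, 16, 16])
```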
RNNs process sequences by maintaining a hidden state h:
hₜ = tanh(Wₕ · hₜ₋₁ + Wₓ · xₜ + b)
Problem: vanishing gradients over long sequences. Backpropagated gradients are multiplied by Wₕ at every step: if the largest singular value of Wₕ is below 1, the gradient decays to zero; above 1, it explodes.
The LSTM adds a cell state (c) as a "highway" with gated writes and reads:
```
Forget gate:  f = σ(Wf·[h,x] + b)     ← what to erase from cell
Input gate:   i = σ(Wi·[h,x] + b)     ← what new info to add
Cell update:  c̃ = tanh(Wc·[h,x] + b)
Cell state:   cₜ = f·cₜ₋₁ + i·c̃       ← gated update; gradient highway
Output gate:  o = σ(Wo·[h,x] + b)
Hidden state: hₜ = o · tanh(cₜ)
```
Why it works: Cell state c only undergoes additive updates (not multiplicative), so gradients can flow back through time without vanishing.
GRU: Simplified LSTM with 2 gates instead of 3; fewer parameters; similar performance.
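For reference, the tensor shapes when using PyTorch's built-in LSTM (sizes below are arbitrary):

```python
import torch
import torch.nn as nn

# batch_first=True: input is (batch, seq, features)
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(4, 10, 16)           # 4 sequences of length 10
out, (h_n, c_n) = lstm(x)            # per-step outputs + final (h, c) states
print(out.shape, h_n.shape, c_n.shape)
# torch.Size([4, 10, 32]) torch.Size([1, 4, 32]) torch.Size([1, 4, 32])
```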
For most sequence tasks, Transformers win. But LSTMs are still used in:
- Real-time streaming (low latency, no full context needed)
- Very long sequences where O(n²) attention is too expensive
- On-device / embedded ML
The precursor to Transformers. Instead of forcing all information through a fixed-size hidden state, attention lets the decoder look back at all encoder outputs.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
- Query (Q): What we're looking for (current decoder state)
- Key (K): What each encoder output offers
- Value (V): The actual information to retrieve
The dot product QKᵀ computes similarity scores. Softmax converts to attention weights. Weighted sum of V is the context vector.
Why scale by √dₖ? Large dot products push softmax into saturation (very small gradients). Scaling keeps the distribution well-behaved.
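The full formula as a single-head sketch (shapes below are arbitrary):

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    """softmax(QKᵀ/√d_k)·V"""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5    # similarity scores
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))
    weights = F.softmax(scores, dim=-1)            # attention weights; rows sum to 1
    return weights @ V                             # weighted sum of values

Q = torch.randn(2, 5, 64)    # (batch, query positions, d_k)
K = torch.randn(2, 7, 64)    # (batch, key positions, d_k)
V = torch.randn(2, 7, 64)
print(attention(Q, K, V).shape)   # torch.Size([2, 5, 64])
```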
Multi-head attention: run h parallel attention heads with different learned projections, then concatenate:
MultiHead(Q,K,V) = Concat(head₁,...,headₕ) · Wₒ
where headᵢ = Attention(Q·Wᵢᵠ, K·Wᵢᵏ, V·Wᵢᵛ)
Allows the model to attend to information from different representation subspaces simultaneously.
Train on a large dataset (ImageNet, Common Crawl), then adapt to a specific task.
| Strategy | When to use | How |
|---|---|---|
| Feature extraction | Small target dataset; features generalize well | Freeze all pretrained layers; only train classifier head |
| Fine-tuning (partial) | Medium dataset; lower layers too general | Freeze early layers; unfreeze later layers + head |
| Full fine-tuning | Large target dataset | Unfreeze everything; use small LR (e.g., 1e-5) |
Learning rate tip: Use a smaller LR for pretrained layers, larger for new head ("discriminative LR"). This avoids overwriting useful pretrained representations.
Early layers → edges, textures (universal; keep frozen)
Middle layers → shapes, parts (sometimes fine-tune)
Late layers → task-specific features (fine-tune)
Head → replace entirely for new task
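Discriminative learning rates fall out naturally from optimizer parameter groups; a sketch assuming a torchvision ResNet-18 as the pretrained backbone (the LRs and class count are example values):

```python
import torch
from torchvision import models

model = models.resnet18(weights='IMAGENET1K_V1')
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # new head for 10 classes

backbone = [p for name, p in model.named_parameters() if not name.startswith('fc')]
optimizer = torch.optim.AdamW([
    {'params': backbone, 'lr': 1e-5},                  # small LR: pretrained layers
    {'params': model.fc.parameters(), 'lr': 1e-3},     # larger LR: new head
])
```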
| Loss | Formula | Use case |
|---|---|---|
| Binary CE | −y·log(p) − (1−y)·log(1−p) | Binary classification |
| Categorical CE | −Σ yᵢ·log(pᵢ) | Multiclass classification |
| Focal Loss | −α(1−p)^γ · log(p) | Class imbalance (object detection) |
| MSE | (y−ŷ)² | Regression; sensitive to outliers |
| Huber Loss | MSE for small errors, MAE for large | Robust regression |
| Contrastive Loss | Pull same-class embeddings together, push different apart | Siamese networks |
| Triplet Loss | anchor–positive distance < anchor–negative distance by margin | Metric learning, face recognition |
Focal Loss intuition: γ > 0 down-weights easy examples (1-p)^γ → model focuses on hard examples. Used in RetinaNet for object detection with severe foreground/background imbalance.
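A minimal binary focal loss sketch (for brevity this uses a single constant α; RetinaNet weights positives by α and negatives by 1−α):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """−α(1−p_t)^γ · log(p_t)"""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)   # probability of the true class
    ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction='none')
    return (alpha * (1 - p_t) ** gamma * ce).mean()  # (1−p_t)^γ down-weights easy examples

logits = torch.tensor([3.0, -3.0, 0.1])    # confident-right, confident-wrong, unsure
targets = torch.tensor([1, 1, 0])
print(focal_loss(logits, targets))
```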
Generative models learn the underlying data distribution P(x) and can generate new samples. They are among the most commonly asked deep learning interview topics.
Autoencoders compress data into a low-dimensional latent space, then reconstruct it. The bottleneck forces the network to learn a compressed representation.
Input x → [Encoder] → Latent z (bottleneck) → [Decoder] → Reconstructed x̂
Loss = ||x - x̂||² (reconstruction error)
| Variant | Key difference | Use case |
|---|---|---|
| Vanilla AE | Deterministic bottleneck | Dimensionality reduction, denoising |
| Denoising AE | Add noise to input, reconstruct clean version | Robust feature learning |
| Sparse AE | L1 penalty on latent activations | Feature discovery with sparsity |
| Variational AE (VAE) | Latent space is a probability distribution | Generation, interpolation |
VAE: instead of encoding to a fixed point, encode to a distribution (mean μ and variance σ²). Sample from this distribution to decode.
Input x → Encoder → (μ, σ²) → z = μ + σ·ε (reparameterization trick) → Decoder → x̂
Loss = Reconstruction Loss + KL Divergence
= ||x - x̂||² + KL(q(z|x) || p(z))
Reparameterization trick: To backpropagate through sampling, express z = μ + σ·ε where ε ~ N(0,1). Gradients flow through μ and σ, not through the sampling operation.
KL Divergence term: Regularizes the latent space to be close to N(0,1), ensuring smooth interpolation between data points.
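The reparameterization trick and the two-term loss in code — a minimal sketch; the encoder and decoder networks are assumed to exist elsewhere:

```python
import torch

def reparameterize(mu, logvar):
    """z = μ + σ·ε with ε ~ N(0,1): gradients flow through μ and σ, not the sampling."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x, x_hat, mu, logvar):
    """Reconstruction + KL(q(z|x) || N(0, I)), using the closed-form Gaussian KL."""
    recon = torch.nn.functional.mse_loss(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```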
Two networks in a minimax game:
- Generator G: Takes random noise z → generates fake data G(z)
- Discriminator D: Classifies inputs as real or fake
```
Random noise z → [Generator G] → Fake image G(z) ─┐
Real image x ─────────────────────────────────────┤→ [Discriminator D] → Real/Fake
```
min_G max_D E[log D(x)] + E[log(1 − D(G(z)))]
Training dynamics:
- Train D to distinguish real from fake (maximize correct classification)
- Train G to fool D (minimize D's ability to tell fake from real)
- At equilibrium, G produces data indistinguishable from real
Common GAN problems:
| Problem | What happens | Fix |
|---|---|---|
| Mode collapse | Generator produces limited variety of outputs | Wasserstein loss, minibatch discrimination |
| Training instability | Loss oscillates, never converges | Spectral normalization, progressive growing |
| Vanishing gradients for G | D becomes too good, G gradient → 0 | Wasserstein GAN (WGAN), gradient penalty |
Key GAN variants:
- DCGAN: Convolutional architecture, first stable GAN for images
- WGAN: Uses Wasserstein distance, much more stable training
- StyleGAN: State-of-the-art face generation, style-based generator
- Conditional GAN (cGAN): Condition generation on a label or image (pix2pix)
The current state-of-the-art for image generation (DALL-E, Stable Diffusion, Midjourney).
Forward process (fixed): x₀ → x₁ → x₂ → ... → xₜ (gradually add Gaussian noise)
Reverse process (learned): xₜ → xₜ₋₁ → ... → x₀ (learn to denoise step by step)
Key intuition: Training is simple — add noise to data, train a neural network (U-Net) to predict and remove the noise. At generation time, start from pure noise and iteratively denoise.
Why diffusion beats GANs:
- No adversarial training instability
- No mode collapse — learns the full distribution
- Better sample quality and diversity
- Easier to train (simple MSE loss on noise prediction)
Trade-off: Diffusion is slow at inference (many denoising steps). Solutions: DDIM (fewer steps), latent diffusion (work in compressed space), distillation.
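The "training is simple" claim, as a sketch of one DDPM-style training step. Here `model(x_t, t)` is an assumed noise-prediction network (e.g. a U-Net), and the β schedule is a simplified linear one:

```python
import torch

def diffusion_training_step(model, x0, T=1000):
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1 - betas, dim=0)      # ᾱ_t

    t = torch.randint(0, T, (x0.size(0),))           # random timestep per sample
    a = alpha_bar[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise     # forward process, closed form
    return torch.nn.functional.mse_loss(model(x_t, t), noise)  # simple MSE on noise
```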
| | VAE | GAN | Diffusion |
|---|---|---|---|
| Training | Stable | Unstable | Stable |
| Sample quality | Blurry | Sharp | Best |
| Diversity | Good | Mode collapse risk | Best |
| Speed | Fast | Fast | Slow |
| Latent space | Smooth, interpretable | No clear latent space | No clear latent space |
| Use case | Representation learning, interpolation | Image synthesis, style transfer | Image/video generation, art |
Knowledge distillation transfers knowledge from a large "teacher" model to a smaller "student" model. The student learns to match the teacher's output distribution (soft labels), not just the hard labels.
```
Input x → Teacher (large, frozen)    → soft predictions (logits / T)
        → Student (small, trainable) → soft predictions (logits / T)

Loss = α · KL(teacher_soft, student_soft) + (1-α) · CE(hard_labels, student_hard)
```
Why soft labels help: A teacher predicting [cat: 0.7, dog: 0.2, bird: 0.1] conveys richer information than the hard label "cat" — it tells the student that this image somewhat resembles a dog. This "dark knowledge" helps the student generalize better.
Temperature T: Higher T softens probability distribution, making the dark knowledge more visible. Typical T = 3-20 during distillation; T = 1 at inference.
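The loss above in code — a minimal sketch; the T² factor (from the original distillation paper) keeps gradient magnitudes comparable across temperatures:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """KL between temperature-softened distributions + CE on hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),   # student: log-probs
        F.softmax(teacher_logits / T, dim=-1),       # teacher: probs
        reduction='batchmean',
    ) * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```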
| Scenario | Teacher → Student | Why |
|---|---|---|
| Model compression | BERT → DistilBERT | 40% smaller, 97% of performance, 60% faster |
| LLM distillation | GPT-4 → small fine-tuned model | Cost reduction for specific tasks |
| Ensemble → single model | 5 models → 1 model | Production simplicity |
| Self-distillation | Model → same architecture | Born-Again Networks; improves over original |
Interview tip: Distillation is how most production LLM applications work — you don't run GPT-4 for every request. You distill its behavior for your specific task into a smaller, cheaper model.
Artificially expand training data by applying label-preserving transformations. Often more impactful than model improvements.
| Technique | Transform | Effect |
|---|---|---|
| Random crop + resize | Crop random region, resize to original | Translation invariance; most impactful single augmentation |
| Horizontal flip | Mirror image | Doubles data; don't use for asymmetric tasks (text OCR) |
| Color jitter | Random brightness, contrast, saturation, hue | Robustness to lighting conditions |
| Random erasing / Cutout | Black out random patches | Forces model to use all features, not just dominant ones |
| Mixup | Blend two images and their labels: x̃ = λx₁ + (1−λ)x₂ | Smoother decision boundaries; reduces overfitting (see the sketch after this table) |
| CutMix | Paste a patch from one image onto another; blend labels | Combines benefits of Cutout and Mixup |
| RandAugment | Random sequence of N augmentations from a pool | AutoML-style; single hyperparameter M (magnitude) |
| AutoAugment | RL-searched augmentation policies | Best quality; expensive to find |
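A minimal Mixup sketch, as referenced in the table above (assumes one-hot labels; α = 0.2 is a common default):

```python
import torch

def mixup(x, y_onehot, alpha=0.2):
    """x̃ = λx₁ + (1−λ)x₂ with λ ~ Beta(α, α); labels blended the same way."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))               # pair each sample with another
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed
```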
| Technique | How | Use case |
|---|---|---|
| Synonym replacement | Replace words with synonyms | Simple; low risk |
| Back-translation | Translate to French → back to English | Paraphrasing; maintains meaning |
| Random insertion / deletion | Add or remove random words | Robustness |
| EDA (Easy Data Augmentation) | Combine above techniques | Small datasets |
| LLM-based augmentation | Use GPT to generate paraphrases | Highest quality; expensive |
Key principle: Augmentations must be label-preserving. Flipping a chest X-ray horizontally would swap left/right lungs — a medical error.
Learn representations from unlabeled data by creating pretext tasks. The model learns features that transfer well to downstream tasks, eliminating the need for expensive labels.
```
Image → Augmentation 1 → Encoder → z₁ ─┐
      → Augmentation 2 → Encoder → z₂ ─┤→ Pull z₁, z₂ together (positive pair)
                                       └→ Push z₁, z_other apart (negative pairs)
```
SimCLR loss (NT-Xent): Maximize agreement between different views of the same image, minimize agreement with other images in the batch.
Key insight: Large batch sizes are critical (SimCLR uses 4096+) because you need many negative pairs. MoCo avoids this by maintaining a momentum-updated queue of negatives.
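A minimal NT-Xent sketch: for 2N embeddings (two views of N images), each embedding's positive is its counterpart and the remaining 2N−2 embeddings are negatives. Sizes below are arbitrary:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    z = F.normalize(torch.cat([z1, z2]), dim=1)     # (2N, d), unit norm
    sim = z @ z.t() / tau                           # cosine similarities / temperature
    sim.fill_diagonal_(float('-inf'))               # exclude self-similarity
    N = z1.size(0)
    # positive of i is i+N (and vice versa); treat as a 2N-way classification
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(0, N)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)   # two augmented views
print(nt_xent(z1, z2))
```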
Learn without explicit negative pairs. BYOL uses a momentum-updated target network:
```
Online: Image → augment → encoder → predictor → output
Target: Image → augment → momentum_encoder → target

Loss = ||output - target||²   (target is stop-gradient)
```
Why it doesn't collapse: The asymmetry between online (with predictor) and target (momentum-updated) networks prevents the trivial solution of mapping everything to the same vector.
MAE (Masked Autoencoder): inspired by BERT's masked language modeling, but for images:
Image → mask 75% of patches → Encoder (on visible patches) → Decoder → reconstruct masked patches
Extremely data-efficient; works with standard ViT architecture.
- Pre-training vision models when labeled data is scarce (medical imaging, satellite)
- Pre-training language models (BERT, GPT — these ARE self-supervised)
- Embedding learning for retrieval and similarity search
- Foundation models that transfer to many downstream tasks
Larger batches → more stable gradients but:
- Often worse generalization (sharp minima in loss landscape)
- Higher LR needed to compensate (linear scaling rule: LR ∝ batch_size)
- Memory limitations on GPU
Linear scaling rule (Goyal et al.): When multiplying batch size by k, multiply LR by k. But this breaks for very large batches → use warmup.
Store weights in FP32 but compute forward/backward passes in FP16:
| | FP32 | FP16 | BF16 |
|---|---|---|---|
| Memory | 4 bytes | 2 bytes | 2 bytes |
| Dynamic range | Wide | Narrow (overflow risk) | Wide |
| Precision | High | Lower (underflow risk) | Lower |
| Use case | Master weights | Computation | Preferred on TPUs and Ampere+ GPUs |
Gradient scaling: Multiply loss by large factor before backward pass to prevent FP16 underflow; divide gradients back before optimizer step.
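The standard PyTorch AMP loop implements exactly this scheme: `autocast` runs compute in reduced precision while `GradScaler` handles the loss scaling. A sketch (tiny model and data are placeholders; requires a CUDA GPU):

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 512, device='cuda')
with torch.cuda.amp.autocast():          # forward in FP16/BF16
    loss = model(x).square().mean()
scaler.scale(loss).backward()            # multiply loss by scale factor
scaler.step(optimizer)                   # unscale gradients, then step
scaler.update()                          # adjust scale factor over time
optimizer.zero_grad()
```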
| | Data Parallelism | Model Parallelism |
|---|---|---|
| What | Each GPU gets a full model; data is split | Model is split across GPUs |
| When | Model fits in single GPU | Model too large for one GPU (LLMs) |
| Sync | Gradient averaging (AllReduce) | Pipeline or tensor parallelism |
When the desired batch size doesn't fit in GPU memory, accumulate gradients over multiple mini-batches before updating weights:
```
Effective batch size = mini_batch_size × accumulation_steps

Step 1: Forward + backward (mini-batch 1) → accumulate gradients
Step 2: Forward + backward (mini-batch 2) → accumulate gradients
Step 3: Forward + backward (mini-batch 3) → accumulate gradients
Step 4: optimizer.step()      → update weights with averaged gradients
        optimizer.zero_grad() → reset
```
This achieves the same gradient as a large batch without the memory cost. Trade-off: slower training (more forward/backward passes per update), but mathematically equivalent to large-batch training.
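As a loop — a sketch where `model`, `optimizer`, `loss_fn`, and `loader` are assumed to exist; dividing each loss by the accumulation count makes the summed gradients an average:

```python
accum_steps = 4
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps   # scale so gradients average
    loss.backward()                             # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                        # update with the averaged gradient
        optimizer.zero_grad()
```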
Standard multi-head attention uses separate K, V projections per head. This creates a massive KV cache during inference.
| Method | K/V heads | KV cache size | Quality | Used by |
|---|---|---|---|---|
| Multi-Head (MHA) | h heads each | 100% (baseline) | Best | GPT-3, BERT |
| Multi-Query (MQA) | 1 shared K/V | 1/h of baseline | Slight drop | PaLM, Falcon |
| Grouped Query (GQA) | g groups share K/V | g/h of baseline | Near MHA | LLaMA-2 70B, Mistral |
```
MHA: Q₁→K₁V₁, Q₂→K₂V₂, Q₃→K₃V₃, Q₄→K₄V₄  (4 separate KV)
GQA: Q₁→K₁V₁, Q₂→K₁V₁, Q₃→K₂V₂, Q₄→K₂V₂  (2 shared groups)
MQA: Q₁→K₁V₁, Q₂→K₁V₁, Q₃→K₁V₁, Q₄→K₁V₁  (1 shared KV)
```
Why GQA is the current standard: MQA saves the most memory but hurts quality. GQA with g = h/4 or h/8 preserves nearly all quality while cutting KV cache by 4-8×. Critical for serving large models efficiently.
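A shape-level GQA sketch: K/V are stored with only a few heads (small cache) and expanded to match the query heads at attention time. Head counts below are arbitrary examples:

```python
import torch

batch, seq, d_head = 2, 16, 64
n_q_heads, n_kv_heads = 8, 2

q = torch.randn(batch, n_q_heads, seq, d_head)
k = torch.randn(batch, n_kv_heads, seq, d_head)    # only 2 KV heads cached
v = torch.randn(batch, n_kv_heads, seq, d_head)

group = n_q_heads // n_kv_heads                    # 4 query heads per KV group
k = k.repeat_interleave(group, dim=1)              # expand to 8 heads for the matmul
v = v.repeat_interleave(group, dim=1)
scores = (q @ k.transpose(-2, -1)) / d_head**0.5
out = torch.softmax(scores, dim=-1) @ v            # (2, 8, 16, 64)
```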
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss doesn't decrease | LR too low, wrong loss, data bug | Overfit single batch first |
| Loss explodes | LR too high, gradient explode | Reduce LR, clip gradients, check input scale |
| Train loss low, val loss high | Overfitting | Dropout, regularization, more data |
| Both losses plateau early | LR too high; landed in flat region | LR warmup/decay, Adam |
| Validation loss oscillates | LR too high | Reduce LR, use scheduler |
| Dead neurons | ReLU dying | Leaky ReLU, check init, lower LR |
| Slow convergence | Vanishing gradient | BatchNorm, skip connections, better init |
The overfit-single-batch test: Before full training, overfit a single mini-batch. If train loss doesn't reach ~0, there's a bug in model/loss/data — fix this first.
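The test as a loop — `model`, `loss_fn`, and `loader` are whatever pipeline you're debugging; the key is reusing one fixed batch:

```python
import torch

x, y = next(iter(loader))                      # grab ONE batch and reuse it
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
for step in range(500):
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, loss.item())               # should approach ~0; if not, find the bug
```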
The Transformer is the most important architecture in modern deep learning. It powers every major LLM, vision model (ViT), and multimodal model. Unlike RNNs, it processes the entire sequence in parallel, making it highly GPU-efficient.
```
┌─────────────────────────────────────────┐
│ TRANSFORMER BLOCK (×N) │
│ │
Input tokens │ ┌──────────┐ │
│ │ │ LayerNorm │ ← Pre-norm (modern) │
▼ │ └────┬─────┘ │
[Token Embedding] │ ▼ │
+ │ ┌──────────────────────┐ │
[Positional Enc] │ │ Multi-Head │ │
│ │ │ Self-Attention │ │
└──────────────► │ └──────────┬───────────┘ │
│ │ │
│ ┌─────┴─────┐ │
│ │ Residual │ x + Attn(x) │
│ └─────┬─────┘ │
│ ▼ │
│ ┌──────────┐ │
│ │LayerNorm │ │
│ └────┬─────┘ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Feed-Forward Network │ │
│ │ Linear → GELU → Linear│ │
│ └──────────┬───────────┘ │
│ │ │
│ ┌─────┴─────┐ │
│ │ Residual │ x + FFN(x) │
│ └─────┬─────┘ │
└─────────────┼───────────────────────────┘
│ (repeat N times)
▼
[LayerNorm]
│
[Linear → Softmax]
│
Output probabilities
```
Given input X ∈ R^(seq_len × d_model):
```
1. Project:         Q = X·Wᵠ,  K = X·Wᵏ,  V = X·Wᵛ
                    (each W ∈ R^(d_model × d_head))
2. Compute scores:  S = Q·Kᵀ / √d_head
                    (S ∈ R^(seq_len × seq_len) — the attention matrix)
3. Mask (decoder only): set future positions to -∞
4. Normalize:       A = softmax(S)
                    (A ∈ R^(seq_len × seq_len) — attention weights, each row sums to 1)
5. Weighted sum:    Output = A · V
                    (Output ∈ R^(seq_len × d_head))
```
Why divide by √d_head?
Dot products grow with dimensionality: for zero-mean, unit-variance components, Var(q·k) = d_head, so typical magnitudes grow like √d_head — 8× for d_head = 64. Large values push softmax into saturation → near-zero gradients. Dividing by √d_head brings values back to unit scale.
Attention matrix intuition:
- Row `i` represents "what does token i attend to?"
- Column `j` represents "how much do all tokens attend to token j?"
- `A[i,j]` = attention weight from token i to token j
```
MultiHead(Q,K,V) = Concat(head₁,...,headₕ) · Wₒ
headᵢ = Attention(Q·Wᵢᵠ, K·Wᵢᵏ, V·Wᵢᵛ)

Dimensions:
  d_model = 512, h = 8 heads, d_head = 64 each
  Parameters per head: 3 × (d_model × d_head)
  Output projection Wₒ: (h·d_head) × d_model = d_model × d_model
```
Different heads learn to attend to different types of relationships simultaneously:
- Head 1: syntactic agreement (subject-verb)
- Head 2: coreference ("it" → "the cat")
- Head 3: positional adjacency
- Head 4: semantic similarity
In language modeling (GPT-style), token at position t must not see future positions. Achieved by masking the upper triangle of the attention matrix with -∞ before softmax:
```
Attention mask for seq_len = 4:
      t₁   t₂   t₃   t₄
t₁ [  0   -∞   -∞   -∞ ]
t₂ [  0    0   -∞   -∞ ]
t₃ [  0    0    0   -∞ ]
t₄ [  0    0    0    0 ]

After softmax: upper triangle becomes 0 → no future leakage
```
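Building that mask in code (seq_len = 4 as above; the scores are random placeholders for QKᵀ/√d):

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)                        # stands in for QKᵀ/√d
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float('-inf'))              # hide future positions
weights = torch.softmax(scores, dim=-1)                       # upper triangle → 0
print(weights)                                                # row t attends only to ≤ t
```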
Sinusoidal (original paper):
```
PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
```
The frequency decreases with dimension index — high dims encode coarse position, low dims encode fine position. Allows the model to compute relative positions via linear combinations: PE(pos+k) is a linear function of PE(pos).
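The two formulas, transcribed directly (sizes below are arbitrary):

```python
import torch

def sinusoidal_pe(max_len, d_model):
    """PE(pos, 2i) = sin(pos/10000^(2i/d)), PE(pos, 2i+1) = cos(pos/10000^(2i/d))."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    two_i = torch.arange(0, d_model, 2).float()       # 2i = 0, 2, 4, ...
    angle = pos / (10000 ** (two_i / d_model))        # frequency falls with dim index
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

print(sinusoidal_pe(128, 512).shape)   # torch.Size([128, 512])
```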
RoPE (Rotary Position Embedding) — current standard: Instead of adding position to embeddings, rotate Q and K vectors by an angle proportional to position before computing the dot product:
(Rₘ·qₘ) · (Rₙ·kₙ)ᵀ depends only on (m − n) — the relative distance
Why RoPE is preferred:
- Attention scores naturally depend on relative positions, not absolute
- Enables context length extension (YaRN, dynamic NTK scaling)
- No additional parameters
FFN(x) = W₂ · GELU(W₁x + b₁) + b₂
Typical dimensions:
d_model = 4096 (LLaMA-7B)
d_ff = 11008 ≈ (2/3) × 4 × 4096, rounded up to a hardware-friendly multiple (SwiGLU variant)
The FFN constitutes ~2/3 of total parameters. Research suggests it acts as a key-value memory store — different neurons fire for different input patterns.
SwiGLU (modern variant, used in LLaMA):
SwiGLU(x, W, V) = Swish(xW) ⊙ (xV)
Gating mechanism that selectively activates features. Outperforms standard GELU in practice.
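A minimal LLaMA-style SwiGLU FFN sketch (biases omitted as in LLaMA; d_ff below is arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """down(Swish(gate(x)) ⊙ up(x))"""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)   # xW
        self.up = nn.Linear(d_model, d_ff, bias=False)     # xV
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))   # Swish = SiLU

x = torch.randn(2, 10, 512)
print(SwiGLUFFN(512, 1376)(x).shape)   # torch.Size([2, 10, 512])
```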
| | Post-Norm (original) | Pre-Norm (modern) |
|---|---|---|
| Where | After residual: LayerNorm(x + F(x)) | Before sublayer: x + F(LayerNorm(x)) |
| Training | Harder to train (gradients pass through LayerNorm) | More stable; standard in modern LLMs |
| Performance | Slightly better when it trains well | Usually fine with warmup |
Almost all modern LLMs (LLaMA, Mistral, GPT-NeoX) use Pre-Norm with RMSNorm (simpler than LayerNorm, no mean subtraction).
| Architecture | Attention type | Pre-training | Best for |
|---|---|---|---|
| Encoder only (BERT) | Bidirectional self-attention | Masked LM | Classification, NER, embeddings |
| Decoder only (GPT) | Causal (masked) self-attention | Causal LM | Text generation, chat |
| Encoder-Decoder (T5) | Bidirectional encoder + causal decoder + cross-attention | Seq2Seq denoising | Translation, summarization, QA |
Cross-attention in encoder-decoder: The decoder Q attends to the encoder K/V — each decoder position can look at the full input sequence:
Cross-Attention: Q = decoder state, K = V = encoder output
| Component | Time | Space | Note |
|---|---|---|---|
| Self-attention | O(n²·d) | O(n²) | Quadratic in sequence length! |
| FFN | O(n·d²) | O(d²) | Linear in sequence length |
| Embedding | O(n·d) | O(V·d) | V = vocabulary size |
The quadratic bottleneck: For n=32K tokens, n² = 10⁹ attention scores. Solutions:
- FlashAttention: Tiled GPU computation, avoids materializing full matrix → same output, O(n) memory
- Sliding window attention (Mistral): Each token attends to W neighbors only → O(n·W)
- Linear attention: Approximate kernelized attention → O(n)
ViT splits an image into 16×16 patches, projects each patch to d_model, adds positional encoding, and feeds the resulting token sequence into a standard Transformer encoder:
224×224 image → 14×14 = 196 patches → 197 tokens (+ [CLS]) → Transformer encoder → classify
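A sketch of that patching pipeline; using a stride-16 convolution to split and project the patches in one operation is the standard trick:

```python
import torch
import torch.nn as nn

d_model, patch = 768, 16
to_patches = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

img = torch.randn(1, 3, 224, 224)
tokens = to_patches(img).flatten(2).transpose(1, 2)   # (1, 196, 768): 14×14 patches
cls = torch.zeros(1, 1, d_model)                      # learnable [CLS] in practice
tokens = torch.cat([cls, tokens], dim=1)              # (1, 197, 768)
print(tokens.shape)
```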
Key insight: with enough data, ViT matches or beats CNNs. CNNs have inductive bias (translation equivariance) baked in; ViTs learn it from data.
"Explain vanishing gradients" → In deep networks, gradients are products of many partial derivatives. If each < 1 (sigmoid/tanh saturated), the product → 0 and early layers don't learn. Fix: ReLU, skip connections, BatchNorm, gradient clipping.
"Why does BatchNorm help training?" → Stabilizes layer input distributions (internal covariate shift), acts as regularizer, allows higher LR, reduces sensitivity to initialization. Use LayerNorm for Transformers (variable batch/sequence lengths).
"Adam vs SGD — when would you pick SGD?" → Adam converges faster and is the default. But SGD with momentum + tuned LR schedule often achieves better final accuracy for large-scale vision tasks (CIFAR, ImageNet). Adam can over-adapt and find sharp minima.
"How does ResNet solve the degradation problem?" → Skip/residual connections let gradients flow directly to earlier layers without going through weight matrices. The network only needs to learn the residual F(x), which is easier than learning H(x) from scratch.
"Difference between Dropout and BatchNorm?" → Dropout adds noise to activations (regularization by masking). BatchNorm normalizes activations and adds learnable scale/shift (training stability + mild regularization). They're complementary; using both is common.
"Why use He initialization over Xavier for ReLU?" → Xavier assumes symmetric activations around zero. ReLU kills half the inputs (x < 0 → 0), halving the effective variance. He doubles the variance to compensate: σ² = 2/nᵢₙ. Using Xavier with ReLU leads to vanishing activations in deep networks.
"Explain the Transformer's self-attention complexity" → For a sequence of n tokens and model dimension d, self-attention is O(n²·d) time and O(n²) space — quadratic in sequence length. This is the core bottleneck for long contexts. FlashAttention uses tiled GPU SRAM computation to achieve O(n) memory while computing the same result. Sliding window attention limits each token to W neighbors: O(n·W).
"Why does multi-head attention work better than single-head?" → Single head of dimension d gives one "view" of token relationships. h heads of dimension d/h capture h different relationship types simultaneously (syntax, semantics, coreference, etc.) with the same total parameter count. The concatenation then mixes all these views via Wₒ.
"Pre-Norm vs Post-Norm — what changed and why?" → Original Transformer used Post-Norm (LayerNorm after residual), which is harder to train as gradients must flow through LayerNorm. Modern LLMs use Pre-Norm (LayerNorm before sublayer), which gives gradients a clean residual pathway and makes training more stable, especially at scale. Most use RMSNorm (no mean subtraction, faster) instead of full LayerNorm.
"Explain GANs and their training challenges" → Two networks in a minimax game: Generator creates fake data, Discriminator classifies real vs fake. Main challenges: mode collapse (generator produces limited variety), training instability (loss oscillates), vanishing gradients for the generator when the discriminator is too good. Fixes: Wasserstein loss (WGAN), spectral normalization, progressive growing. Diffusion models have largely replaced GANs for image generation due to more stable training and better diversity.
"What is knowledge distillation?" → Train a small student model to match a large teacher model's soft predictions (logits / temperature). Soft labels carry "dark knowledge" — a cat image that somewhat resembles a dog teaches more than the hard label "cat" alone. Temperature T > 1 softens the distribution to expose this knowledge. Used to create DistilBERT (60% of BERT's size, 97% performance), and to distill GPT-4 behavior into smaller task-specific models.
"What is GQA and why is it important?" → Grouped Query Attention shares K/V projections across groups of query heads instead of having unique K/V per head. LLaMA-2 70B uses 8 KV groups for 64 query heads, reducing KV cache by 8×. It's the sweet spot between Multi-Head Attention (best quality, huge KV cache) and Multi-Query Attention (smallest KV cache, quality drop). Critical for efficient LLM serving.
"How does self-supervised learning work?" → Learn representations from unlabeled data via pretext tasks. Contrastive methods (SimCLR): pull augmented views of the same image together, push different images apart. Masked methods (MAE): mask 75% of image patches, train to reconstruct. Language models (BERT, GPT) are inherently self-supervised — predicting masked/next tokens. Self-supervised pre-training enables foundation models that transfer to many downstream tasks with minimal labeled data.