Most interview resources are either too scattered or too theoretical. This repo is organized around three practical tracks:
| Track | Best for | Start here |
|---|---|---|
| Core System Design | Distributed systems, cloud/platform, APIs, storage, scaling | docs/ |
| AI & Machine Learning | ML system design, agents, classic ML, deep learning, LLMs | docs/machine-learning/README.md |
| Reference & Practice Appendix | Templates, cheat sheets, LeetCode patterns, LLD | docs/reference/README.md |
Use the interactive site when you want navigation, quiz mode, and progress tracking. Use the Markdown docs when you want dense references you can skim before an interview.
| Feature | Description |
|---|---|
| 🎯 48 interview-ready topics | Core system design, cloud/platform, AI/ML, security, and interview reference material |
| 🌙 Dark / Light mode | Persisted preference, instant toggle with d |
| ✅ Progress tracking | Mark topics as read. Your progress saves locally. |
| 🔖 Bookmarks | Save topics to revisit. Accessible from any page. |
| 🃏 Quiz / Flashcard mode | Randomized flashcard review across all 48 topics |
| 📖 Inline reader | Read every topic without leaving the page — with prev/next navigation |
| ⌨️ Keyboard-first | / search, q quiz, b bookmarks, ? shortcuts |
| 📊 Visual progress bar | See your overall completion at a glance |
| 🗺️ 3 learning paths | Beginner, Mid-Level, and Advanced tracks |
| 🔍 Live search | Searches title, category, summary, and tags |
| 🎨 Category color coding | Every domain has its own visual identity |
| 🚀 Zero setup | Open in browser. No install. No build step. |
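The live search row above says queries are matched against title, category, summary, and tags. A minimal sketch of that kind of filter follows. It assumes a hypothetical record shape and a hypothetical `searchTopics` helper, and it is not taken from `site/app.js`, whose actual implementation may differ.

```javascript
// Hypothetical topic records; the real registry in site/topics.js may use different fields.
const topics = [
  {
    title: "Caching Deep Dive",
    category: "Data Storage",
    summary: "Cache layers, eviction, invalidation",
    tags: ["cache", "redis"],
  },
  {
    title: "Rate Limiting In Depth",
    category: "API & Networking",
    summary: "Algorithms compared, distributed implementation",
    tags: ["token-bucket", "redis"],
  },
];

// Return the topics whose title, category, summary, or any tag
// contains the query as a case-insensitive substring.
function searchTopics(query, items) {
  const needle = query.trim().toLowerCase();
  if (needle === "") return items; // empty query: show everything
  return items.filter((t) =>
    [t.title, t.category, t.summary, ...t.tags].some((field) =>
      field.toLowerCase().includes(needle)
    )
  );
}

// Example: list the titles that match a tag shared by both records.
console.log(searchTopics("redis", topics).map((t) => t.title));
```

A plain substring match keeps the sketch dependency-free; fuzzy or ranked matching would need more than this.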
🟠 Foundation (4)
- 📐 The System Design Interview Framework — 4-step universal structure: Clarify → Estimate → Design → Deep Dive
- 🔢 Numbers Every Engineer Must Know — Latency hierarchy, scale reference points, back-of-envelope formulas
- 💾 IO Fundamentals: Read vs Write — Latency hierarchy, random vs sequential access, OS page cache, write amplification
- 🔌 Networking & Concurrency — TCP vs UDP, HTTP/1.1 vs HTTP/2 vs HTTP/3 (QUIC), event loop, goroutines
🟣 Data Storage (5)
- 🗄️ Database Selection Guide — SQL vs NoSQL tension, 7 database types with when-to-use decision matrix
- ⚡ Caching Deep Dive — 5 cache layers, read/write patterns, eviction, cache invalidation strategies
- 📨 Message Queues & Event Streaming — Queue vs Kafka event log, delivery guarantees, DLQ, outbox pattern
- 🌐 Storage & CDN — Object/block/file storage, CDN pull vs push, cache invalidation
- 🔩 Database Internals — B-tree vs LSM, indexes, replication, CDC, sharding, ACID vs BASE, isolation levels
🔵 API & Networking (4)
- 🔌 API Design & API Gateway — REST vs gRPC vs GraphQL, gateway responsibilities, rate limiting algorithms
- ⚖️ Load Balancing & Networking — L4 vs L7, round-robin/least-connections/consistent hashing, health checks
- 🔴 Real-time Communication — Polling, SSE, WebSockets compared; scaling stateful WS servers with Redis pub/sub
- 🚦 Rate Limiting In Depth — Every algorithm compared, distributed Redis implementation, failure modes
☁️ Cloud & Platform (5)
- ☁️ Cloud Fundamentals & Shared Responsibility — Regions, availability zones, managed services, shared responsibility, environment boundaries
- 🖥️ Compute & Deployment Patterns — VMs vs containers vs Kubernetes vs serverless, autoscaling, canary/blue-green rollout
- 🌍 Cloud Networking & Traffic Management — VPCs, subnets, DNS, CDN/WAF, API gateways, service-to-service traffic
- 🪪 IAM, Secrets & Governance — Least privilege, workload identity, secret rotation, KMS, audit and guardrails
- 📉 Reliability, Observability & Cost — Multi-AZ vs multi-region, RTO/RPO, SLOs, budget alarms, cost-aware scaling
🟢 Distributed Systems (5)
- 🌐 Distributed System Fundamentals — CAP, consistency models, consistent hashing, Saga vs 2PC, quorum, vector clocks
- 🔄 Core Design Patterns — Fan-out (social feed), CQRS, event sourcing, outbox pattern, inventory contention
- 🧱 Microservices vs Monolith — When to decompose, service discovery, sync vs async communication
- 🛡️ Resilience Patterns — Timeouts, retries + jitter, circuit breaker, fallbacks, backpressure, load shedding
- 🔒 Distributed Locking — Why local locks fail, Redis Redlock, fencing tokens
🟡 Search & Analytics (4)
- 🔍 Search & Typeahead Systems — Inverted index, prefix trie autocomplete, relevance ranking (TF-IDF, BM25)
- 📊 Stream Processing & Top-K Systems — Count-Min Sketch, Lambda vs Kappa architecture, Flink, windowing
- 📍 Geo & Location Systems — Geohash, quadtree, proximity queries, Uber-style driver matching
- 🎲 Probabilistic Data Structures — Bloom filter, HyperLogLog, Count-Min Sketch at massive scale
🟩 Scale & Reliability (6)
- 📡 Observability & Monitoring — Metrics, logs, traces (three pillars), SLOs, error budgets, OpenTelemetry
- 📈 High Availability & Auto Scaling — Active-passive vs active-active, autoscaling signals, multi-region patterns
- 🆔 Unique ID Generation — UUID v4/v7/ULID, Twitter Snowflake, ticket servers — when to use each
- 📄 API Pagination — Why offset pagination fails, cursor-based and keyset pagination at scale
- 🔔 Notification System Design — Multi-channel delivery, fan-out at scale, idempotency, retry + DLQ
- 🔁 Advanced Data Patterns — Pre-computation, materialized views, ETL vs ELT, hot spot problem, backfill
🔴 Security (4)
- 🔐 Security & Authentication — Sessions vs JWT, OAuth 2.0 flow, API security checklist
- 🪪 Authorization, SSO & MFA — RBAC/ABAC/ReBAC, OIDC vs SAML, step-up authentication, passkeys
- 🛡️ Privacy & Data Compliance — PII handling, encryption strategies, GDPR/CCPA, data residency
- 🔑 Secrets Management & Threat Modeling — secret rotation, API keys, KMS/HSM, STRIDE, attack paths
🩷 AI & Machine Learning (5)
- 🤖 Machine Learning in System Design — feature store, recommendation and ranking systems, rollout strategy, drift, serving latency, rollback
- 🧠 AI Agent System Design — planner/reactor loops, function calling, retrieval, observability, agent benchmarks, model routing, budgets, safety
- 📈 Classic Machine Learning — Bias-variance, Naive Bayes, KNN, bagging vs boosting, SHAP/LIME, calibration, XGBoost, SVM, PCA
- 🔬 Deep Learning — Weight init, backprop, CNNs, LSTMs, full Transformer deep-dive, GANs, VAEs, diffusion, distillation, GQA/MQA
- 💬 LLM Interview Questions — Tokenization, RAG, LoRA/QLoRA, RLHF/DPO, scaling laws, MoE, multi-modal models, KV cache, CoT
🩵 Specialized Systems (2)
- 📝 Real-time Collaboration (Google Docs) — OT vs CRDT, operation logs, full Google Docs architecture
- 🎣 Webhooks System Design — Signed delivery, exponential retry, idempotency keys, full architecture
🟦 Reference (4)
- 🎯 Common Scenarios & Solutions — 17 scenario cheat sheets covering classic patterns plus multi-tenant SaaS, webhooks, recommendation/ranking, and multi-region reliability
- 📋 Reusable Design Templates — 12 full blueprints with architecture diagrams: YouTube, Twitter, WhatsApp, Uber, TinyURL, Rate Limiter, Metrics, TicketMaster, AI Agent, Typeahead, Google Docs, LeetCode
- 🧩 LeetCode Question Patterns — 21 algorithm patterns with code templates: arrays, two pointers, sliding window, trees, graphs, DP, backtracking, tries, segment tree, and more
- 🏗️ Low-Level System Design (LLD) — SOLID principles, 10 design patterns with code, 11 classic LLD questions (including LRU Cache, Parking Lot, Elevator, Rate Limiter, ATM, Tic-Tac-Toe, Logger, Library)
Pick a path based on your experience level, then use the interactive site to track your progress.
Beginner: Interview Framework → Numbers to Know → Database Selection → Caching Deep Dive → API Design & Gateway → Rate Limiting
Mid-Level: Distributed Fundamentals → Cloud Fundamentals → Compute & Deployment → Resilience Patterns → Observability → High Availability → Microservices → Notifications → Authorization / MFA
Advanced: AI Agent System Design → ML System Design → Cloud Networking → IAM / Governance → Reliability, Observability & Cost → Real-time Collaboration → Probabilistic DS → DB Internals
Open the interactive site and press ? to see all shortcuts:
| Key | Action |
|---|---|
| / | Focus search |
| d | Toggle dark mode |
| q | Start quiz / flashcard mode |
| b | Toggle bookmarks panel |
| ? | Show all keyboard shortcuts |
| Esc | Close reader / clear search / close panel |
| Space | Reveal quiz answer |
| → / ← | Next / previous quiz card or topic |
Option A — Interactive site (recommended)
No install. Works offline after first load. Progress saves to your browser. Includes an inline reader, quiz/flashcard mode, dark mode, and bookmarks.
Option B — Run locally
git clone https://github.com/Ali-Meh619/System_Design_ML_Principles.git
cd System_Design_ML_Principles
# Open site/index.html in your browser — no server needed
Option C — Read on GitHub
Navigate to docs/ and click any topic. GitHub renders Markdown natively.
System_Design_ML_Principles/
├── site/ # Interactive web app (no build step)
│ ├── index.html # Main SPA — dark mode, quiz, bookmarks, inline reader
│ ├── styles.css # Full design system with dark/light mode
│ ├── app.js # All interactive features
│ └── topics.js # Topic registry with icons, difficulty, tags, paths
├── docs/ # 48 topic documents
│ ├── foundation/ # Interview framework, estimation, I/O, networking
│ ├── api-networking/ # APIs, load balancing, rate limiting, realtime
│ ├── cloud-platform/ # Cloud foundations, deployment, networking, IAM, reliability
│ ├── data/ # Databases, caching, queues, internals
│ ├── distributed/ # CAP, consistency, microservices, resilience, patterns
│ ├── search/ # Full-text search, typeahead, geo, stream processing
│ ├── scale/ # Observability, HA, ID gen, pagination, notifications
│ ├── security/ # Auth, AuthZ, privacy, secrets, threat modeling
│ ├── machine-learning/ # ML systems, agents, Classic ML, DL, LLMs
│ ├── specialized/ # Collaboration and webhook-heavy systems
│ └── reference/ # Templates, cheat sheets, LeetCode, LLD
└── assets/ # Architecture diagram images
The strongest docs in this repo use a consistent interview-prep structure. Not every legacy page is identical yet, but new and upgraded docs aim to follow this pattern:
## Problem
What are we solving? When does this come up in an interview?
## Options
What are the main approaches? (with trade-off table)
## Recommended Default
What to pick and why, with the specific caveats.
## Failure Modes
What breaks? How do you detect and recover?
## Metrics
What do you measure to know it's working?
## Interview Answer Sketch
The concise 2-minute answer you'd give under time pressure.
Contributions make this better for everyone:
- Fork the repo and create a branch: git checkout -b feat/your-topic
- Follow the recommended topic structure above, especially defaults, trade-offs, failure modes, and metrics
- Add the topic to site/topics.js with an icon, difficulty, and tags
- Open a PR using the provided template
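As a rough illustration of the registry step, a new entry might look something like the sketch below. The field names (`id`, `title`, `icon`, `difficulty`, `tags`, `path`), the placeholder values, and the `missingFields` helper are all assumptions made for this example; check the existing entries in site/topics.js for the real shape before copying anything.

```javascript
// Hypothetical entry shape; the real field names live in site/topics.js.
const newTopic = {
  id: "your-topic",
  title: "Your Topic",
  icon: "🧭",
  difficulty: "intermediate", // assumed label; use whatever values the registry already uses
  tags: ["example", "placeholder"],
  path: "docs/<category>/your-topic.md", // placeholder path, not a real file
};

// Sanity check a contributor could run before opening a PR:
// returns the names of required fields that are absent or empty.
function missingFields(topic, required = ["title", "icon", "difficulty", "tags"]) {
  return required.filter((key) => {
    const value = topic[key];
    return (
      value === undefined ||
      value === "" ||
      (Array.isArray(value) && value.length === 0)
    );
  });
}

console.log(missingFields(newTopic));
```

The check only covers the three attributes named in the contribution steps plus a title; anything else the registry requires would need to be added to the `required` list.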
Every substantial addition should include:
- ✅ When to use
- ❌ When NOT to use
- 💥 Common failure modes
- 📊 Measurable success metrics
See CONTRIBUTING.md for the full guide.
- Star the repo — it helps others discover it
- Share it with your team or study group
- Open issues for topics you'd like to see covered
- Submit PRs to improve existing content
MIT © 2026. Free to use, share, and build on.
Built with ❤️ for engineers who take system design seriously.