How to use this file: Map any interview question to one of 12 battle-tested blueprints below. Each template follows the same structure: understand the pattern → confirm requirements → estimate scale → walk through the architecture → nail the trade-offs. Read the template before your interview, and you will know exactly what to say.
Every template below follows this interview-ready structure:
| Section | What to do | Time |
|---|---|---|
| Recognize the Pattern | Say which template this maps to and why | 30 sec |
| Clarify Requirements | Ask the checklist questions out loud | 3–5 min |
| Estimate Scale | Write the numbers on the whiteboard | 3 min |
| High-Level Design | Draw components and data flow | 5–8 min |
| Deep Dive | Zoom into the 2–3 hardest parts | 10–15 min |
| Trade-offs | State every decision with "at the cost of" | Woven in |
Covers: YouTube · Netflix · Dropbox · Google Photos · Imgur · Pastebin (media)
One-line pitch: "This is a read-heavy media system. The core strategy is upload-once-serve-many via CDN + async transcoding. Application servers are stateless. Only metadata hits the database — media bytes never do."
- Read:Write ratio ≥ 100:1
- Users upload content, millions stream/view it
- Media files (video, images, docs) are the primary payload
Ask these out loud before drawing anything:
Functional
- Upload and store videos/files?
- Stream or download on playback?
- Multiple quality levels (adaptive bitrate)?
- Search, comments, likes, subscriptions?
Non-Functional (these drive every architecture decision)
- How many DAU? (e.g., 500M)
- Max file size? (e.g., 128 GB for 4K video)
- Acceptable upload latency? (e.g., "upload visible within 5 min")
- Global or single-region?
- Durability requirement? (e.g., 99.999%)
DAU: 500M users
Upload rate: 500K videos/day → ~6 uploads/sec
Watch rate: 5B views/day → ~58K views/sec
Video size avg: 500 MB (1080p, 5 min)
Storage/year: 500K × 500 MB × 365 = ~91 PB/year (raw)
× ~3 storage multiplier for the 5 encoded renditions (lower resolutions are far smaller) ≈ 273 PB/year
CDN cache hit: ~95% (popular content served from edge)
Read:Write ratio: ~10,000:1
| Component | Purpose | Technology |
|---|---|---|
| API Gateway | Auth, rate limiting, routing | NGINX / Kong |
| Upload Service | Receives raw file, writes to S3 | Go / Node.js |
| Message Queue | Decouples upload from processing | Kafka / SQS |
| Transcoding Workers | Encode to 5 resolutions | FFmpeg on EC2/Lambda |
| Object Store | Raw + encoded media files | Amazon S3 |
| CDN | Serve media at edge, ~95% cache hit | CloudFront / Akamai |
| Metadata DB | Video titles, descriptions, user data | PostgreSQL (primary) |
| Cache | Hot metadata, recommendations | Redis (TTL 1 hr) |
| Search Service | Full-text search on titles/tags | Elasticsearch |
| ML Pipeline | Pre-compute recommendations | Spark + Redis |
Upload Flow (Write Path)
Client → API Gateway (auth + rate limit)
→ Upload Service → S3 (raw file, multipart)
→ Kafka "upload-complete" event
→ Transcoding Workers (FFmpeg: 1080p, 720p, 480p, 360p, 144p)
→ Each format → S3
→ Update PostgreSQL: status = READY, cdn_url = ...
→ Invalidate/warm CDN cache
Watch Flow (Read Path)
Client → CDN (cache hit ~95%) → return video bytes
↓ cache miss
→ S3 origin → stream via HTTP Range Requests
(Application servers NEVER touch media bytes)
Metadata / Search
Video metadata → PostgreSQL (writes)
→ Redis cache (reads, TTL 1h)
→ Read replicas (10×) for scale
Search query → Elasticsearch (synced from PostgreSQL via CDC)
Recommendations
Offline ML pipeline (nightly)
→ Pre-computed lists per user
→ Stored in Redis
→ Served in <1ms on homepage load
1. Transcoding at Scale
- Each video spawns 5 parallel encoding jobs
- Workers are stateless, pull from SQS, autoscale (KEDA)
- Progress updates via WebSocket back to uploader
- Failed jobs: retry with exponential backoff (max 3×)
2. CDN Cache Invalidation
- New upload: pre-warm CDN with `Cache-Control: public, max-age=31536000`
- Content deletion/update: `POST /invalidation` to the CDN API (costly; batch them)
- Popular vs. long-tail: popular videos always cached; long-tail fetched from S3 on demand
3. View Counter at Scale
- ❌ `UPDATE videos SET views = views + 1` per request → DB hotspot
- ✅ Redis `INCR video:{id}:views` per request (in-memory, atomic)
- Batch flush to PostgreSQL every 60 seconds
- Accept: displayed count may lag by ~60 sec (eventual consistency)
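A minimal Python sketch of this counter-batching pattern, using an in-process dict in place of Redis; `ViewCounter`, `record_view`, and `flush` are illustrative names, not a real client API.

```python
import threading
from collections import Counter

class ViewCounter:
    def __init__(self):
        self._pending = Counter()   # video_id -> views since last flush
        self._lock = threading.Lock()
        self.db_totals = Counter()  # stands in for the videos table

    def record_view(self, video_id):
        # Equivalent of: INCR video:{id}:views  (atomic, in-memory)
        with self._lock:
            self._pending[video_id] += 1

    def flush(self):
        # Equivalent of the 60-second batch job:
        # UPDATE videos SET views = views + {delta} WHERE id = {video_id}
        with self._lock:
            deltas, self._pending = self._pending, Counter()
        for video_id, delta in deltas.items():
            self.db_totals[video_id] += delta  # one DB write per hot video

counter = ViewCounter()
for _ in range(1000):
    counter.record_view("v42")
counter.flush()
assert counter.db_totals["v42"] == 1000  # 1000 views, a single DB write
```

The key property: 58K/sec of `record_view` calls collapse into one DB write per hot video per flush interval.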
| Decision | Option A | Option B | Our Choice | Why |
|---|---|---|---|---|
| Media delivery | App server streams | CDN serves directly | CDN | 100× cheaper, lower latency |
| Transcoding | Sync (slow upload) | Async (Kafka+workers) | Async | Upload completes instantly |
| View counts | Exact (DB write per view) | Approximate (Redis batch) | Approximate | 58K/sec → DB can't handle |
| Recommendations | Real-time | Pre-computed (offline) | Pre-computed | Adds ML complexity but instant reads |
"Media bytes flow through S3 and CDN exclusively — application servers never touch them. This is the key insight that makes this system scale."
"Transcoding is async. The upload returns immediately; the user sees 'processing' status while Kafka queues the encoding jobs."
"View counts use eventual consistency deliberately. Showing 1,523,441 vs 1,523,443 is not a business problem. DB consistency at 58K writes/sec is."
Covers: Twitter/X · Instagram · Facebook News Feed · LinkedIn · Reddit · TikTok
One-line pitch: "This is a social feed with massive fan-out. The core challenge is: when a user posts, we need to update millions of followers' feeds without killing the database. The answer is hybrid push/pull — push for regular users, pull for celebrities."
- Users follow other users; post content
- Need to show a personalized, ranked feed
- Write amplification problem: 1 post → N follower feed updates
- Mix of power users (millions of followers) and regular users
Functional
- Post tweets/photos/videos?
- See a personalized home feed?
- Follow/unfollow users?
- Like, repost, comment?
- Real-time notifications?
Non-Functional
- DAU? (e.g., 300M)
- Acceptable feed staleness? (e.g., eventual consistency OK)
- Celebrity threshold for pull vs push? (e.g., 10K followers)
- Feed size limit? (e.g., top 1000 items)
DAU: 300M users
Avg follows: 200 people per user
Posts/day: 100M → ~1,160 posts/sec
Read rate: 10B feed reads/day → ~115K reads/sec
Fan-out factor: avg 200 followers → 200 × 1,160 = 232K write ops/sec
Max celebrity: 100M followers (e.g., Elon Musk)
Read:Write ratio: ~100:1
Feed Redis size: 300M users × 1000 tweet IDs × 8 bytes = ~2.4 TB Redis
| Component | Purpose | Technology |
|---|---|---|
| Write API | Validate and persist post | Node.js / Go |
| Posts DB | Source of truth for all posts | Cassandra (partition: user_id) |
| Fan-out Service | Push tweet IDs to follower feeds | Kafka consumers |
| Feed Cache | Per-user feed (list of post IDs) | Redis List (LPUSH/LTRIM) |
| Feed API | Hydrate feed: IDs → full posts | Go service |
| Object Store | Photos, videos | S3 + CDN |
| Graph DB / Table | Follower/following relationships | PostgreSQL or Cassandra |
| Notification Service | Push notifications, emails | Kafka → FCM/APNs |
| Celebrity Service | Pull recent posts for mega-accounts | Separate read path |
Post Creation (Write Path)
Client → POST /tweet → API Gateway
→ Validate (auth, rate limit, content policy)
→ Write to Cassandra: {tweet_id, user_id, content, media_url, timestamp}
→ Publish to Kafka: "new-post" topic {tweet_id, user_id, follower_count}
Fan-out Service (Async)
Kafka Consumer reads "new-post"
→ IF follower_count < 10,000 (regular user):
Fetch follower list from Graph DB
For each follower:
LPUSH user:{follower_id}:feed {tweet_id}
LTRIM user:{follower_id}:feed 0 999 (keep latest 1000)
→ IF follower_count ≥ 10,000 (celebrity):
SKIP fan-out (too expensive)
Tag tweet as "celebrity post" in a separate sorted set
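The fan-out branch above can be sketched in Python with in-memory stand-ins for Redis and the graph DB; all names (`feeds`, `followers`, `fan_out`) are illustrative, not a real client API.

```python
from collections import defaultdict, deque

CELEBRITY_THRESHOLD = 10_000  # same cutoff as above

feeds = defaultdict(deque)           # user:{id}:feed  (Redis List stand-in)
celebrity_posts = defaultdict(list)  # per-author recent posts (sorted-set stand-in)
followers = {
    "alice": ["bob", "carol"],
    "megastar": [f"fan{i}" for i in range(50_000)],
}

def fan_out(author, tweet_id):
    flw = followers.get(author, [])
    if len(flw) >= CELEBRITY_THRESHOLD:
        # Celebrity: skip per-follower writes; readers pull at read time
        celebrity_posts[author].append(tweet_id)
        return 0
    for follower_id in flw:
        feeds[follower_id].appendleft(tweet_id)  # LPUSH
        while len(feeds[follower_id]) > 1000:    # LTRIM 0 999
            feeds[follower_id].pop()
    return len(flw)

fan_out("alice", "t1")     # regular user: 2 feed writes
fan_out("megastar", "t2")  # celebrity: 0 feed writes
```

The return value makes the asymmetry visible: a regular post costs O(followers) cache writes, a celebrity post costs O(1).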
Feed Read (Read Path)
GET /feed → Redis LRANGE user:{id}:feed 0 19 (20 tweet IDs)
+ ZRANGE celebrity_posts:{user_id} 0 4 (5 latest celebrity posts)
→ Merge and deduplicate by timestamp
→ Batch fetch tweet content + user profiles (Redis or Cassandra)
→ Return ranked feed
1. The Celebrity Problem
- Pushing to 100M followers on every tweet = 100M Redis writes in seconds
- Solution: Don't. At read time, pull the last 20 tweets from celebrities you follow and merge with your pre-built feed
- Merge logic: union of push-feed + celebrity pulls, sorted by timestamp, deduped
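The merge logic above — union of push feed and celebrity pulls, sorted by timestamp, deduped — as a short sketch; `merge_feed` and the `(timestamp, tweet_id)` tuple shape are assumptions for illustration.

```python
def merge_feed(push_feed, celebrity_pulls, limit=20):
    """Merge pre-built push feed with on-demand celebrity pulls."""
    seen = set()
    merged = []
    # Newest first: sort the union by timestamp, descending
    for ts, tweet_id in sorted(push_feed + celebrity_pulls,
                               key=lambda t: t[0], reverse=True):
        if tweet_id in seen:
            continue  # dedupe: the same tweet may appear in both sources
        seen.add(tweet_id)
        merged.append(tweet_id)
    return merged[:limit]

push = [(105, "t5"), (101, "t1"), (103, "t3")]
celeb = [(104, "t4"), (103, "t3")]  # t3 appears in both sources
assert merge_feed(push, celeb) == ["t5", "t4", "t3", "t1"]
```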
2. Feed Ranking
- Chronological: trivial (sort by timestamp)
- Algorithmic (like Instagram): offline ML scores each post → score stored in feed cache → sort by score, not time
3. Hot User Problem
- A celebrity posts → millions of fans' feeds need updating
- Solution: Kafka partitioning by user_id ensures ordering. Fan-out workers autoscale. Rate-limit the writes (eventual consistency — fans see the tweet in seconds, not milliseconds)
| Decision | Option A | Option B | Our Choice | Why |
|---|---|---|---|---|
| Feed delivery | Push (fan-out on write) | Pull (on demand) | Hybrid | Pure push kills DB for celebrities; pure pull is slow |
| Feed storage | DB query | Redis pre-built list | Redis | 115K reads/sec → DB can't keep up |
| Post storage | SQL (joins) | Cassandra (denormalized) | Cassandra | 1,160 writes/sec, timeline queries by user |
| Celebrity threshold | None (always push) | 10K followers | 10K | Empirical sweet spot (commonly cited as Twitter's threshold) |
"The core tension is write amplification. If we push to all followers on every post, a celebrity with 100M followers causes 100M Redis writes per tweet. The hybrid model is the industry-standard solution."
"Feed reads come from Redis — they are O(1). The only computation is at write time (fan-out) and at merge time (celebrity pull + rank)."
"I'd use Cassandra for posts because it's optimized for time-series writes and timeline queries by user_id with timestamp ordering."
Covers: WhatsApp · Slack · Messenger · Discord · iMessage (server side)
One-line pitch: "This is a real-time bidirectional messaging system. The key insight is: HTTP is request-response and doesn't work for real-time push. We need persistent WebSocket connections, routing via Redis, and Cassandra for message storage because it's optimized for high-write, time-ordered queries."
- Messages must be delivered in milliseconds (not seconds)
- Both sender and receiver need to see updates without polling
- Offline message delivery required
- Message ordering guarantees matter
Functional
- 1:1 and/or group messages?
- Online/offline status (presence)?
- Message delivery receipts (sent/delivered/read)?
- Media (images, files)?
- Message history / search?
Non-Functional
- DAU? (e.g., 500M)
- Max group size? (e.g., 1000 members)
- Message delivery latency? (e.g., < 100ms for online users)
- Message retention? (e.g., 7 years)
DAU: 500M users
Messages/day: 100B → ~1.16M messages/sec
Avg message size: 1 KB (text) + media URL
Storage/day: 100B × 1KB = ~100 TB/day
Concurrent WS conn: ~100M at peak (20% DAU online)
Messages per user: 200/day
| Component | Purpose | Technology |
|---|---|---|
| WebSocket Servers | Maintain persistent connections | Golang (high concurrency) |
| Connection Router | user_id → server mapping | Redis Hash |
| Chat Service | Route, persist, fan-out messages | Go microservice |
| Message Store | Ordered message history | Cassandra |
| Redis Pub/Sub | Cross-server message delivery | Redis |
| Presence Service | Online/offline status | Redis (TTL-based heartbeat) |
| Notification Service | Push to offline users | FCM / APNs |
| Media Service | Upload/serve files | S3 + CDN |
Connection Setup
User opens app → WebSocket handshake to Chat Server
Load balancer: consistent hashing on user_id → same server on reconnect
Chat Server: HSET user_connections {user_id: server_address} (Redis)
SETEX presence:{user_id} 30 "online" (Redis, heartbeat every 15s)
Send Message
Sender → WebSocket → Chat Server A
Chat Server A:
1. Validate (auth, content policy, rate limit)
2. Persist to Cassandra:
- partition key: conversation_id
- clustering key: message_id (Snowflake ID, time-ordered)
- fields: sender_id, content, type, status=SENT, timestamp
3. PUBLISH to Redis channel "conv:{conversation_id}" {message payload}
4. ACK sender: {message_id, status: SENT}
Deliver to Recipient
All Chat Servers subscribed to "conv:{conversation_id}"
→ Server B has recipient's WebSocket
→ Server B receives message via Redis Pub/Sub
→ Server B pushes to recipient's WebSocket
→ Recipient ACKs → Server B updates Cassandra: status=DELIVERED
→ Recipient opens message → status=READ → broadcast read receipt
Offline Delivery
Redis Pub/Sub delivery fails (no server has recipient's socket)
→ Chat Service: RPUSH offline:{user_id} {message_id}
→ Notification Service: sends FCM/APNs push notification
→ User comes online → WebSocket connects → LRANGE offline:{user_id} 0 -1
→ Fetch full messages from Cassandra → deliver in order → clear offline queue
1. Message Ordering
- Use Snowflake IDs as message_id: 41-bit timestamp + 10-bit machine ID + 12-bit sequence
- Cassandra clustering key = message_id → physical time ordering guaranteed per partition
- If two messages arrive simultaneously: Snowflake ensures strict ordering
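A minimal sketch of the Snowflake layout described above (41-bit millisecond timestamp, 10-bit machine ID, 12-bit per-millisecond sequence). The epoch constant and class name are assumptions; the sequence-overflow handling is simplified to a bump instead of a spin-wait.

```python
import threading
import time

class SnowflakeGenerator:
    EPOCH_MS = 1_288_834_974_657  # assumed custom epoch (Twitter's, 2010-11-04)

    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024  # 10 bits
        self.machine_id = machine_id
        self.last_ms = -1
        self.sequence = 0
        self._lock = threading.Lock()

    def next_id(self, now_ms=None):
        with self._lock:
            now = int(time.time() * 1000) if now_ms is None else now_ms
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit wrap
                if self.sequence == 0:
                    now += 1  # sequence exhausted: move to next ms (simplified)
            else:
                self.sequence = 0
            self.last_ms = now
            # timestamp | machine | sequence  →  sortable by creation time
            return ((now - self.EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(machine_id=7)
a = gen.next_id(now_ms=1_700_000_000_000)
b = gen.next_id(now_ms=1_700_000_000_000)  # same ms → sequence increments
assert b == a + 1  # unique and strictly ordered
```

Because the timestamp occupies the high bits, sorting IDs numerically sorts messages by creation time — exactly what the Cassandra clustering key needs.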
2. Large Groups (Slack channels with 10K+ members)
- Don't fan-out via Redis Pub/Sub (too many server hops)
- Use Kafka: message → Kafka topic `group:{id}` → each Chat Server that has a member online consumes the topic
- Servers deduplicate delivery to their connected members
3. Message Guarantee
- At-least-once delivery: ACK from client required; retry if no ACK in 5s
- Exactly-once display: client deduplicates by message_id before rendering
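The client-side half of this guarantee is a few lines: at-least-once delivery means retries can duplicate messages, so the client renders only IDs it hasn't seen. `render_stream` and the message dict shape are illustrative.

```python
def render_stream(incoming_messages):
    """Exactly-once display on top of at-least-once delivery."""
    seen_ids = set()
    rendered = []
    for msg in incoming_messages:
        if msg["message_id"] in seen_ids:
            continue  # duplicate from a retry (ACK was lost); drop silently
        seen_ids.add(msg["message_id"])
        rendered.append(msg["content"])
    return rendered

msgs = [
    {"message_id": 1, "content": "hi"},
    {"message_id": 1, "content": "hi"},   # server retried after missing ACK
    {"message_id": 2, "content": "there"},
]
assert render_stream(msgs) == ["hi", "there"]
```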
| Decision | Option A | Option B | Our Choice | Why |
|---|---|---|---|---|
| Transport | HTTP polling | WebSocket | WebSocket | Polling = high latency + server load |
| Message DB | MySQL | Cassandra | Cassandra | Write-heavy, time-ordered, scales horizontally |
| Cross-server routing | Sticky sessions | Redis Pub/Sub | Redis Pub/Sub | Stateless servers, no sticky session hell |
| Group fan-out | Redis Pub/Sub | Kafka | Kafka for large groups | Pub/Sub can't fan-out to 10K+ members reliably |
"The fundamental problem with HTTP is that the server can't push to the client. WebSockets solve this with a persistent, full-duplex connection."
"I store messages in Cassandra partitioned by conversation_id. This means fetching a conversation's history is a single-partition scan — extremely fast."
"Redis is the routing table: it maps each online user to the Chat Server that holds their WebSocket. Without this, Server A can't know that the recipient is on Server B."
Covers: Uber/Lyft (ride matching) · Yelp (business search) · Google Maps · Tinder (proximity) · DoorDash
One-line pitch: "This is a geospatial system. The core challenge is: how do you efficiently query 'find all drivers within 5km of this point' over millions of live locations? The answer is Redis GEORADIUS using geohash — sub-millisecond for millions of points, updated every few seconds."
- Need to find things near a point (drivers, restaurants, users)
- Location data updates in real-time
- Queries like "nearest X within Y km"
Functional (Uber)
- Drivers update their location continuously?
- Rider requests a ride → match to nearest available driver?
- Calculate ETA and route?
- Trip tracking in real-time?
Non-Functional
- How many active drivers at peak? (e.g., 1M)
- Location update frequency? (e.g., every 4 sec for drivers)
- Matching latency? (e.g., < 1 sec to show available drivers)
- Geographic scope? (global vs. city-by-city)
Active drivers at peak: 1M worldwide
Location updates: 1M drivers × 1 update/4sec = 250K writes/sec
Rider requests: 10M rides/day → ~115 requests/sec
Search radius: 5 km
Geohash precision 6: covers ~1.2km × 0.6km cell
Redis GEORADIUS query: <1ms for 1M points
Location data size: 1M drivers × 40 bytes = ~40 MB (trivially fits in RAM)
| Component | Purpose | Technology |
|---|---|---|
| Location Service | Ingest driver location updates | Golang + Redis |
| Redis Geo | Real-time spatial index | GEOADD, GEORADIUS |
| Matching Service | Match rider to nearest driver | Go + Redis |
| Trip Service | Manage active trip state | PostgreSQL |
| Route Service | Shortest path computation | In-memory road graph (Dijkstra/A*) |
| ETA Service | Estimate arrival time | Route + traffic data |
| Cassandra | Historical location/trip data | Partitioned by driver_id |
| WebSocket Server | Real-time position streaming to rider | Go |
Driver Location Updates
Driver app (every 4 sec) → HTTPS or WebSocket → Location Service
Location Service:
GEOADD active_drivers {longitude} {latitude} {driver_id}
SET driver:{id}:meta {status, rating, car_type} EX 30 (TTL: driver goes offline after 30s without update)
Rider Requests a Ride
Rider: POST /ride {pickup_lat, pickup_lng, car_type}
Matching Service:
GEORADIUS active_drivers {lat} {lng} 5 km ASC COUNT 10
→ returns [driver_id_1, driver_id_2, ...] sorted by distance
→ Filter by car_type, rating, availability
→ Offer trip to closest driver (timeout 10 sec)
→ Driver accepts → create Trip in PostgreSQL
→ Driver declines → offer to next driver
Real-time Trip Tracking
Driver location → Location Service → Redis Geo (update position)
→ WebSocket push to rider: {driver_lat, driver_lng}
Rider app renders live driver position on map (updates every 4 sec)
ETA Calculation
Route Service: pre-loaded road graph from OpenStreetMap (in memory)
Run Dijkstra/A* from driver_location → rider_location
ETA = sum(segment_distance / segment_avg_speed)
Traffic data: refreshed every 5 min from aggregate probe vehicle data
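The ETA computation above can be sketched as Dijkstra over travel time rather than raw distance: each edge weight is segment_distance / segment_avg_speed. The toy graph and function name are assumptions for illustration.

```python
import heapq

# Hypothetical road graph: node -> list of (neighbor, distance_km, avg_speed_kmh)
graph = {
    "driver": [("a", 1.0, 30), ("b", 2.0, 60)],
    "a": [("rider", 1.0, 30)],
    "b": [("rider", 0.5, 60)],
    "rider": [],
}

def eta_hours(graph, src, dst):
    """Dijkstra where the cost of an edge is its travel time in hours."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        t, node = heapq.heappop(pq)
        if node == dst:
            return t
        if t > dist.get(node, float("inf")):
            continue  # stale entry
        for nbr, km, kmh in graph[node]:
            nt = t + km / kmh  # segment_distance / segment_avg_speed
            if nt < dist.get(nbr, float("inf")):
                dist[nbr] = nt
                heapq.heappush(pq, (nt, nbr))
    return float("inf")

# driver→a→rider: 4 min at 30 km/h; driver→b→rider: 2.5 min mostly at 60 km/h
assert abs(eta_hours(graph, "driver", "rider") - 2.5 / 60) < 1e-12
```

Refreshing traffic data just means updating the `avg_speed_kmh` values; the algorithm is unchanged.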
Business Search (Yelp variant)
"Italian restaurants within 2km, rating > 4.0"
→ Elasticsearch: text match on cuisine + rating filter
→ Geohash filter: prefix match on geohash of search point
(neighbors share geohash prefix → efficient bounding-box query)
→ Merge results, return sorted by relevance × distance
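The geohash prefix trick above rests on how encoding works: each character narrows an interleaved longitude/latitude bounding box by 5 bits, so nearby points share prefixes. A minimal encoder using the standard geohash base32 alphabet:

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash(lat, lng, precision=6):
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    bits, bit_count, even_bit = 0, 0, True  # even bits refine longitude
    out = []
    while len(out) < precision:
        if even_bit:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits, lng_lo = bits * 2 + 1, mid
            else:
                bits, lng_hi = bits * 2, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = bits * 2 + 1, mid
            else:
                bits, lat_hi = bits * 2, mid
        even_bit = not even_bit
        bit_count += 1
        if bit_count == 5:  # every 5 bits → one base32 character
            out.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(out)

# Two points ~150m apart in midtown Manhattan land in the same prefix cell
assert geohash(40.7580, -73.9855)[:5] == geohash(40.7590, -73.9850)[:5]
```

A bounding-box query then becomes a string prefix match over these codes (plus the 8 neighbor cells, to handle points near a cell edge).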
1. Geohash vs QuadTree
| | Geohash | QuadTree |
|---|---|---|
| Storage | String key in Redis | Tree structure in memory |
| Range query | String prefix match | Recursive split |
| Dynamic updates | O(1) with Redis | Rebalancing needed |
| Best for | Real-time driver updates | Static business data |
2. Hot Geohash Cells
- Problem: Manhattan rush hour — 10K drivers in one tiny geohash cell
- Solution: Dynamic radius expansion (start at 1km, expand if fewer than 3 drivers found)
3. Driver Goes Offline
- Redis key TTL: if no update in 30 sec, driver entry auto-expires from the geo index
- Matching service gets stale entry? Ping driver directly; if no ACK, skip
| Decision | Option A | Option B | Our Choice | Why |
|---|---|---|---|---|
| Location store | PostgreSQL (PostGIS) | Redis Geo | Redis | PostGIS adds query latency; 250K updates/sec needs in-memory |
| Spatial index | QuadTree | Geohash | Geohash | Easier to implement in Redis, O(1) updates |
| Driver matching | Polling (rider polls) | Push (server pushes) | Push via WebSocket | Polling creates unnecessary load; WS gives real-time UX |
| Route computation | External API | In-house A* | In-house | External API = cost + latency + vendor lock-in at scale |
"The entire active driver dataset — 1 million locations — fits in ~40 MB of RAM. Redis GEORADIUS queries this in under a millisecond. This is why Redis is the right tool here, not PostGIS."
"Geohash converts 2D coordinates into a 1D string. Nearby locations share string prefixes, so a prefix query returns all neighbors. That's the magic that makes geo queries O(prefix_length)."
"Drivers that stop updating their location are automatically evicted from the index via Redis TTL — no cleanup job needed."
Covers: TinyURL · Bit.ly · Pastebin · Short links in any product · Snowflake ID generation
One-line pitch: "This is a unique ID generation + redirect system. Extremely read-heavy (100:1). The entire redirect path should be served from Redis cache — the database is only hit on cache misses and new URL creation."
- Need to generate unique short codes or IDs at scale
- Massive read volume vs. tiny write volume
- Redirect latency must be < 10ms
Functional
- Shorten any URL? Custom aliases?
- Expiry dates on short links?
- Analytics (click counts, geographic data, referrers)?
- User accounts and link management?
Non-Functional
- How many new URLs/day? (e.g., 100M)
- How many redirects/day? (e.g., 10B)
- Required uniqueness space? (How long should the code be?)
- Availability requirement? (e.g., 99.99%)
New URLs/day: 100M → ~1,160 writes/sec
Redirects/day: 10B → ~115K reads/sec
Read:Write ratio: ~100:1
Short code length: 7 chars Base62 = 62^7 ≈ 3.5 trillion unique codes (~95 years of headroom at 100M/day)
Storage/URL: ~500 bytes (URL + metadata)
Storage/year: 100M/day × 365 × 500B = ~18 TB/year
Redis cache: Hot 20% of URLs serve 80% of traffic
20% of 10B = 2B unique short codes × 500B = ~1 TB Redis
| Component | Purpose | Technology |
|---|---|---|
| Write API | Accept long URL, return short code | Go / Python |
| ID Generator | Generate unique short codes | Counter + Base62 OR Hash |
| Redirect Service | 302 redirect from short → long URL | Go (stateless) |
| Cache | short_code → long_url lookups | Redis (LRU eviction) |
| Database | Source of truth for all URLs | PostgreSQL |
| Analytics Pipeline | Click events → aggregated stats | Kafka → ClickHouse |
| Expiry Worker | Remove expired URLs | Cron + PostgreSQL |
Short Code Generation — Two Approaches
Option A: Hash-based
MD5/SHA256(long_url) → take first 7 chars
Collision: try chars 2-8, then 3-9, etc.
✓ Same URL always produces same code (dedup built-in)
✗ Hash collisions require retry logic
Option B: Counter + Base62 (preferred)
Global counter (Redis INCR or DB sequence) → encode to Base62
Base62 alphabet: a-z A-Z 0-9 (62 chars)
Counter 1 → "aaaaaab", 1B → "abfP3Qq" (zero-padded to 7 chars with the alphabet's first character)
✓ No collisions. Guaranteed unique. Deterministic length.
✗ Predictable (users can enumerate URLs if sequential)
Fix: add salt or shuffle the Base62 alphabet
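A minimal encoder for Option B, using the alphabet order listed above (a–z, A–Z, 0–9, so index 0 is 'a'). Exact codes depend on this ordering — shuffling the alphabet is the anti-enumeration fix just mentioned.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def base62(n, length=7):
    """Encode a counter value to a fixed-length Base62 short code."""
    if n == 0:
        return ALPHABET[0] * length
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    # Most-significant digit first, left-padded with the zero character
    return "".join(reversed(digits)).rjust(length, ALPHABET[0])

assert base62(1) == "aaaaaab"
assert base62(1_000_000_000) == "abfP3Qq"
# Injective by construction: distinct counters never collide
assert len({base62(i) for i in range(10_000)}) == 10_000
```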
Database Schema
CREATE TABLE urls (
    short_code VARCHAR(10) PRIMARY KEY,
    long_url TEXT NOT NULL,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP,
    click_count BIGINT DEFAULT 0
);
CREATE INDEX idx_user_id ON urls (user_id); -- inline INDEX is MySQL syntax; PostgreSQL uses CREATE INDEX
Redirect Flow
GET /{short_code}
→ Redis GET short_code (cache hit <1ms → 302 redirect to long_url)
→ Cache miss → PostgreSQL SELECT long_url WHERE short_code = ?
→ if not found: 404
→ if expired: 410 Gone
→ found: SET Redis short_code long_url EX 86400 (24h TTL)
→ async push click event to Kafka
→ 302 redirect
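The redirect flow above is the classic cache-aside pattern; a sketch with in-memory stand-ins for Redis and PostgreSQL (all names and the sample rows are illustrative):

```python
import time

cache = {}  # short_code -> long_url  (Redis stand-in)
db = {"abc123": {"long_url": "https://example.com/page", "expires_at": None}}

def redirect(short_code, now=None):
    """Return (http_status, location) for a short-link hit."""
    now = time.time() if now is None else now
    if short_code in cache:              # hot path: Redis hit, <1ms
        return 302, cache[short_code]
    row = db.get(short_code)             # cache miss: one DB lookup
    if row is None:
        return 404, None
    if row["expires_at"] is not None and row["expires_at"] < now:
        return 410, None                 # expired: Gone
    cache[short_code] = row["long_url"]  # populate cache (SET ... EX 86400)
    return 302, row["long_url"]          # click event would go to Kafka here

assert redirect("abc123") == (302, "https://example.com/page")  # fills cache
assert redirect("abc123") == (302, "https://example.com/page")  # served from cache
assert redirect("nope") == (404, None)
```

Cache-aside means only codes that are actually requested occupy Redis memory, which is why it beats write-through for this workload.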
301 vs 302 Redirect
301 Permanent: browser caches the redirect
✓ Subsequent clicks never hit your server (zero infra cost)
✗ You lose all click analytics (browser skips your server)
302 Temporary: browser always hits your server
✓ Every click goes through you → full analytics
✗ Higher server load
→ Use 302 if analytics matter (it almost always does)
Analytics
Click event → Kafka → Stream processor → ClickHouse
Metrics per short_code: total clicks, unique visitors, countries, referrers, devices
Serve analytics dashboard via ClickHouse aggregation queries
1. Cache Eviction Strategy
- LRU eviction: evict least recently used short codes
- Most short links are viral for a day, then dead → LRU is perfect here
- Cache hit rate ~95% for popular links (20% of URLs = 80% of traffic)
2. Expiry
- Store `expires_at` in PostgreSQL
- On redirect: check expiry before returning long_url
- Cleanup: nightly cron `DELETE FROM urls WHERE expires_at < NOW()`
- For immediate expiry: Redis TTL auto-removes the cache entry
3. Custom Aliases
- User provides `abc123` → check for collision → write to DB
- If collision: return error ("alias taken"), ask user to choose another
| Decision | Option A | Option B | Our Choice | Why |
|---|---|---|---|---|
| ID generation | Hash (MD5) | Counter + Base62 | Counter | No collisions; simpler retry logic |
| Redirect code | 301 Permanent | 302 Temporary | 302 | Analytics > infra savings |
| Cache strategy | Write-through | Cache-aside | Cache-aside | Populate only on first read; saves memory |
| Analytics | Sync (in redirect) | Async (Kafka) | Async | Don't add redirect latency for analytics |
"7 characters of Base62 gives us 62^7 ≈ 3.5 trillion unique short codes; at 100M URLs/day that's roughly 95 years of headroom. Length is a function of the uniqueness space, not aesthetics."
"We use 302 redirects, not 301. 301 means the browser permanently caches the redirect — you never see the click again. For any product with analytics, 302 is the only reasonable choice."
"The entire read path (115K redirects/sec) goes through Redis. The database handles maybe 5K requests/sec (cache misses + writes). This is how we scale to 10B redirects/day on modest hardware."
Covers: API rate limiting · DDoS protection · Per-user quotas · Throttling in any microservice
One-line pitch: "Rate limiting enforces per-user or per-endpoint request quotas. The core choice is algorithm (Token Bucket is best for bursts) and where to enforce (API Gateway + shared Redis state). The hardest part is making the check atomic to avoid race conditions."
- Need to prevent any single user or IP from overwhelming a service
- Different tiers of users have different quotas
- Must be enforced across multiple API server instances
Functional
- Rate limit by user ID? IP address? API key? Tenant?
- Different limits per endpoint? (e.g., /search = 10/sec, /post = 100/day)
- Tiered limits (free vs. pro vs. enterprise)?
- Soft limits (throttle) or hard limits (reject with 429)?
Non-Functional
- How many API servers? (i.e., must Redis be shared)
- Acceptable rate-limit check latency? (< 1ms add)
- What happens if Redis is down? (fail-open vs fail-closed)
API requests: 100K req/sec
Rate-limit checks: 100K/sec (one per request)
Unique users: 10M active
Redis key per user: ~100 bytes
Total Redis size: 10M × 100B = ~1 GB (trivial)
Redis throughput: 100K ops/sec (well within single Redis node limits)
Rate-limit latency: <1ms added to each request
| Component | Purpose | Technology |
|---|---|---|
| API Gateway | Intercepts every request, calls Rate Limiter | Kong / NGINX / Envoy |
| Rate Limiter Service | Stateless workers; checks against Redis | Go microservice |
| Redis Cluster | Shared state: per-user counters and tokens | Redis (single-digit ms) |
| Config Store | Quota rules per tier/endpoint | Redis or DB |
| Monitoring | Alert on abuse patterns | Prometheus + Grafana |
Algorithm: Token Bucket (Recommended)
Each user has a "bucket" with:
- capacity C (max burst)
- refill rate R tokens/second
On each request:
tokens = current_tokens + (elapsed_seconds × R)
tokens = min(tokens, C) # cap at capacity
if tokens >= 1:
tokens -= 1 → ALLOW request
else:
→ REJECT with HTTP 429
Stores per user: {last_refill_timestamp, current_tokens}
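The refill-and-take logic above as a single-bucket Python sketch. In production this state lives in Redis and the check must run atomically across API servers; the class and parameter names here are illustrative.

```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity        # C: max burst
        self.refill_rate = refill_rate  # R: tokens per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill based on elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # ALLOW
        return False      # REJECT -> HTTP 429

bucket = TokenBucket(capacity=3, refill_rate=1, now=1000.0)  # burst 3, 1 token/sec
assert [bucket.allow(1000.0) for _ in range(4)] == [True, True, True, False]
assert bucket.allow(1002.0) is True  # 2 sec later: 2 tokens refilled
```

Passing `now` explicitly makes the refill math testable; the Redis version gets the timestamp from the caller for the same reason.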
Other Algorithms (know when to mention each)
| Algorithm | Best For | Trade-off |
|---|---|---|
| Token Bucket | Bursts allowed (most APIs) | Slightly complex state |
| Fixed Window | Simple quota (100/hour) | Boundary spikes (100 req at :59 + 100 at :00 → 200 in under a minute) |
| Sliding Window Log | Exact enforcement | Memory: stores all timestamps |
| Sliding Window Counter | Approximate + memory-efficient | Interpolates between windows |
Atomic Redis Implementation (Lua Script)
-- Lua script runs atomically on Redis server (no race conditions)
local key = KEYS[1] -- "ratelimit:{user_id}"
local capacity = tonumber(ARGV[1])    -- 100 (ARGV values arrive as strings)
local refill_rate = tonumber(ARGV[2]) -- 10 per second
local now = tonumber(ARGV[3])         -- current timestamp (ms)
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
local elapsed = (now - last_refill) / 1000
tokens = math.min(capacity, tokens + elapsed * refill_rate)
if tokens >= 1 then
tokens = tokens - 1
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return 1 -- ALLOW
else
return 0 -- REJECT
end
Response Headers (always return these)
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1703030400 (Unix timestamp when tokens refill)
Retry-After: 30 (seconds until next allowed request)
Tiered Limits
User tier stored in JWT / user profile:
Free: 100 req/min, 1,000 req/day
Pro: 1,000 req/min, 50,000 req/day
Enterprise: custom (stored in Config DB)
Rate Limiter reads tier from user context → applies correct bucket config
1. Distributed Race Condition
- Problem: Two API servers check simultaneously → both see tokens = 1 → both allow → effectively 2 tokens consumed
- Solution: Lua script evaluated atomically on Redis → no TOCTOU race condition
- Alternative: Redis `MULTI`/`EXEC` (transaction); simpler but less flexible
2. Redis Failure Modes
| Strategy | Behavior | Use When |
|---|---|---|
| Fail open | Allow all requests if Redis is down | External-facing APIs (availability > security) |
| Fail closed | Reject all requests if Redis is down | Financial APIs, billing endpoints |
| Local fallback | Each server maintains its own counter | Best balance (eventual consistency) |
3. Distributed Rate Limiting Across DCs
- Each DC has its own Redis cluster
- Limit = global_limit / num_DCs (e.g., 100 req/min → 50 per DC)
- Slight over-allowance at DC boundaries is acceptable
| Decision | Option A | Option B | Our Choice | Why |
|---|---|---|---|---|
| Algorithm | Fixed Window | Token Bucket | Token Bucket | Handles bursts; no boundary spike problem |
| Atomicity | MULTI/EXEC | Lua Script | Lua Script | Fewer round trips; single atomic operation |
| Failure | Fail-closed | Fail-open | Fail-open | Availability > occasional quota violation |
| Enforcement | Per server | Centralized Redis | Centralized | Per-server allows N× the quota with N servers |
"The critical requirement for distributed rate limiting is atomicity — the check and decrement must be a single atomic operation. Without this, concurrent requests from multiple API servers will all see 'tokens available' and all be allowed, violating the quota."
"I'd use a Lua script in Redis. The Lua evaluator on Redis runs scripts atomically, so no two scripts can interleave. It's more efficient than MULTI/EXEC because it's a single round trip."
"If Redis goes down, I'd fail open — allow all requests — rather than take down the entire API. We might over-serve some users for a few seconds, but that's better than complete unavailability."
Covers: Datadog · Prometheus + Grafana · CloudWatch · Application Performance Monitoring (APM)
One-line pitch: "This is a high-ingest time-series system. The core challenges are: (1) ingesting millions of metrics/second without dropping any, (2) querying arbitrary time ranges efficiently, and (3) alerting within seconds of a threshold breach."
- Ingest high-volume time-stamped numeric data
- Need to query: "show me CPU usage for service X over the last 6 hours"
- Alert when a metric crosses a threshold
- Long-term storage with downsampling
Functional
- Which metrics? (CPU, memory, latency, error rate, custom business metrics?)
- Dashboards and real-time visualization?
- Alerting (threshold-based? anomaly detection?)?
- Log correlation with metrics?
Non-Functional
- Ingestion rate? (e.g., 1M metrics/sec)
- Query latency? (e.g., dashboard loads in < 3 sec)
- Retention? (e.g., 1 year full fidelity, 5 years downsampled)
- Number of services/hosts? (e.g., 10K services, 100K hosts)
Services monitored: 10,000
Metrics per service: 50 (CPU, mem, latency p50/p95/p99, error rate, etc.)
Hosts per service: 10 avg
Metrics/sec: 10K × 50 × 10 × 1/10sec = 500K metrics/sec
Data point size: ~40 bytes (timestamp 8B + value 8B + labels 24B)
Raw storage/day: 500K × 40B × 86,400 = ~1.7 TB/day
After downsampling: 1-min resolution after 24h → 6× compression → ~0.28 TB/day ≈ 8.5 TB/month
| Component | Purpose | Technology |
|---|---|---|
| Agents/Exporters | Collect metrics from each host | Prometheus exporter / Telegraf |
| Ingest API | HTTP endpoint for metric push | Go service |
| Kafka | Buffer + replay, decouple ingest from storage | Kafka (3 brokers) |
| Stream Processor | Pre-aggregate and route | Kafka Streams / Flink |
| Time-Series DB | Store metric data | Prometheus / InfluxDB / Cassandra |
| Query Engine | Execute PromQL / SQL | Thanos / VictoriaMetrics |
| Alert Engine | Evaluate rules, fire notifications | Alertmanager / PagerDuty |
| Dashboard | Visualize metrics | Grafana |
| Long-term Store | Compressed historical data | S3 (Thanos sidecar) |
Ingestion Path
Host agent (every 10 sec) → scrape all metrics locally
→ Push batch to Ingest API via HTTP
→ Ingest API → Kafka topic "metrics" (partition by service_name)
Kafka → Stream Processor:
- Validate and parse
- Tag enrichment (add region, team, environment labels)
- Pre-aggregate: sum counters in 30-sec windows (reduces cardinality)
- Route to Time-Series DB
Storage Schema (Cassandra approach)
CREATE TABLE metrics (
service_name TEXT,
metric_name TEXT,
bucket_date DATE, -- partition key (day boundary)
timestamp BIGINT, -- clustering key (nanoseconds)
value DOUBLE,
tags MAP<TEXT, TEXT>,
PRIMARY KEY ((service_name, metric_name, bucket_date), timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC);
-- Query: "CPU usage for web-server on 2024-01-15 between 14:00 and 16:00"
-- → single partition scan, no cross-partition joins
Downsampling Policy
Raw (10-sec resolution): keep for 24 hours
1-min resolution: keep for 30 days (6× compression vs. 10-sec raw)
1-hour resolution: keep for 1 year (360× compression)
1-day resolution: keep indefinitely
Run nightly downsampling job:
SELECT * FROM metrics WHERE bucket_date = yesterday
→ aggregate into 1-min buckets → write to metrics_1min table
→ delete raw data older than 24h
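As a sketch, the rollup step reduces to bucketing timestamps and averaging. This is pure Python with illustrative field names; the real job would run as a batch over Cassandra partitions:

```python
from collections import defaultdict

def downsample(points, bucket_sec=60):
    """Roll raw (timestamp_sec, value) samples up into per-bucket averages,
    e.g. 10-sec resolution -> 1-min resolution."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_sec].append(value)
    return {start: sum(vals) / len(vals) for start, vals in buckets.items()}

# Six 10-sec samples collapse into one 1-min data point:
raw = [(0, 1), (10, 2), (20, 3), (30, 4), (40, 5), (50, 6)]
rollup = downsample(raw)  # {0: 3.5}
```

The same function with bucket_sec=3600 produces the 1-hour tier; whether the aggregate is avg, max, or a set of percentiles depends on the metric type.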
Alerting
Alert Rules (e.g., "CPU > 90% for 5 consecutive minutes"):
Alert Engine queries TSDB every 60 sec
Evaluates: avg(cpu_usage{service="web"}) over 5m > 0.9
→ If true for 5 checks: fire alert → PagerDuty webhook → SMS/Slack
Anomaly detection:
Z-score: alert if value > mean ± 3σ (detects unknown-unknowns)
Seasonal: ARIMA model for time-of-day seasonality
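The z-score check is small enough to show directly (a minimal sketch; the baseline window and 3σ threshold are illustrative):

```python
import statistics

def is_anomalous(history, value, z_threshold=3.0):
    """Flag a sample more than z_threshold standard deviations from the
    recent mean: the 'unknown-unknowns' detector."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean  # flat baseline: any change is anomalous
    return abs(value - mean) / stdev > z_threshold

baseline = [50, 52, 49, 51, 50, 48, 50, 51]  # steady CPU %
is_anomalous(baseline, 51)  # False: inside the normal band
is_anomalous(baseline, 95)  # True: would fire the alert
```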
1. High Cardinality
- Problem: `{user_id=12345}` as a label creates one time series per user → billions of series
- Solution: never use high-cardinality values (user_id, request_id) as metric labels
- Rule: Labels should have bounded cardinality (service, region, status_code — hundreds of values, not millions)
2. Query Performance
- Pre-aggregate on write: store per-minute rollups alongside raw data
- Columnar layout: store values for a metric across time contiguously (great for range scans)
- Caching: popular dashboard queries cached in Redis for 30 seconds
3. Long-term Retention (Thanos pattern)
Prometheus: stores last 2 weeks in local disk
Thanos Sidecar: continuously uploads TSDB blocks to S3
Thanos Query: can query both local Prometheus AND S3 history
→ Infinite retention at S3 cost (~$23/TB/month)
| Decision | Option A | Option B | Our Choice | Why |
|---|---|---|---|---|
| Storage | SQL | Cassandra / InfluxDB | Cassandra | Time-series workload: high-write, range-scan by time |
| Ingest buffering | Direct DB write | Kafka buffer | Kafka | Protects DB from ingest spikes; enables replay |
| Downsampling | None (store all raw) | Tiered retention | Tiered | 5-year raw retention = petabytes; impractical |
| Alert detection | Threshold only | Threshold + anomaly | Both | Threshold catches known issues; anomaly catches surprises |
"Time-series data has a natural access pattern: you almost always query by service + metric + time range. This makes Cassandra ideal — partition by (service, metric, day) so any time range query is a single partition scan."
"We never let agents write directly to the time-series DB. Kafka acts as a buffer between ingest and storage. This means a 10× traffic spike doesn't crash the database — it just fills up the Kafka queue temporarily."
"Downsampling is not optional — it's an architectural necessity. Storing 500K metrics/sec at 10-sec resolution for 5 years requires ~3 PB. Downsampled to hourly, it's ~8 TB."
Covers: TicketMaster · Airbnb · OpenTable · Hotel booking · Flash sales · Seat selection
One-line pitch: "This is a high-concurrency inventory system where double-booking is catastrophically wrong. The core pattern is: pessimistic row-level locking for correctness, a virtual queue for surge events, and Redis counters for read-scaling — with PostgreSQL as the single source of truth."
- Finite, unique inventory (seats, rooms, time slots)
- Must prevent double-booking absolutely
- High-concurrency bursts (Taylor Swift tickets on sale)
- Payment must be atomic with reservation
Functional
- Search available inventory (dates, location, category)?
- Hold a seat/room for checkout (time-limited reservation)?
- Payment integration?
- Cancellation and refunds?
- Waiting list?
Non-Functional
- Peak concurrent users? (e.g., 10M during a Taylor Swift sale)
- Reservation hold time? (e.g., 10 minutes to complete payment)
- Overbooking allowed? (e.g., airlines sometimes; concerts never)
- Consistency requirement? (strong — no double-booking ever)
Normal load: 100K concurrent users browsing
Surge load: 10M users hitting at ticket sale start (Taylor Swift)
Venue capacity: 80,000 seats (Allegiant Stadium)
Hold time: 10 minutes
Requests at open: 10M users × 3 req/sec = 30M req/sec for first 10 sec
Target: Sell 80K seats, handle 10M+ concurrent users fairly
| Component | Purpose | Technology |
|---|---|---|
| API Gateway | Auth, rate limiting, routing | Kong / AWS ALB |
| Queue Service | Virtual waiting room for surge | Redis Sorted Set (FIFO by arrival time) |
| Reservation Service | Core booking logic, pessimistic row lock | Go + PostgreSQL |
| Inventory Cache | Available seat counts | Redis Counter |
| Payment Service | Charge user, confirm booking | Stripe / external payment gateway |
| Notification Service | Email/SMS confirmation | Kafka → SendGrid / Twilio |
| PostgreSQL | Source of truth: seats, bookings | PostgreSQL (ACID) |
| Expiry Worker | Release timed-out holds | Redis TTL + Cron |
Database Schema (Source of Truth)
CREATE TABLE seats (
seat_id BIGINT PRIMARY KEY,
event_id BIGINT NOT NULL,
section VARCHAR(10),
row VARCHAR(5),
number INT,
status ENUM('AVAILABLE', 'RESERVED', 'SOLD') DEFAULT 'AVAILABLE',
reserved_by BIGINT, -- user_id
reserved_until TIMESTAMP, -- hold expiry
version INT DEFAULT 0 -- for optimistic locking
);
CREATE TABLE bookings (
booking_id BIGINT PRIMARY KEY,
user_id BIGINT,
seat_ids BIGINT[],
event_id BIGINT,
amount_paid DECIMAL(10,2),
status ENUM('PENDING', 'CONFIRMED', 'CANCELLED'),
created_at TIMESTAMP DEFAULT NOW()
);
Normal Reservation Flow (Pessimistic Row-Level Locking)
POST /reserve {event_id, seat_ids, user_id}
1. BEGIN TRANSACTION
2. SELECT seat_id, status, version
FROM seats
WHERE seat_id IN (...) FOR UPDATE -- pessimistic lock (row-level)
3. IF any seat.status != 'AVAILABLE': ROLLBACK → 409 Conflict
4. UPDATE seats
SET status = 'RESERVED',
reserved_by = user_id,
reserved_until = NOW() + INTERVAL '10 minutes',
version = version + 1
WHERE seat_id IN (...) AND status = 'AVAILABLE'
5. COMMIT → return {reservation_id, expires_at}
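The guarantee of steps 2–4 can be modeled in miniature: serializing the check-then-update means two buyers racing for the same seat can never both succeed. This in-memory sketch uses a mutex as a stand-in for the row locks PostgreSQL takes with FOR UPDATE:

```python
import threading

class SeatStore:
    """In-memory stand-in for the seats table. The mutex plays the role of
    the row-level lock PostgreSQL takes on SELECT ... FOR UPDATE."""

    def __init__(self, seat_ids):
        self.status = {s: "AVAILABLE" for s in seat_ids}
        self.lock = threading.Lock()

    def reserve(self, seat_ids, user_id):
        with self.lock:  # serialize check-then-update, like the transaction
            if any(self.status[s] != "AVAILABLE" for s in seat_ids):
                return False  # maps to 409 Conflict
            for s in seat_ids:
                self.status[s] = f"RESERVED:{user_id}"
            return True

store = SeatStore(seat_ids=[1, 2, 3])
results = []
threads = [threading.Thread(target=lambda u=u: results.append(store.reserve([2], u)))
           for u in ("alice", "bob")]
for t in threads:
    t.start()
for t in threads:
    t.join()
# exactly one of the two racing reservations succeeds
```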
Surge Handling: Virtual Queue
Taylor Swift goes on sale at 10:00 AM:
1. Pre-sale: users join queue → ZADD queue:{event_id} {arrival_timestamp} {user_id}
User sees: "You are #234,521 in line. Estimated wait: 12 minutes."
2. Queue processor (every 100ms):
ZPOPMIN queue:{event_id} 500 -- dequeue 500 users
For each user: issue a time-limited "checkout token" (valid 10 min)
Redirect user to checkout with token
3. Checkout: user selects seats, submits payment using checkout token
→ Normal reservation flow above
→ Token consumed (single-use)
This limits concurrent DB load to 500 checkouts every 100ms = 5K/sec
vs. 30M concurrent → DB crash
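A minimal in-process model of the waiting room, using a heap as a stand-in for the Redis sorted set (join = ZADD keyed by arrival time, dequeue = ZPOPMIN per tick; batch size and timestamps are illustrative):

```python
import heapq

class VirtualQueue:
    """Minimal stand-in for the Redis sorted-set waiting room."""

    def __init__(self):
        self.heap = []  # (arrival_ts, user_id)

    def join(self, user_id, arrival_ts):
        heapq.heappush(self.heap, (arrival_ts, user_id))

    def position(self, user_id):
        """1-based place in line, for the 'You are #N in line' message."""
        ordered = sorted(self.heap)
        return next(i + 1 for i, (_, u) in enumerate(ordered) if u == user_id)

    def dequeue_batch(self, n=500):
        """One processor tick: admit the n earliest arrivals to checkout."""
        return [heapq.heappop(self.heap)[1] for _ in range(min(n, len(self.heap)))]

q = VirtualQueue()
for i, user in enumerate(["u3", "u1", "u2"]):
    q.join(user, arrival_ts=1000 + i)
spot = q.position("u2")             # 3: "You are #3 in line"
first_batch = q.dequeue_batch(n=2)  # ["u3", "u1"]: strict FIFO by arrival
```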
Payment → Confirmation
User submits payment during hold window:
→ Payment Service: charge card via Stripe
→ On success:
UPDATE seats SET status = 'SOLD' WHERE seat_id IN (...) AND reserved_by = user_id
INSERT INTO bookings (confirmed)
PUBLISH to Kafka "booking-confirmed"
→ Notification Service: email PDF ticket
On failure / expiry:
UPDATE seats SET status = 'AVAILABLE', reserved_by = NULL
PUBLISH to Kafka "seat-released"
→ Increment Redis counter: INCR seats_available:{event_id}
Read Scaling (Availability Queries)
"How many seats left for event X?"
→ Redis: GET seats_available:{event_id} -- O(1), <1ms
→ Never hit PostgreSQL for this query
"Show me available seats on the map?"
→ Redis hash: HGETALL seat_status:{event_id} -- all seat statuses in one call
→ Refresh from DB every 5 sec or on any state change
1. Preventing Double-Booking
- `SELECT ... FOR UPDATE` acquires a row-level lock for the duration of the transaction
- No two transactions can hold the same row lock simultaneously → guaranteed no double-booking
- Lock is held only milliseconds (fast DB transaction) — not for the 10-minute hold period
2. Seat Expiry
Option A: Cron job every 60 sec
UPDATE seats SET status='AVAILABLE'
WHERE status='RESERVED' AND reserved_until < NOW()
Option B: Redis TTL (preferred)
SET reservation:{seat_id} {user_id} EX 600 -- 10 min TTL
Redis keyspace notification fires on expiry
→ Listener updates PostgreSQL: status = AVAILABLE
3. Flash Sale: 10M Users, 80K Tickets
- 80K winners, 9.92M losers
- Virtual queue + checkout token = fair FIFO + bounded DB load
- Redis Sorted Set for queue: O(log N) insert, O(1) dequeue
| Decision | Option A | Option B | Our Choice | Why |
|---|---|---|---|---|
| Concurrency control | Optimistic locking | Pessimistic (FOR UPDATE) | Pessimistic | High contention for popular events; optimistic = too many retries |
| Surge handling | Rate limit (reject) | Virtual queue (wait) | Virtual queue | Users accept waiting; rejection causes user frustration + retry storms |
| Availability queries | DB query | Redis counter | Redis | 10M users checking availability → DB can't handle |
| Hold expiry | Cron job | Redis TTL | Redis TTL | Event-driven, more accurate than polling |
"The `SELECT ... FOR UPDATE` locks the seat row for the duration of the transaction — typically under 100ms. This is safe because we never hold DB locks across user-facing wait time. The 10-minute hold period is managed separately via Redis TTL."
"For a Taylor Swift-scale sale, the virtual queue is non-negotiable. Without it, 10M simultaneous requests hit the reservation service, which hits PostgreSQL, which immediately falls over. The queue limits concurrent DB writes to a manageable number."
"Availability counts come from Redis, not PostgreSQL. At 10M concurrent users refreshing every second, that's 10M DB reads/sec for a simple integer — unworkable. Redis serves this in under a millisecond."
Covers: Devin (coding agent) · Cursor · GitHub Copilot Workspace · AutoGPT · Customer support AI agents
One-line pitch: "An AI agent system is an LLM wrapped in a ReAct loop with tools — it Observes context, Thinks about what to do, Acts (calls a tool), then re-Observes the result. The key design challenges are: managing the context window, preventing runaway loops, and sandboxing actions for safety."
- An AI model that takes multi-step actions autonomously
- Needs to use tools (search, execute code, call APIs)
- Must handle failures and retry
- Safety and human oversight are requirements
Functional
- What is the agent's task domain? (code, customer support, research?)
- What tools does the agent need? (web search, code execution, file I/O, API calls?)
- How long can tasks run? (seconds? hours? days?)
- Human-in-the-loop at any step?
Non-Functional
- Max concurrent agents? (cost × compute)
- Max context window? (GPT-4: 128K tokens; Claude: 200K tokens)
- Latency per step? (LLM inference: 1–10 sec/step)
- Safety requirements? (what can the agent NOT do?)
Concurrent sessions: 10,000 agents running simultaneously
Steps per task: avg 20 steps (Observe → Think → Act cycle)
LLM inference/step: ~2 sec (GPT-4 Turbo)
Task duration: avg 40 sec (20 steps × 2 sec)
Cost per task: ~$0.10–$1.00 (varies by model + token count)
Context per step: ~10K tokens (grows with conversation history)
Vector DB queries: ~50ms per RAG retrieval
| Component | Purpose | Technology |
|---|---|---|
| Orchestrator | Manages agent lifecycle + task queue | Go service |
| LLM Gateway | Route to LLM, manage rate limits/retries | Python + LiteLLM |
| Sandbox VM | Isolated execution environment | Firecracker microVM |
| Tool Registry | Catalog of available tools + schemas | YAML/JSON config |
| Memory: Short-term | Current conversation context | In-process (token window) |
| Memory: Long-term | Persistent knowledge, past interactions | Vector DB (Pinecone / Weaviate) |
| Action Bus | Queue and execute tool calls | Kafka |
| Safety Layer | Filter dangerous actions before execution | Rule engine + LLM-as-judge |
| Observability | Trace every step, token, tool call | LangSmith / Datadog |
ReAct Loop (the core pattern)
Task: "Fix the failing test in src/payment/processor.py"
STEP 1: Observe
Context: {task, codebase_summary, available_tools}
Available tools: [read_file, write_file, run_tests, grep, web_search]
STEP 2: Think (LLM inference)
Prompt → LLM → Response:
"Thought: I need to read the failing test file to understand what's breaking.
Action: read_file(path='tests/test_payment.py')"
STEP 3: Act
Tool dispatcher → calls read_file()
Returns: file contents (added to context)
STEP 4: Observe (new context)
Context now includes: task + file contents
REPEAT until LLM outputs: "Thought: I have fixed the bug. Final Answer: ..."
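The loop skeleton, with a stubbed LLM and a toy tool registry (every name here is illustrative, not a real model API):

```python
def react_loop(task, llm, tools, max_steps=50):
    """Skeleton of the Observe -> Think -> Act cycle. `llm` is any callable
    returning ("act", tool_name, args) or ("finish", answer); here it is a
    stub, not a real model client."""
    context = [f"Task: {task}"]
    for _ in range(max_steps):  # hard step limit guards against runaway loops
        decision = llm(context)                 # Think
        if decision[0] == "finish":
            return decision[1]
        _, tool_name, args = decision           # Act
        result = tools[tool_name](**args)
        context.append(f"{tool_name}({args}) -> {result}")  # Observe
    raise RuntimeError("step limit exceeded")

def stub_llm(context):
    """Fake model: read the test file on the first step, then finish."""
    if len(context) == 1:
        return ("act", "read_file", {"path": "tests/test_payment.py"})
    return ("finish", "bug located: assertion uses stale fixture")

tools = {"read_file": lambda path: f"<contents of {path}>"}
answer = react_loop("Fix the failing test", stub_llm, tools)
```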
Memory Architecture
Short-term (context window):
Full conversation history up to token limit
When near limit: summarize older steps → compress into single message
Long-term (Vector DB RAG):
Embeddings of: codebase files, documentation, past solutions
On each step: retrieve top-5 relevant chunks
Prompt: "Here is relevant context: [chunks]\n\nNow: [current task]"
Episodic memory:
Store completed tasks + outcomes in DB
Future agents can search: "Have we solved this type of bug before?"
Sandboxed Execution
For code execution:
Spawn fresh Firecracker microVM (boot time ~125ms)
Mount repo at /workspace (read-write)
Set limits: CPU 2 cores, RAM 512MB, wall time 30sec, no network
Syscall filter (seccomp): whitelist only {read, write, open, exec, fork}
Execute command → capture stdout/stderr → return to agent
Destroy VM after each step (ephemeral, no state leaks between tasks)
Safety Layer
Before every tool call:
1. Rule-based filter:
- BLOCK: rm -rf /, git push --force main, curl malicious.com
- WARN: any write outside /workspace
- ALLOW: everything else
2. (Optional) LLM-as-judge:
"Is this action safe? {action, context}" → YES/NO + reason
3. Human approval gate (configurable per task):
- Auto: proceed with safe actions
- Supervised: pause, show human, require confirmation for writes
Multi-Agent Architecture (for complex tasks)
Manager Agent: decomposes task → subtasks → assigns to Worker Agents
Worker A: "Reproduce the bug" (read + run tests)
Worker B: "Research similar bugs in codebase" (RAG search)
Worker C: "Write the fix" (write_file)
Manager: collects results → synthesizes → verifies → delivers
Communication: via shared Task DB (not direct agent-to-agent calls)
1. Context Window Management
- Every step adds to the context (Observe output + LLM response)
- At ~80% of token limit: summarize oldest steps ("I have read X files and found Y...")
- Critical: never truncate without summarizing — agent loses memory of past actions
2. Preventing Infinite Loops
- Hard limit: max 50 steps per task (configurable)
- Loop detection: if LLM generates same action 3× in a row → force terminate
- Cost guard: if task exceeds $5 in LLM spend → pause, alert human
3. Tool Failures
- LLM may hallucinate a tool call: `call_api(url="https://not-real.com/endpoint")`
- Solution: strict JSON schema for tool calls + validate before execution
- Failed tool → add error to context → LLM retries with different approach (max 3 retries per tool)
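A sketch of the pre-execution validation step: reject hallucinated tool names or malformed arguments, and feed the error string back into the context instead of running anything (the registry contents are illustrative):

```python
TOOL_REGISTRY = {
    "read_file": {"path"},             # tool name -> required argument names
    "write_file": {"path", "content"},
}

def validate_tool_call(name, args):
    """Runs before dispatch; on error, the string goes back into the agent's
    context so the LLM can retry with a different approach."""
    if name not in TOOL_REGISTRY:
        return f"error: unknown tool '{name}'"
    missing = TOOL_REGISTRY[name] - set(args)
    if missing:
        return f"error: missing args {sorted(missing)}"
    return "ok"

validate_tool_call("call_api", {"url": "https://not-real.com"})  # unknown tool
validate_tool_call("read_file", {"path": "a.py"})                # "ok"
```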
| Decision | Option A | Option B | Our Choice | Why |
|---|---|---|---|---|
| Memory | Full history in context | RAG + summarization | RAG + summarization | Context window is finite; 200K tokens ≈ 150K words |
| Execution | Direct on host | Sandboxed VM | Sandboxed VM | Agents run untrusted code; isolation is non-negotiable |
| Safety | Rule-based only | LLM-as-judge | Both | Rules catch known bad patterns; LLM catches novel ones |
| Architecture | Single agent | Multi-agent | Multi-agent for complex tasks | Parallelism + specialization; single agent bottlenecks on long tasks |
"The ReAct loop — Reason then Act — is the fundamental pattern. The agent doesn't just call a function; it explains its reasoning in the prompt, then takes an action. This makes it debuggable and steerable."
"The hardest engineering problem is context management. Every observation you add to the prompt costs tokens. At 200K tokens, you run out of space mid-task. The solution is progressive summarization — compress old steps into a summary before they hit the limit."
"Safety is an infrastructure concern, not an LLM concern. You can't rely on the model to refuse dangerous actions. Sandboxed VMs, syscall filters, and rule-based blocklists enforce safety at the infrastructure layer."
Covers: Google Search autocomplete · Twitter search · E-commerce search · App global search
One-line pitch: "Typeahead is a latency problem. The user types a character every ~100ms and expects suggestions in under 50ms — total round-trip. The answer is pre-computing top-N completions per prefix and serving them from an in-memory Trie or Redis Sorted Sets, not running live queries."
- Need to show suggestions as the user types
- Latency requirement: < 50ms
- Popularity-ranked results (not alphabetical)
- Large query volume (billions of searches/day)
Functional
- Suggestions based on historical search popularity?
- Personalized suggestions (per user history)?
- Real-time trending (Twitter-style)?
- Filtering by category or type?
Non-Functional
- Daily active users? (e.g., 1B)
- Latency target? (< 50ms)
- Number of distinct queries to index? (e.g., top 10M)
- How fresh must suggestions be? (real-time vs. daily batch)
DAU: 1B users
Searches/day: 5B → ~58K searches/sec
Keystrokes per search: avg 5 (each sends a suggestion request)
Suggestion requests: 58K × 5 = 290K req/sec
Distinct queries: Top 10M (80% of searches covered)
Trie size: 10M prefixes × avg 10 completions × 50 bytes = ~5 GB (fits in RAM)
Redis Sorted Set: 10M prefixes × ~1 KB each = ~10 GB Redis
Latency target: < 50ms (including network)
| Component | Purpose | Technology |
|---|---|---|
| Aggregation Pipeline | Count query frequencies | Kafka → Spark batch job |
| Prefix Computation | Build top-10 per prefix | Spark/Python daily job |
| Trie Service | In-memory prefix tree with completions | Go (custom Trie) |
| Redis Sorted Sets | Alternative: per-prefix ranked completions | Redis ZADD/ZREVRANGE |
| Suggestion API | Serve suggestions to clients | Go (stateless, horizontal scale) |
| CDN Cache | Cache common prefixes at edge | CloudFront |
| Trending Layer | Real-time freshness for trending queries | Kafka Streams → Redis |
Data Collection
Every search → log event: {query: "apple", timestamp, user_id, result_count}
→ Kafka "search-events" topic
→ Daily Spark batch job:
SELECT query, COUNT(*) as frequency
FROM search_events
WHERE date = yesterday
GROUP BY query
ORDER BY frequency DESC
LIMIT 10M
→ Store: {query, frequency} in PostgreSQL
Prefix Computation (build the index)
For each query in top 10M (e.g., "apple"):
Generate all prefixes: "a", "ap", "app", "appl", "apple"
For prefix "app":
Top completions = top 10 queries starting with "app" by frequency
→ ["apple" (5B), "app store" (3B), "application" (1B), ...]
Store in Redis:
ZADD "prefix:app" 5000000000 "apple"
ZADD "prefix:app" 3000000000 "app store"
ZADD "prefix:app" 1000000000 "application"
...
Serving Suggestions
User types "app" (after 100ms debounce):
→ GET /suggest?q=app
→ Suggestion API:
Redis ZREVRANGE "prefix:app" 0 9 -- top 10, descending by score → <1ms
(+ personalization layer: boost user's past searches)
→ Return: ["apple", "app store", "application", ...]
→ Client displays in dropdown
Trie (in-memory alternative)
Trie node structure:
{
char: 'p',
children: {Map<char, TrieNode>},
top_completions: ["apple", "app store"] // cached at each node
}
Lookup "app":
Navigate: root → 'a' → 'p' → 'p'
Return node.top_completions → O(prefix_length) = O(3) → near-zero latency
Update: run nightly batch → rebuild Trie in background → hot-swap pointer
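A compact version of this Trie, with the top-N list cached at every node so serving is pure traversal (N and the seed frequencies are illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top_completions = []  # pre-ranked (frequency, query) pairs

class SuggestTrie:
    """Each node caches its own top-N list, so serving a prefix is pure
    pointer-chasing: O(prefix_length), no ranking work at query time."""

    def __init__(self, n=10):
        self.root, self.n = TrieNode(), n

    def insert(self, query, frequency):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top_completions.append((frequency, query))
            node.top_completions.sort(reverse=True)
            del node.top_completions[self.n:]  # keep only the top N

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for _, q in node.top_completions]

trie = SuggestTrie(n=3)
for query, freq in [("apple", 5_000_000_000), ("app store", 3_000_000_000),
                    ("application", 1_000_000_000), ("banana", 900)]:
    trie.insert(query, freq)
trie.suggest("app")  # ["apple", "app store", "application"]
```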
Real-time Trending (for Twitter-style)
Kafka Stream: all search queries in real-time
→ Flink: sliding 5-minute window → count queries
→ ZINCRBY "prefix:{p}" {delta} "{query}" for each prefix of trending query
→ Serves fresher suggestions (Twitter trending hashtags update in real-time)
Two-tier:
Trending layer: last 5 min (Kafka → Flink → Redis)
Historical layer: top queries by all-time frequency (daily batch)
Merge at serve time: blend trending boost into historical score
1. Trie vs. Redis Sorted Sets
| In-Memory Trie | Redis Sorted Sets | |
|---|---|---|
| Lookup latency | O(prefix_len) in-process | O(log N) + network hop |
| Memory | ~5 GB (must fit in RAM) | ~10 GB (Redis) |
| Updates | Rebuild + hot-swap (complex) | Incremental ZADD (simple) |
| Horizontal scale | Shard by prefix | Redis Cluster |
| Best for | Ultra-low latency, high read | Simpler ops, incremental updates |
2. Personalization
- Maintain per-user search history in Redis: `ZADD user:{id}:searches {timestamp} {query}`
- At serve time: boost the score of the user's past searches by 2×
- For new users: serve global suggestions only (cold-start handled automatically)
3. CDN Caching
- Very common prefixes ("a", "ap", "go", "py") can be cached at CDN edge
- Cache key: prefix + language (English vs. Spanish vs. etc.)
- TTL: 1 hour (balances freshness with cache hit rate)
- ~80% of all keystrokes hit CDN cache → reduces origin load dramatically
| Decision | Option A | Option B | Our Choice | Why |
|---|---|---|---|---|
| Index storage | In-memory Trie | Redis ZSET | Redis (for most) | Simpler to operate; incremental updates; Trie for Google-scale |
| Freshness | Daily batch rebuild | Real-time Kafka | Hybrid | Daily = stable historical; Kafka = trending on top |
| Debounce | None (every keystroke) | 100ms debounce | 100ms debounce | 100ms unnoticeable to users; 10× fewer API calls |
| Personalization | Global ranking only | Per-user history boost | Hybrid | Cold start = global; warm user = personalized blend |
"Typeahead is solved by pre-computation, not real-time querying. We pre-compute the top 10 completions for every prefix in the top 10M queries. Then serving a suggestion is a single Redis ZREVRANGE call — under 1ms."
"The debounce is critical. Without it, typing 'apple' sends 5 requests ('a', 'ap', 'app', 'appl', 'apple'). With a 100ms debounce, keystrokes typed faster than 100ms apart collapse into a single request — a several-fold reduction in API calls with zero perceived latency difference."
"For personalization, I'd maintain a Redis Sorted Set per user of their recent searches. At serve time, I'd boost the score of any suggestion that matches a user's past query. This gives instant personalization with no ML model needed."
Covers: Google Docs · Figma · Notion · Miro · Confluence · Collaborative code editors
One-line pitch: "Collaborative editing is a conflict resolution problem. When two users edit the same position simultaneously, naive last-write-wins corrupts the document. The industry solutions are OT (Operational Transformation, used by Google Docs) — which transforms concurrent ops on a central server — or CRDT (used by Figma) — which makes conflict resolution deterministic and peer-to-peer."
- Multiple users editing the same document simultaneously
- All users must converge to the same state
- Offline editing + reconnect required
- Real-time presence (cursors, selections)
Functional
- Text editing? Rich text? Drawings/shapes (Figma)?
- Real-time cursor/presence visibility?
- Version history / undo?
- Comments and suggestions?
- Offline support?
Non-Functional
- Max concurrent editors per doc? (e.g., 100)
- Max document size? (e.g., 10MB)
- Latency for edit propagation? (e.g., < 200ms)
- Version history retention? (e.g., 30 days)
Documents: 1B
Active docs (editing): 10M concurrently
Editors per doc: avg 3, max 100
Operations/sec/editor: ~10 (fast typing)
Total ops/sec: 10M × 3 × 10 = 300M ops/sec
Op size: ~100 bytes (type, position, character)
Storage/day: 300M × 100B × 86400 = ~2.5 PB/day (raw ops)
Snapshots: every 100 ops → 99% reduction = ~25 TB/day
| Component | Purpose | Technology |
|---|---|---|
| WebSocket Server | Persistent connections per editor | Go / Elixir |
| Document Server | In-memory document state + OT engine | Stateful Go service |
| Operation Log | Append-only history of all operations | PostgreSQL / Cassandra |
| Snapshot Store | Periodic full document state | S3 / PostgreSQL |
| Redis Pub/Sub | Broadcast ops to all editors | Redis |
| Presence Service | Cursors, selections, online status | Redis (ephemeral) |
| OT Engine | Transform concurrent operations | Custom library / ShareDB |
| Load Balancer | Consistent hash: doc_id → same server | NGINX + consistent hash |
Connection & Routing
User opens document → WebSocket connects to Document Server
Load balancer: hash(doc_id) → always routes to SAME Document Server
→ Why: that server holds the doc in memory + the OT state machine
Document Server:
1. Load doc from snapshot (latest snapshot + replay ops since it)
2. Register client in local session map: {user_id, cursor_position, color}
3. Subscribe to Redis channel "doc:{id}:ops"
Edit Flow (OT: Operational Transformation)
User A types "X" at position 5:
Client sends: {op: INSERT, pos: 5, char: "X", clientVersion: 42}
Document Server:
1. Check: server is at version 50, client sent version 42
2. Get all ops from version 42 → 50 from operation log
3. Transform client op against those 8 ops:
- Op 43 was INSERT at pos 3 → client's pos 5 becomes pos 6
- Op 47 was DELETE at pos 1 → client's pos 6 becomes pos 5
→ Transformed: {op: INSERT, pos: 5, char: "X", version: 51}
4. Apply to in-memory document → document version = 51
5. Persist to operation log: {doc_id=X, version=51, op=..., user_id=A, ts=...}
6. Broadcast to all other editors via Redis Pub/Sub: {version=51, op=...}
7. ACK to User A: {serverVersion: 51}
All other editors:
Receive op via WebSocket → apply to local copy
Document converges to same state everywhere
OT Conflict Example
Both User A and User B start from "hello" at version 10:
User A: INSERT 'X' at pos 5 → sends {v:10, INSERT, 5, X}
User B: INSERT 'Y' at pos 5 → sends {v:10, INSERT, 5, Y}
Server receives A first:
Apply A → document = "helloX" at version 11
Transform B's op against A's: position 5 → 6 (A inserted before B's position)
Apply transformed B → document = "helloXY" at version 12
Both A and B receive op 11 (X at 5) and op 12 (Y at 6)
Both converge to: "helloXY" ✓ No data loss. Both characters preserved.
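The insert-vs-insert transform from this example fits in a few lines (a sketch of just this one case; production OT also handles deletes and breaks position ties deterministically, e.g. by client ID):

```python
def transform_insert(op, applied):
    """Shift op's position to account for a concurrent insert the server
    applied first (the insert-vs-insert case above)."""
    pos, ch = op
    applied_pos, _ = applied
    if applied_pos <= pos:
        return (pos + 1, ch)  # the earlier insert pushed our position right
    return op

def apply_insert(doc, op):
    pos, ch = op
    return doc[:pos] + ch + doc[pos:]

doc = "hello"
op_a = (5, "X")                      # User A's op, applied first
op_b = (5, "Y")                      # User B's concurrent op
doc = apply_insert(doc, op_a)        # "helloX"
op_b = transform_insert(op_b, op_a)  # position 5 becomes 6
doc = apply_insert(doc, op_b)        # "helloXY": both characters preserved
```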
CRDT Alternative (Figma, Notion)
Each character has a globally unique ID: (user_id, local_sequence_number)
Document = ordered set of (ID, char) pairs
Insert: add new (ID, char) pair; order determined by ID (no position conflicts)
Delete: mark as "tombstone" (ID preserved, char = NULL)
Merge: take union of all operations (commutative, associative, idempotent)
→ No server needed for conflict resolution
→ Works offline: queue ops locally, sync on reconnect
→ The merge is always deterministic, regardless of order received
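A drastically simplified sequence CRDT illustrating why the merge is order-independent: the document is a map from unique IDs to characters, deletes are tombstones, and merge is a delete-wins union (real systems such as Yjs use far more compact ID and ordering schemes):

```python
def merge(a, b):
    """Delete-wins union of two replicas' {ID: char} maps. Commutative,
    associative, and idempotent, so every replica converges regardless of
    the order updates arrive in. A tombstone is char = None."""
    out = dict(a)
    for key, val in b.items():
        out[key] = None if val is None or out.get(key, val) is None else val
    return out

def render(doc):
    """Order comes purely from the IDs (user_id, seq): no position conflicts."""
    return "".join(ch for _, ch in sorted(doc.items()) if ch is not None)

alice = {(1, 0): "h", (1, 1): "i"}
bob = {(1, 0): "h", (1, 1): "i", (2, 0): "!"}  # Bob appended "!" offline
alice[(1, 1)] = None                           # Alice deleted the "i"
merged = merge(alice, bob)
render(merged)                                 # "h!"
merge(alice, bob) == merge(bob, alice)         # True: order-independent
```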
Persistence: Snapshots + Operation Log
Every 100 operations:
Create snapshot: {doc_id, version, full_content, timestamp}
Store in S3 (large) or PostgreSQL (small docs)
Delete operations older than snapshot version (optional, keep for history)
Loading a document:
Fetch latest snapshot (version N)
Fetch ops from version N+1 to current
Apply ops to snapshot → current document state
→ Even with 1M ops in history, load time = snapshot + (ops since last snapshot)
Presence (Cursors & Selections)
Every 100ms: client sends {user_id, cursor_pos, selection_range}
→ Document Server: broadcast to all editors via WebSocket
→ Store in Redis: SETEX presence:{doc_id}:{user_id} 5 {cursor_data} (5-sec TTL)
→ NOT persisted to PostgreSQL (ephemeral — disappears when user leaves)
Client renders: colored cursor + name label for each other editor
1. OT vs CRDT — When to Choose Each
| OT (Google Docs) | CRDT (Figma/Notion) | |
|---|---|---|
| Server needed | Yes (for op serialization) | No (peer-to-peer possible) |
| Offline support | Complex (server must reconcile) | Natural (merge on reconnect) |
| Complexity | Transformation logic is tricky | Tombstones, ordering, size |
| Performance | Fast (server is single source of truth) | CRDT size grows with history |
| Best for | Central-server architecture | P2P or offline-first apps |
2. Server Failover
- Document Servers are stateful → if one crashes, its in-memory doc state is lost
- Solution: every op is persisted to operation log before ACK
- On crash: new server reloads from snapshot + replays ops from log
- Clients detect disconnection → show "reconnecting" → re-subscribe
3. Large Documents
- Document grows over time → in-memory state grows
- Mitigation: create snapshots frequently + stream ops (don't load full history)
- Max document size limit prevents abuse (e.g., 10MB for Google Docs)
| Decision | Option A | Option B | Our Choice | Why |
|---|---|---|---|---|
| Conflict resolution | OT | CRDT | OT for text, CRDT for Figma | Text editing → OT is simpler; vector graphics → CRDT handles spatial data better |
| Server architecture | Stateless | Stateful (doc in memory) | Stateful | OT requires serialized op processing; stateless would need distributed locks |
| Presence | Persistent (DB) | Ephemeral (Redis TTL) | Ephemeral | Cursor positions are meaningless after user leaves; save storage |
| Snapshot frequency | Rarely (save storage) | Frequently (fast load) | Every 100 ops | Balance load time vs. storage; 100 ops replays in <10ms |
"The core problem is: two users type at the same position simultaneously. Last-write-wins silently deletes content — unacceptable. OT transforms concurrent operations to adjust positions, so both characters are always preserved."
"The Document Server must be stateful — one server owns each document. This is different from typical stateless HTTP services. The load balancer uses consistent hashing on document_id so all editors of the same doc land on the same server."
"Presence (cursors) is ephemeral by design. We use Redis with a 5-second TTL. When a user leaves, their cursor disappears automatically. We never write cursor positions to PostgreSQL — it would create millions of meaningless writes."
Covers: LeetCode · HackerRank · Codeforces · GitHub Actions (CI runners) · Repl.it · Code Interview platforms
One-line pitch: "Code execution is a security problem first, a performance problem second. User-submitted code is fundamentally untrusted — it can do anything. The solution is mandatory sandbox isolation (Firecracker VM / Docker with seccomp) per submission, async job queues to handle bursts, and strict resource limits enforced at the kernel level."
- Need to execute arbitrary user-submitted code
- Results must be delivered back to the user
- Security: one user's code must not affect another's
- Burst traffic (contest start: 50K submissions simultaneously)
Functional
- Which languages? (Python, Java, C++, etc.)
- Time limit per submission? (e.g., 2 seconds)
- Memory limit? (e.g., 256 MB)
- Test case visibility? (show test cases to user or not?)
- Plagiarism detection?
Non-Functional
- Submissions/sec at peak? (e.g., 10K/sec during contest start)
- Max queue depth? (acceptable wait time? e.g., < 10 sec)
- Result delivery: polling vs. WebSocket push?
Normal load: 1K submissions/sec
Contest peak: 50K submissions/sec for first 60 sec
Execution time: avg 1 sec (Python), 0.1 sec (C++)
Workers needed: 50K submissions/sec × 1 sec execution = 50K concurrent executions at peak (Little's law)
Autoscale time: KEDA scales from 100 → 5K workers in ~2 min
Queue wait: while autoscaling catches up, the backlog grows at (arrival rate − worker throughput); capacity is sized to keep wait ≈ 10 sec
Sandbox boot: Docker ~500ms, Firecracker ~125ms
Result storage: 50K × 10 KB ≈ 500 MB at peak (Redis, 1-hour TTL)
| Component | Purpose | Technology |
|---|---|---|
| Submit API | Accept submission, enqueue job | Go / Node.js |
| Job Queue | Buffer submissions, decouple load | SQS / Kafka |
| Worker Fleet | Execute code in sandbox | Kubernetes + KEDA autoscaler |
| Sandbox | Isolated execution environment | Firecracker microVM or Docker + seccomp |
| Language Runtimes | Pre-installed interpreter/compiler | Docker images per language |
| Result Cache | Store results (TTL 1 hour) | Redis |
| Result DB | Persistent submission history | PostgreSQL |
| WebSocket Server | Push result to user when ready | Go |
| Test Case Store | Input/output files per problem | S3 |
Submission Flow
POST /submit {code, language, problem_id, user_id}
→ API Gateway:
1. Authenticate user
2. Validate: size < 64KB, language in whitelist
3. Generate submission_id (Snowflake)
4. Enqueue to SQS: {submission_id, user_id, problem_id, code, language}
5. Return HTTP 202 Accepted + {submission_id}
(Never block HTTP request waiting for code execution!)
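Steps 2 and 5 of the flow above can be sketched as a small validation function. This is a minimal sketch — the whitelist contents and message strings are illustrative, not from the source:

```python
MAX_CODE_BYTES = 64 * 1024
ALLOWED_LANGUAGES = {"python", "java", "cpp"}   # illustrative whitelist

def validate_submission(code: str, language: str):
    """Return an (http_status, message) pair; 202 means 'enqueued'."""
    if len(code.encode("utf-8")) > MAX_CODE_BYTES:
        return 413, "code exceeds 64 KB limit"
    if language not in ALLOWED_LANGUAGES:
        return 400, "language not in whitelist"
    return 202, "accepted"

print(validate_submission("print('hi')", "python"))  # (202, 'accepted')
print(validate_submission("x" * 70_000, "python"))   # 413 — too large
```

The 202 path is the whole point: the API's only jobs are to reject garbage cheaply and enqueue; everything expensive happens in the workers.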
Worker Execution
Worker pops job from SQS:
1. Fetch test cases from S3: {input_1.txt, output_1.txt, ...}
2. Pull language runtime image (pre-cached on worker)
3. Spawn Firecracker microVM:
- Mount code file at /code/solution.{ext}
- Mount test cases at /tests/
- Limits: CPU 1 core (CPU quota), RAM 256MB (cgroup), wall time 2s (SIGKILL)
- Network: BLOCKED (no internet access from sandbox)
- Filesystem: read-only except /tmp (tmpfs, 64MB)
- Syscalls: seccomp whitelist (no fork bomb: clone restricted; no exec of arbitrary binaries)
4. Run: {python3 /code/solution.py < /tests/input_1.txt}
5. Compare stdout to /tests/output_1.txt
6. Collect: {verdict, runtime_ms, memory_mb, stdout_snippet, stderr_snippet}
7. Destroy VM immediately (no shared state between submissions)
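Steps 4–6 of the worker loop can be modeled in a few lines. This sketch only captures the wall-clock limit and stdout comparison — in production the command runs inside the Firecracker VM / container with the cgroup, seccomp, and network restrictions listed above, which `subprocess` alone does not provide:

```python
import subprocess
import sys

def judge(cmd, stdin_text, expected, time_limit_s=2.0):
    """Run one test case and return a verdict string."""
    try:
        proc = subprocess.run(cmd, input=stdin_text, capture_output=True,
                              text=True, timeout=time_limit_s)
    except subprocess.TimeoutExpired:
        return "TIME_LIMIT_EXCEEDED"   # wall-clock SIGKILL
    if proc.returncode != 0:
        return "RUNTIME_ERROR"
    # Compare stdout to the expected output file's contents
    return "ACCEPTED" if proc.stdout.strip() == expected.strip() else "WRONG_ANSWER"

solution = [sys.executable, "-c", "print(int(input()) * 2)"]
print(judge(solution, "21", "42"))   # ACCEPTED
print(judge(solution, "21", "43"))   # WRONG_ANSWER
print(judge([sys.executable, "-c", "while True: pass"], "", "", 0.5))
# → TIME_LIMIT_EXCEEDED
```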
Result Storage & Delivery
Worker writes result:
1. Redis SET submission:{id} {result_json} EX 3600 (1-hour TTL)
2. PostgreSQL INSERT submissions (user_id, problem_id, verdict, runtime_ms, code, timestamp)
3. Publish to Kafka "results": {submission_id, user_id, verdict}
Result delivery:
Option A: Polling
Client GET /submissions/{id}/result every 1 sec
Returns 202 (pending) or 200 + result (complete)
Simple but adds 0-1 sec delay + unnecessary load
Option B: WebSocket Push (preferred)
User connects WebSocket on submission page
Result Service subscribes to Kafka "results" topic
On result event: push directly to user's WebSocket
→ Result appears < 100 ms after execution completes
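Option A in miniature — a plain dict stands in for Redis (`submission:{id}` keys with a 1-hour TTL), so this is a sketch of the status logic only:

```python
def poll_result(cache, submission_id):
    """202 while pending, 200 + body once the worker has written the result."""
    result = cache.get(f"submission:{submission_id}")
    return (202, None) if result is None else (200, result)

cache = {}
print(poll_result(cache, "42"))                    # (202, None) — still running
cache["submission:42"] = {"verdict": "ACCEPTED"}   # worker finishes
print(poll_result(cache, "42"))                    # (200, {'verdict': 'ACCEPTED'})
```

The same lookup backs Option B: the WebSocket path just replaces the client's retry loop with a Kafka-triggered push of that cached result.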
Contest Autoscaling
Normal: SQS queue depth ~0, 100 workers idle
Contest starts: 50K submissions arrive in 60 sec
→ SQS queue depth spikes to 50K
→ KEDA (Kubernetes autoscaler) monitors queue depth:
if depth > 1000: add workers (scale out)
target: 1 worker per 10 queue items (so a 50K-deep queue hits the 5K cap)
max workers: 5,000 (capacity limit)
→ Workers scale 100 → 5,000 in ~120 sec
→ Queue drains within ~10 min for a 50K submission burst
→ After contest: KEDA scales back to 100 workers (saves cost)
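The scaling target above reduces to one clamped division. `items_per_worker=10` is an assumption chosen so the 50K-deep contest queue maps to the 5,000-worker ceiling used here; real KEDA reads the SQS depth and computes the replica count for you:

```python
def desired_workers(queue_depth, items_per_worker=10, floor=100, ceiling=5000):
    """KEDA-style target replica count, clamped to the fleet limits."""
    want = -(-queue_depth // items_per_worker)   # ceiling division
    return max(floor, min(ceiling, want))

print(desired_workers(0))        # 100  — idle floor, keeps warm capacity
print(desired_workers(2_500))    # 250  — mid-contest steady state
print(desired_workers(50_000))   # 5000 — contest burst, hits the ceiling
```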
Security Layers
Layer 1: Input validation → size limit, language whitelist
Layer 2: Container image → minimal base image (no bash, no curl, no gcc unless needed)
Layer 3: User namespace → run as non-root inside container
Layer 4: Seccomp filter → syscall whitelist (prevents fork bomb, exec abuse)
Layer 5: cgroup limits → CPU/memory/disk enforced at kernel level
Layer 6: Network policy → Kubernetes NetworkPolicy blocks all egress
Layer 7: Wall time limit → SIGKILL after 2 sec (prevents infinite loops)
Layer 8: VM isolation → Firecracker: separate kernel per submission (strongest)
"Defense in depth" — if any one layer is bypassed, the next still holds
1. Docker vs. Firecracker
| | Docker + seccomp | Firecracker microVM |
|---|---|---|
| Boot time | ~500ms | ~125ms |
| Isolation | Process (shared kernel) | Hardware (separate kernel) |
| Security | Good (seccomp limits) | Excellent (kernel exploit can't escape) |
| Overhead | Low | ~10MB RAM per VM |
| Best for | Trusted code (CI pipelines) | Untrusted user code (LeetCode) |
2. Preventing Fork Bombs
# This kills a normal Linux system:
import os
while True:
    os.fork()
# Our defense:
# 1. seccomp: restrict clone() syscall to limit process creation
# 2. cgroup pids.max = 50 (max 50 processes per submission)
# 3. SIGKILL after 2 seconds (wall clock)
→ Fork bomb creates 50 processes max, then killed after 2 sec
3. Test Case Security
- Problem: if test cases are sent to the worker, a malicious user could exfiltrate them from the sandbox
- Solution: test cases never leave S3 unencrypted; mounted read-only into VM; output comparison happens outside the sandbox
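Defense #2 against fork bombs (a per-submission process cap) can be approximated in userspace with `RLIMIT_NPROC` — a rough stand-in for cgroup `pids.max`, shown here only to make the kernel-enforcement idea concrete. POSIX-only (`preexec_fn` is unavailable on Windows), and note `RLIMIT_NPROC` counts per-uid, not per-submission, which is exactly why production uses cgroups:

```python
import resource
import subprocess
import sys

def run_limited(cmd, max_procs=50, wall_time_s=2.0):
    """Run cmd with a process-count cap and a wall-clock kill."""
    def apply_limits():
        # Runs in the child after fork, before exec — new forks beyond
        # the cap fail with EAGAIN instead of taking down the host.
        resource.setrlimit(resource.RLIMIT_NPROC, (max_procs, max_procs))
    try:
        proc = subprocess.run(cmd, preexec_fn=apply_limits,
                              capture_output=True, timeout=wall_time_s)
        return proc.returncode
    except subprocess.TimeoutExpired:
        return "TIME_LIMIT_EXCEEDED"   # killed at the wall-clock limit

print(run_limited([sys.executable, "-c", "print('ok')"]))   # 0
print(run_limited([sys.executable, "-c", "import time; time.sleep(10)"],
                  wall_time_s=0.5))
# → TIME_LIMIT_EXCEEDED
```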
| Decision | Option A | Option B | Our Choice | Why |
|---|---|---|---|---|
| Execution model | Sync (HTTP request blocks) | Async (queue + poll/push) | Async | Code execution takes 1–10 sec; HTTP timeout risk + thread starvation |
| Isolation | Docker | Firecracker | Firecracker | Stronger kernel isolation; user code is untrusted |
| Result delivery | Polling | WebSocket push | WebSocket | Instant UX; polling adds latency + unnecessary load |
| Scaling | Fixed fleet | KEDA autoscale | KEDA | Contest bursts are 50× normal load; fixed fleet wastes money 95% of the time |
"The HTTP endpoint returns 202 Accepted immediately — it never waits for code to execute. This is critical. A 2-second execution time × 50K concurrent submissions = 100K threads blocked if we handle it synchronously. The queue decouples submission from execution."
"Security is infrastructure, not trust. We don't try to detect malicious code or trust users to behave — we enforce limits at the kernel level: seccomp syscall filter, cgroup resource limits, and network policy. Even if the code breaks out of the application layer, it can't escape cgroups."
"KEDA autoscaling is the cost story. Maintaining 5,000 workers 24/7 to handle contest peaks costs $50K/month in EC2. With autoscaling, we pay for 100 workers normally and burst to 5,000 only during the 10 minutes of the contest start."
Read the question. Identify the core constraint. Pick the template. You'll know what to say.
| If the question mentions… | Template | Core Challenge |
|---|---|---|
| YouTube, Netflix, Dropbox, media streaming | 1 — Read-Heavy | CDN + async transcoding |
| Twitter, Instagram, Facebook feed, news feed | 2 — Social Feed | Fan-out at scale (push vs pull) |
| WhatsApp, Slack, Messenger, chat | 3 — Real-time Chat | WebSocket routing + offline delivery |
| Uber, Yelp, Maps, "nearest X" | 4 — Location | Geospatial index (Redis Geo) |
| TinyURL, bit.ly, short links | 5 — URL Shortener | Unique ID + redirect at scale |
| Rate limiting, throttling, quotas | 6 — Rate Limiter | Atomic check-and-decrement (Lua) |
| Datadog, Prometheus, monitoring, metrics | 7 — Monitoring | Time-series ingestion + downsampling |
| TicketMaster, Airbnb, booking, reservations | 8 — Booking | Double-booking prevention + surge queue |
| AI agent, Devin, Cursor, autonomous AI | 9 — AI Agent | ReAct loop + context management |
| Typeahead, autocomplete, search-as-you-type | 10 — Typeahead | Pre-computed prefix index |
| Google Docs, Figma, collaborative editing | 11 — Collaborative | OT / CRDT conflict resolution |
| LeetCode, code execution, CI runner, sandbox | 12 — Code Execution | Sandbox isolation + async job queue |
No matter which template you use, weave these into your answer:
| Tension | Say this |
|---|---|
| Consistency vs Availability | "We accept eventual consistency here because [X]. If we needed strong consistency, we'd use [Y] at the cost of [Z]." |
| Latency vs Throughput | "We optimize for [latency/throughput] by [approach]. The trade-off is [cost]." |
| Read vs Write optimization | "This system is [read/write]-heavy, so we optimize the [read/write] path with [cache/queue], accepting [trade-off] on the [write/read] side." |
| Cost vs Performance | "Pre-computing costs storage but makes reads O(1). The alternative — computing on the fly — is cheaper to store but adds [X]ms per request." |
| Simplicity vs Scalability | "I'd start with [simpler approach]. We only add [complexity] when we hit [specific bottleneck]." |