Notification System Design

Design a system that reliably delivers push, SMS, and email notifications to millions of users with guaranteed delivery, smart deduplication, and user preference management.

Why Notification Systems Are Hard

A "notification system" sounds simple — just send a message. But at scale, notification systems fail in predictable ways: duplicate sends, missed deliveries, thundering herd fan-out, and carrier-specific rate limits. A Facebook-scale system might send 100,000+ notifications per second across 3+ channels. Every engineering decision here has a direct user-experience consequence.

The 4 Core Challenges

Challenge	Why it's hard	Solution
Guaranteed delivery	Networks fail, services crash, devices are offline	Persistent queue + retry with exponential backoff
Fan-out at scale	1 event → millions of notifications instantly	Async workers + Kafka fan-out
Deduplication	Retries can cause duplicate sends	Idempotency keys per notification
User preferences	Different channels, quiet hours, opt-outs	Preference service checked before dispatch

Notification Channels & Providers

Each channel has its own delivery API and constraints:

Channel	Provider / Protocol	Typical Latency	Throughput Limit
Push (Mobile)	Apple APNs, Google FCM	Seconds	10K–100K/sec per provider
Email	SendGrid, Mailgun, SES	Seconds–minutes	Rate limited per domain
SMS	Twilio, AWS SNS	1–5 seconds	Carrier-regulated (10DLC)
In-App	WebSocket / SSE to client	Sub-second	Unlimited (internal)

💡 Key Insight: Push notifications require device registration tokens. These tokens expire. Your system must handle token refresh and stale token removal gracefully — FCM returns InvalidRegistration for stale tokens.

High-Level Architecture

User Action / Event
        │
        ▼
  [Notification Service]  ← REST API: POST /notifications
        │
        ▼
  [Kafka Topic: notifications]
        │
   ┌────┼────────────┐
   ▼    ▼            ▼
[Push] [Email]   [SMS]
Worker Worker    Worker
   │    │            │
   ▼    ▼            ▼
[APNs/FCM] [SendGrid] [Twilio]

Why Kafka in the middle?

Decouples the notification triggering from the delivery
Buffers spikes (e.g., a news event → 10M push sends)
Replay ability if a channel worker crashes
Per-channel partitioning — SMS workers don't compete with push workers

Component Deep Dive

1. Notification Service (API Layer)

Accepts notification requests and publishes to Kafka. Enforces rate limits per user per channel.

POST /v1/notifications
{
  "user_id": "u123",
  "type": "order_confirmed",
  "channels": ["push", "email"],
  "data": { "order_id": "ord456", "amount": "$29.99" },
  "idempotency_key": "ord456-confirmed"
}

Idempotency key is critical. If the API caller retries (network timeout), the same key prevents duplicate sends. Store idempotency_key → sent_at in Redis with a 24-hour TTL.

2. User Preference Service

Before dispatching, check:

Opt-in status: has the user enabled this channel?
Quiet hours: no push 11pm–7am in user's timezone
Frequency caps: max 3 marketing emails/day
Notification type preferences: user wants order updates but not promotions

GET /v1/users/{user_id}/preferences
→ {
    "push": { "enabled": true, "quiet_hours": "23:00-07:00" },
    "email": { "enabled": true, "daily_limit": 3 },
    "sms": { "enabled": false }
  }

Store preferences in a fast key-value store (Redis or DynamoDB). Every notification dispatch reads from this service.

3. Channel Workers

Push Worker (APNs/FCM):

Retrieves device tokens from the Device Token Store
Sends HTTP/2 request to APNs or FCM HTTP v1 API
Handles token invalidation: remove stale tokens on 410 Gone from APNs
FCM supports topic messaging (send once, millions receive) — use for broadcast notifications

Email Worker:

Renders HTML templates with user-specific data
Sends via SMTP relay (SendGrid, SES)
Tracks opens/clicks via tracking pixels and redirect URLs
Handles bounces via webhook callbacks from the email provider

SMS Worker:

Subject to carrier throughput limits (10DLC in US: ~10 msgs/sec per campaign)
Queue SMS sends to stay under rate limits — do NOT burst
International? Use Twilio's intelligent routing

4. Delivery Receipt Store

Track status for every notification:

Field	Description
`notification_id`	UUID
`user_id`	Target user
`channel`	push / email / sms
`status`	queued / sent / delivered / failed / bounced
`attempts`	Retry count
`sent_at`	Timestamp
`delivered_at`	Channel-confirmed delivery time

Use a time-series DB (InfluxDB) or write to Cassandra (partition by notification_id).

Fan-out: One Event → Millions of Users

Example: A major sports platform needs to notify all users of Team A when a goal is scored. Team A has 5M followers. How?

Naive approach: Loop through 5M users, send 5M API calls → 5M database reads for preferences → kills the DB.

Production approach:

[Goal Event] → Kafka
                  │
                  ▼
          [Fan-out Service]
          Reads follower list in batches (1K at a time)
          Filters by user preferences (batch Redis multi-get)
          Publishes per-user notification jobs to Kafka
                  │
             ─────────────
             │           │
         [Worker 1]  [Worker 2] ... N workers
         (push)       (email)

Key optimization: Use a pre-computed fan-out (like Twitter's approach) — when a user gains followers, incrementally update follower lists rather than computing at notification time.

For smaller fan-outs (< 10K users): Compute fan-out on-demand.

Retry Strategy & Dead Letter Queue

Transient failures (network blip, provider 503) should be retried. Permanent failures (invalid token, email bounce) should NOT retry.

Retry Schedule:
  Attempt 1: immediate
  Attempt 2: +30 seconds
  Attempt 3: +2 minutes
  Attempt 4: +10 minutes
  Attempt 5: → Dead Letter Queue

Dead Letter Queue (DLQ): Failed notifications land here for manual inspection, alerting, and eventual escalation (e.g., try a different channel).

⚠️ Warning: Retrying SMS messages can cause significant cost if not handled carefully. Always check for permanent error codes before retrying.

Scalability Numbers

Back-of-envelope for 100M daily active users:

3 notifications/user/day = 300M notifications/day
300M / 86,400 seconds = ~3,500 notifications/second average
Peak (prime time, major event) = 10× average = 35,000 /sec

For 35K/sec:

Push: FCM handles 10K/sec per connection → 4 push worker instances
Email: 35K/day budget on free tier → need paid plan, multiple IPs
SMS: Most expensive channel, rate-limited, should be rare

Storage (Delivery Receipts):

300M records/day × 200 bytes = ~60 GB/day
Use Cassandra or S3 for long-term storage, Redis for recent lookups

Interview Answer Framework

When asked "Design a Notification System":

Clarify channels: push only? email? SMS? in-app?
Clarify scale: how many users, notifications/day?
Mention idempotency: retries must not cause duplicates
Describe the queue: Kafka between trigger and delivery
Address fan-out: how do you handle 1 event → millions?
Preference service: respect user settings and quiet hours
Failure handling: DLQ, retry strategy, channel fallback

💡 Pro Tip: Mention the token management problem for push — device tokens expire and must be cleaned up. This shows real-world depth.

Common Failure Modes

Failure	Impact	Prevention
Duplicate sends	Bad user experience	Idempotency keys in Redis
Stale push tokens	Silent failures	Remove tokens on provider error
Fan-out storm	DB overload on large events	Async Kafka fan-out with batching
Rate limit breach	SMS/email provider suspension	Per-channel rate limiters
Timezone bugs	Sending during user's quiet hours	Always store timezone, compute server-side

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Notification System Design

Why Notification Systems Are Hard

Notification Channels & Providers

High-Level Architecture

Component Deep Dive

1. Notification Service (API Layer)

2. User Preference Service

3. Channel Workers

4. Delivery Receipt Store

Fan-out: One Event → Millions of Users

Retry Strategy & Dead Letter Queue

Scalability Numbers

Interview Answer Framework

Common Failure Modes

FilesExpand file tree

notification-system.md

Latest commit

History

notification-system.md

File metadata and controls

Notification System Design

Why Notification Systems Are Hard

Notification Channels & Providers

High-Level Architecture

Component Deep Dive

1. Notification Service (API Layer)

2. User Preference Service

3. Channel Workers

4. Delivery Receipt Store

Fan-out: One Event → Millions of Users

Retry Strategy & Dead Letter Queue

Scalability Numbers

Interview Answer Framework

Common Failure Modes