You can't fix what you can't see. Every production system needs metrics, logs, and traces — the three pillars of observability.
Metrics, Logs, Traces
| Pillar | What it is | Answers | Tools |
|---|---|---|---|
| Metrics | Numeric measurements over time: request rate, error rate, CPU usage, p99 latency. Aggregated, low cost. | "Is something wrong right now? What's the trend?" | Prometheus, Grafana, Datadog, CloudWatch |
| Logs | Timestamped records of discrete events: "User 123 failed login", "Payment processed for order 456". Rich context but expensive at scale. | "What exactly happened? What was the error message?" | ELK Stack (Elasticsearch + Logstash + Kibana), Splunk, CloudWatch Logs |
| Traces | End-to-end journey of a single request through all microservices. Each service adds a span with timing and context. | "Why was this specific request slow? Which service was the bottleneck?" | Jaeger, Zipkin, AWS X-Ray, Datadog APM |
💡 Mention this in every design interview: "I'd add observability: Prometheus for metrics, structured logging to Elasticsearch, and distributed tracing with OpenTelemetry. Alerts on P99 latency, error rate, and saturation metrics. SLOs with error budgets to drive reliability investments." This shows production maturity.
Key metric types:
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing value. Reset on restart. | http_requests_total{status="200"} |
| Gauge | Value that goes up and down. Current state. | active_connections, memory_usage_bytes |
| Histogram | Sample distribution bucketed into ranges. | http_request_duration_seconds{le="0.1"} |
| Summary | Pre-computed percentiles (less flexible than histogram). | p50, p95, p99 latency |
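A rough sketch of how these types look in practice, assuming the Python prometheus_client library (metric names mirror the examples above and the handler is illustrative):

```python
# Sketch using the Python prometheus_client library; metric names are illustrative.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["status"]
)  # counter: only ever goes up (resets on restart)
ACTIVE_CONNECTIONS = Gauge(
    "active_connections", "Currently open connections"
)  # gauge: can go up and down
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0],
)  # histogram: observations bucketed by upper bound (the "le" label)

def handle_request():
    ACTIVE_CONNECTIONS.inc()
    with LATENCY.time():          # records the duration into the histogram
        ...                       # do the actual work here
    REQUESTS.labels(status="200").inc()
    ACTIVE_CONNECTIONS.dec()

if __name__ == "__main__":
    start_http_server(8000)       # exposes /metrics for Prometheus to scrape
```

The same process serves the /metrics endpoint that the Prometheus pull model described next scrapes.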
The Four Golden Signals (Google SRE):
| Signal | What it measures | Alert threshold |
|---|---|---|
| Latency | Time to serve requests (P99, P95, P50) | P99 > SLO (e.g., > 200ms) |
| Traffic | Requests per second (volume) | Sudden drop (outage) or spike (abuse) |
| Errors | Rate of failed requests (5xx, timeouts) | Error rate > 1% |
| Saturation | How "full" the service is (CPU, memory, queue depth) | CPU > 80% for >5 min |
Services expose /metrics endpoint (Prometheus format)
↓
Prometheus scrapes every 15 seconds (pull model)
↓
Stores in time-series database (locally, 15-day default)
↓
Grafana queries Prometheus via PromQL
↓
Alertmanager evaluates alert rules → PagerDuty/Slack
PromQL example:
rate(http_requests_total{status=~"5.."}[5m]) /
rate(http_requests_total[5m]) > 0.01
→ Alert if 5xx error rate > 1% over last 5 minutes
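In production that expression would live in a Prometheus alerting rule, but as a sketch (assuming a Prometheus server reachable at localhost:9090 and the Python requests library) the same PromQL can be evaluated ad hoc via the HTTP API:

```python
# Sketch: evaluate a PromQL expression via Prometheus's HTTP API.
# Assumes a Prometheus server at localhost:9090 and the `requests` library.
import requests

EXPR = (
    'rate(http_requests_total{status=~"5.."}[5m]) '
    "/ rate(http_requests_total[5m]) > 0.01"
)

resp = requests.get(
    "http://localhost:9090/api/v1/query", params={"query": EXPR}, timeout=5
)
resp.raise_for_status()
result = resp.json()["data"]["result"]

# A non-empty result means at least one series currently breaches the 1% threshold.
if result:
    print("5xx error rate above 1% for:", [s["metric"] for s in result])
```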
A trace records the full journey of a single request across all microservices:
Trace for order checkout:
Span: API Gateway 0ms ─────────────────────────── 450ms
Span: Auth Service 2ms ─── 15ms
Span: Order Service 20ms ──────────────── 380ms
Span: DB Query 25ms ── 40ms
Span: Payment Service 50ms ────────────── 300ms
Span: Payment DB 55ms ──── 90ms
Span: Fraud Check 100ms ────── 200ms ← BOTTLENECK
Span: Card Network 210ms ──── 290ms
Span: Inventory Service 310ms ── 370ms
Span: Notification Queue 385ms ── 395ms
Every span has:
- trace_id: unique ID for the entire request
- span_id: unique ID for this operation
- parent_span_id: which span triggered this one
- start_time, end_time: timing
- tags: key-value metadata (user_id, order_id, region)
- logs: events within the span (errors, retries)
OpenTelemetry is the standard for instrumentation — it's vendor-neutral and works with any backend (Jaeger, Zipkin, Datadog, AWS X-Ray).
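A minimal sketch of manual instrumentation with the OpenTelemetry Python SDK (the span and attribute names are illustrative, and auto-instrumentation libraries would normally create most of these spans for you):

```python
# Sketch using the OpenTelemetry Python SDK; span/attribute names are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# In production you would swap ConsoleSpanExporter for an OTLP exporter
# pointing at Jaeger, Zipkin, Datadog, X-Ray, etc.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

def checkout(order_id: str, user_id: str):
    # Parent span; trace_id, span_id, and timings are generated automatically.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order_id", order_id)
        span.set_attribute("user_id", user_id)

        # Child span inherits the current span as its parent_span_id.
        with tracer.start_as_current_span("db_query"):
            ...  # query the orders table here
```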
Logs should be structured (JSON, not free text) for efficient querying:
// Bad (unstructured):
"[2024-01-15 10:30:45] User login failed for user@example.com from 192.168.1.1"
// Good (structured):
{
"timestamp": "2024-01-15T10:30:45Z",
"level": "WARN",
"service": "auth-service",
"event": "login_failed",
"user_email": "user@example.com",
"ip_address": "192.168.1.1",
"reason": "invalid_password",
"attempt_count": 3,
"trace_id": "abc123def456"
}
Log levels:
| Level | Use for |
|---|---|
| DEBUG | Detailed debugging info (disabled in production) |
| INFO | Normal business events (user created, order placed) |
| WARN | Recoverable issues (retry succeeded, rate limit hit) |
| ERROR | Failures requiring investigation (DB connection failed) |
| FATAL | Service cannot continue (corrupted state, missing config) |
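A minimal sketch of emitting structured JSON logs like the example above, using only the standard library logging module (the field names and the hard-coded trace_id are illustrative; the trace_id would normally come from the tracing context):

```python
# Sketch: structured JSON logging with the standard library only.
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "auth-service",
            "event": record.getMessage(),
        }
        entry.update(getattr(record, "fields", {}))  # per-event key-value context
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning(
    "login_failed",
    extra={"fields": {"user_email": "user@example.com", "reason": "invalid_password",
                      "attempt_count": 3, "trace_id": "abc123def456"}},
)
```

In practice most teams reach for a library such as structlog or python-json-logger rather than hand-rolling a formatter, but the shape of the output is the same.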
Alert design rules:
- Alert on symptoms, not causes — alert on "users can't log in", not "CPU is high"
- Alert on SLO breach — if your SLO is 99.9% availability, alert when error rate threatens that budget
- Every alert must be actionable — if you can't fix it, don't page someone at 3am
- Avoid alert fatigue: too many alerts → engineers ignore them → real incidents missed
Error budgets (Google SRE):
SLO: 99.9% availability = 0.1% error budget
Monthly error budget: 43.8 minutes of downtime
Current month error rate: 0.05% → 50% of budget consumed
Alert when: budget consumption rate means you'll exceed the SLO before month end
Action: slow down risky deployments, focus on reliability work
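The arithmetic behind those numbers, as a small sketch (using an average month of about 30.4 days):

```python
# Sketch: error-budget arithmetic for a 99.9% availability SLO.
slo = 0.999                           # 99.9% availability target
minutes_per_month = 30.4 * 24 * 60    # ~43,800 minutes in an average month

error_budget_minutes = (1 - slo) * minutes_per_month
print(f"Monthly error budget: {error_budget_minutes:.1f} min")    # ~43.8 min

observed_error_rate = 0.0005          # 0.05% of requests failing this month
budget_consumed = observed_error_rate / (1 - slo)
print(f"Budget consumed: {budget_consumed:.0%}")                  # 50%
```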
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | Metric measuring service quality | P99 latency = 85ms |
| SLO (Service Level Objective) | Internal target for SLI | P99 latency < 200ms |
| SLA (Service Level Agreement) | External contract with customers | "We guarantee 99.9% uptime; credit if breached" |
- "I'd add the three pillars: Prometheus + Grafana for metrics, structured JSON logs to Elasticsearch, Jaeger for distributed tracing. OpenTelemetry SDK in each service — vendor-neutral."
- "Alerts on the Four Golden Signals: P99 latency, traffic anomalies, error rate, and saturation. Every alert is actionable — if there's nothing to do, it's not a page."
- "Error budgets drive the reliability vs. velocity trade-off. If we've used 80% of this month's error budget, we slow down risky deployments until the month resets."
- "The trace_id propagated through every service call means I can go from a user complaint ('checkout is slow') to the exact slow span in under 30 seconds."