A cloud architecture is not production-ready just because it deploys. It needs clear failure tolerance, measurable health, recoverability, and cost control that still works when traffic or experiments scale up.
The first question is not "how available do we want to be?" but "what failures must we survive?"
| Failure scope | Typical response |
|---|---|
| Instance failure | Replace unhealthy node, reschedule container, retry safely |
| AZ failure | Run across multiple AZs with redundant app and data layers |
| Region failure | DR plan or multi-region architecture depending on business need |
| Dependency failure | Timeouts, retries, fallbacks, circuit breakers, graceful degradation (sketch below) |
Most systems should start with multi-AZ high availability in one region. Multi-region is usually a business decision, not a default engineering reflex.
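The dependency-failure row above is the one that most often turns into code. Below is a minimal sketch of bounded retries behind a circuit breaker; the thresholds and the `call_with_fallback` helper are illustrative, not taken from any particular library.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency until a cooldown elapses."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_seconds:
            # Reset and try again (a real breaker would half-open first).
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

def call_with_fallback(breaker, call, fallback, retries=2):
    """Bounded retries behind a breaker; degrade gracefully if it is open."""
    if not breaker.allow():
        return fallback()
    for attempt in range(retries + 1):
        try:
            result = call()  # the call itself should enforce a timeout
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback()
```

In production you would normally reach for a maintained implementation (such as `pybreaker` in Python) rather than hand-rolling this.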
Interviewers love asking for a backup strategy, but the right answer depends on recovery targets.
| Term | Meaning | Example |
|---|---|---|
| RTO | Recovery Time Objective: how long you can be down | "We must recover within 30 minutes" |
| RPO | Recovery Point Objective: how much data loss is acceptable | "We can lose at most 5 minutes of data" |
- low RTO + low RPO often pushes you toward replication and automated failover
- higher RTO/RPO may make periodic backups sufficient (see the check after this list)
- databases, object storage, and queues can all have different recovery requirements
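To see why targets drive strategy: with periodic snapshots, the worst-case data loss is roughly one snapshot interval, so the interval must sit at or below the RPO. A toy check with hypothetical numbers:

```python
def worst_case_data_loss_minutes(snapshot_interval_minutes):
    # A crash just before the next snapshot loses roughly
    # one full interval of writes.
    return snapshot_interval_minutes

def meets_rpo(snapshot_interval_minutes, rpo_minutes):
    return worst_case_data_loss_minutes(snapshot_interval_minutes) <= rpo_minutes

print(meets_rpo(60, 5))  # hourly snapshots vs a 5-minute RPO -> False
print(meets_rpo(5, 5))   # 5-minute snapshots -> True, though at that
                         # cadence continuous replication is usually simpler
```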
Always make backup strategy explicit:
- snapshot schedule
- retention window
- restore testing (see the drill sketch after this list)
- cross-region copy if needed
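Restore testing is the item most often skipped, and it can be automated. A sketch of a scheduled drill assuming `boto3` and Amazon RDS; the instance identifiers are placeholders, and a real drill would follow the restore with smoke queries and teardown of the throwaway instance.

```python
import boto3

rds = boto3.client("rds")

def latest_snapshot(db_instance_id):
    """Return the most recent completed automated snapshot."""
    snaps = rds.describe_db_snapshots(
        DBInstanceIdentifier=db_instance_id,
        SnapshotType="automated",
    )["DBSnapshots"]
    done = [s for s in snaps if s["Status"] == "available"]
    return max(done, key=lambda s: s["SnapshotCreateTime"])

def run_restore_drill(db_instance_id, drill_instance_id):
    """Restore the latest snapshot into a throwaway instance."""
    snap = latest_snapshot(db_instance_id)
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=drill_instance_id,
        DBSnapshotIdentifier=snap["DBSnapshotIdentifier"],
    )
    return snap["DBSnapshotIdentifier"]
```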
Cloud incidents are much harder to manage if the system is opaque. At minimum you want:
- metrics for traffic, latency, errors, saturation
- logs for request context, failures, security-sensitive actions
- traces for multi-service latency and dependency debugging
- alerts tied to SLOs, not dashboards nobody watches (see the burn-rate sketch below)
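Error-budget burn alerting is worth making concrete. The sketch below follows the multiwindow burn-rate approach from Google's SRE workbook; the 14.4x threshold and the window sizes are the commonly cited starting points, not fixed rules.

```python
def burn_rate(error_ratio, slo):
    """How fast the error budget is being spent; 1.0 = exactly on budget."""
    allowed = 1.0 - slo
    return error_ratio / allowed

def should_page(err_1h, err_5m, slo=0.999):
    # Fast-burn page: both a long and a short window must be hot,
    # so a brief blip does not wake anyone up.
    return burn_rate(err_1h, slo) > 14.4 and burn_rate(err_5m, slo) > 14.4

# A 99.9% SLO allows 0.1% errors; a sustained 2% error rate burns
# budget at 20x the sustainable rate -> page.
print(should_page(err_1h=0.02, err_5m=0.02))  # True
```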
| Layer | Metrics |
|---|---|
| API tier | QPS, p95/p99 latency, error rate, concurrency (see sketch below) |
| Workers | queue lag, processing duration, retry rate |
| Database | CPU, storage, connection pool, replication lag, slow queries |
| Platform | autoscaling events, unhealthy targets, deployment failures |
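Most of the API-tier row reduces to a counter and a histogram. A sketch assuming the `prometheus_client` Python package; the metric names and the `do_work` handler are placeholders.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Requests by status", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency")

def do_work(request):
    return "ok"  # placeholder for real application logic

def handle(request):
    start = time.monotonic()
    try:
        response = do_work(request)
        REQUESTS.labels(status="200").inc()
        return response
    except Exception:
        REQUESTS.labels(status="500").inc()
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)

start_http_server(9100)  # exposes /metrics for Prometheus to scrape
```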
Cloud cost problems are usually architecture problems in disguise. Typical symptoms:
- overprovisioned compute
- idle development clusters
- excessive cross-zone or cross-region traffic
- high NAT / egress usage
- chatty microservices
- retaining too much hot storage
- expensive always-on GPU or inference capacity
The standard levers:
- right-size compute after measuring actual usage
- scale workers on queue depth instead of a fixed fleet size (see the sketch after this list)
- use reserved/committed capacity for stable workloads
- use spot/preemptible capacity for interruptible jobs
- move cold data to cheaper storage tiers
- tag resources so cost is attributable to an owner
- create budget alarms before finance finds the problem for you
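Scaling workers on queue depth is a one-line control calculation once you know per-worker throughput. A sketch with assumed numbers:

```python
import math

def desired_workers(queue_depth, per_worker_rate, drain_target_s,
                    min_workers=1, max_workers=50):
    """Enough workers to drain the current backlog within the target,
    clamped so the fleet can neither vanish nor run away."""
    needed = math.ceil(queue_depth / (per_worker_rate * drain_target_s))
    return max(min_workers, min(max_workers, needed))

# 12,000 queued jobs, 5 jobs/s per worker, drain within 10 minutes:
# 12000 / (5 * 600) = 4 workers instead of a fixed fleet of, say, 20.
print(desired_workers(12_000, per_worker_rate=5, drain_target_s=600))  # 4
```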
For a serious but not overengineered production system:
- Run the service in one region across multiple AZs
- Define SLOs and alert on error budget burn, not only raw CPU
- Set up backups, retention windows, and restore drills
- Use graceful degradation when dependencies fail
- Add cost dashboards and budget alarms from the start
- Justify multi-region with explicit RTO/RPO, revenue impact, or compliance needs
| Failure mode | What happens | Mitigation |
|---|---|---|
| Multi-region too early | Huge cost and ops burden without real business value | Start multi-AZ unless requirements say otherwise |
| No restore testing | Backups exist on paper but fail in reality | Scheduled restore drills and runbooks |
| Alert noise | On-call ignores pages | SLO-based alerts, deduplication, ownership |
| Dependency hard failure | One service outage cascades through the stack | Timeouts, retries, circuit breakers, fallback responses |
| Unowned spend growth | Costs rise with no team accountable | Tagging, chargeback/showback, budget alerts |
Metrics worth tracking:
- Availability and SLO attainment
- Error budget burn
- Mean time to detect / mean time to recover (MTTD / MTTR)
- Replication lag and backup success rate
- Restore test success
- Unit economics such as cost per request, cost per active user, cost per job, or cost per inference (sketched below)
These metrics tie reliability and cloud spend back to product reality.
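Unit economics are simple arithmetic, but only once spend is attributable to a component (hence the tagging above). A hypothetical example:

```python
def cost_per_unit(monthly_cost_usd, monthly_units):
    return monthly_cost_usd / monthly_units

# Hypothetical numbers: $8,400/month of inference capacity
# serving 1.2M inferences -> $0.007 per inference.
print(f"${cost_per_unit(8_400, 1_200_000):.4f} per inference")
```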
"I would begin with a single-region, multi-AZ deployment because it covers common infrastructure failures without the full cost and complexity of active-active multi-region. I would define SLOs for availability and latency, instrument the service with metrics, logs, and traces, and alert on error-budget burn rather than noisy infrastructure-only signals. For recovery, I would make RTO and RPO explicit, set backup and restore procedures, and test restores regularly. I would also add cost guardrails such as tagging, budget alerts, and right-sizing so the platform remains financially sustainable as traffic grows."