Distributed Locking

Ensuring only one process performs a task at a time (e.g., processing a payment, running a cron job) in a cluster of servers.

Why Local Locks Don't Work

Standard mutexes (synchronized, threading.Lock) only work within one process. In a distributed system with 50 servers, you need an external system to coordinate the lock.

The problem without distributed locking:

Server 1 checks: "Is this payment already processed?" → No
Server 2 checks: "Is this payment already processed?" → No (race condition!)

Server 1 processes payment → charges $100
Server 2 processes payment → charges $100 (duplicate!)

Total charged: $200 instead of $100

Redis Redlock

Redlock is the standard algorithm for distributed locking with Redis. To acquire a lock:

Client generates a unique token (UUID).
Tries to write key lock:resource_id with value token and a TTL (e.g., 10s) using SET NX PX.
If successful, it holds the lock. It must finish work and release (delete key) within 10s.
If it takes longer, the lock auto-expires (safety). To prevent this, the client must use a "watchdog" thread to extend the TTL while working.
Release: Use a Lua script to check if the value matches the token before deleting. This prevents deleting a lock that has already expired and been acquired by someone else.

# Acquire lock
SET lock:payment:order_123 <unique_token> NX PX 10000
# NX = only set if Not eXists (atomic check-and-set)
# PX 10000 = expire in 10 seconds

# Release lock (Lua script — atomic)
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end

Why the Lua script for release?

Without Lua script:
1. GET lock:payment:order_123 → returns "our_token" 
   [1 second pause here — GC, network, etc.]
2. Another process acquires the lock (TTL expired)
3. DEL lock:payment:order_123 → DELETES THEIR LOCK! Bug!

With Lua script:
- GET and conditional DEL are atomic — no race condition

Multi-Node Redlock (5 Redis Nodes)

The single-node Redis lock has a problem: if Redis crashes between acquire and release, the lock is lost. For critical locks, use multi-node Redlock:

Get current time T1
Try to acquire lock on all 5 Redis nodes sequentially, each with a small timeout (50ms)
If lock acquired on 3+ nodes (majority) within valid time window: lock is acquired
Lock validity time = TTL - (T2 - T1) [elapsed time during acquisition]
Release: delete the key from all 5 nodes

The quorum ensures that even if one Redis node fails, the majority still holds the lock.

Fencing Tokens — The Safety Net

What if a process pauses (GC pause) for 20 seconds? Its lock expires. Another worker acquires it. The first worker wakes up and writes to the DB. Now two workers are writing simultaneously (split brain). Distributed locks alone are unsafe for critical data.

Solution: Fencing Tokens. The lock service (Zookeeper) returns a monotonically increasing number (token) with each lock. The database rejects any write with a token older than the highest one it has seen.

Worker A gets Lock (Token = 33) → GC Pause...
Worker B gets Lock (Token = 34) → Writes to DB (DB records "last_token=34")
Worker A wakes up → Tries to write with Token 33
DB check: 33 < 34 → REJECT write.

This is called the fencing token pattern and is the only safe way to handle distributed critical sections when processes can pause unexpectedly.

Zookeeper for Distributed Coordination

For stronger guarantees than Redis, use Zookeeper (or etcd):

Leader election with Zookeeper:
1. All servers try to create an ephemeral sequential znode: /election/node-NNNNNN
2. Each server lists children of /election/
3. Server with lowest sequence number = leader
4. Others watch the node just before them (not the leader)
5. If predecessor dies, the next-lowest takes over

Lock with Zookeeper:
1. Create ephemeral sequential znode: /locks/payment_123/lock-NNNNNN
2. If yours is lowest: you have the lock
3. Otherwise: watch the preceding znode (wait for it to be deleted)
4. When preceding lock released: your turn → you have the lock
5. When you're done: delete your znode → wakes up the next waiter

Ephemeral nodes auto-delete when the session ends (server dies/disconnects) — this prevents deadlocks from crashed lock holders.