In a distributed system, one failing service can take down everything else. Imagine our payment service is slow. Every request to it hangs for 30 seconds. Our order service calls payment and waits. Its threads get consumed. Now the order service can’t handle any requests either. Soon, the whole system is frozen. This is a cascading failure, and it’s one of the scariest things in distributed systems.
Circuit breakers and bulkheads are patterns that prevent this.
Circuit Breaker Pattern
Think of it like an electrical circuit breaker in our house. When there’s a power surge, the breaker trips and cuts the circuit — protecting everything from frying. Same idea in software.
A circuit breaker wraps calls to an external service and monitors for failures. When failures cross a threshold, the breaker “trips” and stops making calls entirely. Instead of waiting 30 seconds for a timeout, we fail instantly. This gives the downstream service time to recover without being hammered by requests.
The Three States
Closed: Everything is normal. Requests pass through to the downstream service while the breaker counts failures. As long as failures stay below the threshold, nothing changes.
Open: Too many failures happened, so the breaker trips. Every request fails immediately or returns a fallback response, and no calls reach the downstream service. This state lasts for a configured timeout (e.g., 30 seconds).
Half-Open: After the timeout, the breaker lets a limited number of test requests through, often just one. If the test succeeds (or enough consecutive tests succeed, when a success threshold is configured), the breaker closes and traffic returns to normal. If a test fails, the breaker opens again for another timeout period.
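The three states can be sketched as a small state machine. This is a minimal illustration, not a production implementation: the class name, the `clock` parameter, and the choice to count consecutive failures and to close after a single successful trial call are all assumptions made for the example, and the class is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock  # injectable so tests can fake time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: let this one call through as a trial.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success closes the breaker and resets the failure count.
        self.state = State.CLOSED
        self.failures = 0

    def _on_failure(self):
        self.failures += 1
        # A failed trial call reopens at once; otherwise trip at the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A caller would wrap each downstream call, for example `breaker.call(recommendation_service.get, user_id)`, and catch `CircuitOpenError` to serve a fallback.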
What Happens When the Circuit Is Open?
We don’t just show errors. We use fallbacks:
- Return cached data (slightly stale but better than nothing)
- Return a default value (“0 items in cart” instead of crashing)
- Queue the request for later processing
- Show a degraded experience (“Recommendations unavailable”)
# Pseudocode for a circuit breaker
def get_recommendations(user_id):
    if circuit_breaker.is_open():
        return cached_recommendations(user_id)  # fallback
    try:
        result = recommendation_service.get(user_id)
        circuit_breaker.record_success()
        return result
    except TimeoutError:
        circuit_breaker.record_failure()
        return cached_recommendations(user_id)  # fallback
Configuration
Typical circuit breaker settings:
- Failure threshold: 5 failures in 60 seconds → trip
- Open duration: 30 seconds before trying half-open
- Half-open test: Allow 1 request through at a time
- Success threshold: 3 consecutive successes in half-open → close
These numbers aren’t magic. We tune them based on our service’s behavior.
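A threshold such as "5 failures in 60 seconds" implies a sliding time window rather than a simple running total. One way to track it, sketched here with hypothetical names (`FailureWindow`, `should_trip`) and an injectable clock, is to keep failure timestamps in a deque and discard the ones that have aged out:

```python
import time
from collections import deque


class FailureWindow:
    """Counts failures that happened within the last `window_seconds`."""

    def __init__(self, threshold=5, window_seconds=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can fake time
        self._timestamps = deque()

    def record_failure(self):
        self._timestamps.append(self.clock())

    def should_trip(self):
        cutoff = self.clock() - self.window_seconds
        # Drop failures that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

Real libraries often offer count-based windows or failure-rate thresholds as well; this sketch only covers the time-based count described in the list above.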
Bulkhead Pattern
The bulkhead pattern isolates different parts of our system so a failure in one doesn’t sink the rest. The name comes from ships — a ship’s hull is divided into watertight compartments (bulkheads). If one compartment floods, the others stay dry and the ship stays afloat.
In software, we create isolation boundaries:
Thread pool bulkhead: Each downstream service gets its own thread pool. If the payment service is slow and consumes all its threads, the order-processing thread pool is untouched.
┌─────────────────────────────────────┐
│         Application Server          │
│                                     │
│  ┌──────────────┐ ┌──────────────┐  │
│  │ Payment Pool │ │ Order Pool   │  │
│  │ 10 threads   │ │ 20 threads   │  │
│  │ ⚠️ 10/10     │ │ ✅ 5/20      │  │
│  │ (saturated)  │ │ (healthy)    │  │
│  └──────────────┘ └──────────────┘  │
│                                     │
│  Payment is broken but orders       │
│  keep processing normally.          │
└─────────────────────────────────────┘
Connection pool bulkhead: Same idea but with database connections. One slow query pattern doesn’t exhaust connections needed by other queries.
Process bulkhead: Run different features as separate services or containers. If the image processing service OOMs, the authentication service keeps running.
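The thread pool bulkhead described above can be sketched with Python's standard `concurrent.futures`. The pool sizes, function names, and the `payment_recovered` event are made up for the illustration; the event simply stands in for a payment service that hangs until it recovers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One executor per downstream dependency; sizes are illustrative.
payment_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="payment")
order_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="order")

payment_recovered = threading.Event()


def call_payment(order_id):
    # Stand-in for a slow payment call: blocks until the service "recovers".
    payment_recovered.wait(timeout=5)
    return f"payment for {order_id} done"


def process_order(order_id):
    # Stand-in for order work that does not depend on payment.
    return f"order {order_id} processed"


def demo():
    # Saturate the payment pool: 2 calls running, 3 more queued behind them.
    stuck = [payment_pool.submit(call_payment, i) for i in range(5)]
    # The order pool has its own threads, so this does not wait on payment.
    order_result = order_pool.submit(process_order, 42).result(timeout=1)
    payment_recovered.set()  # let the blocked payment calls finish
    payment_results = [f.result(timeout=5) for f in stuck]
    return order_result, payment_results
```

Note that `ThreadPoolExecutor` queues extra work without limit, so a fuller bulkhead would probably also cap the queue or reject calls when the pool is saturated.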
Retry with Exponential Backoff
When a request fails, our first instinct is to retry immediately. Bad idea. If the service is overloaded, 1,000 clients all retrying instantly makes it worse. That’s a retry storm.
Instead, we use exponential backoff:
Attempt 1: wait 1 second
Attempt 2: wait 2 seconds
Attempt 3: wait 4 seconds
Attempt 4: wait 8 seconds
(give up after max retries)
We also add jitter — a random offset so all clients don’t retry at the exact same time:
Attempt 1: wait 1s + random(0-500ms)
Attempt 2: wait 2s + random(0-500ms)
Attempt 3: wait 4s + random(0-500ms)
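Here is a minimal sketch of that schedule in code. The function names and the injectable `sleep` and `rng` parameters are assumptions made so the example can be tested without real waiting; the defaults mirror the numbers above (1 second base, doubling, up to 500 ms of jitter).

```python
import random
import time


def backoff_delays(max_retries=4, base=1.0, factor=2.0, max_jitter=0.5, rng=random.random):
    """Yield the wait before each retry: base * factor**n plus random jitter."""
    for attempt in range(max_retries):
        yield base * factor ** attempt + rng() * max_jitter


def retry_with_backoff(func, max_retries=4, retry_on=(TimeoutError,),
                       sleep=time.sleep, **delay_opts):
    """Call func; on a retryable error, wait and try again, up to max_retries retries."""
    delays = backoff_delays(max_retries=max_retries, **delay_opts)
    while True:
        try:
            return func()
        except retry_on:
            delay = next(delays, None)
            if delay is None:  # retries exhausted: give up
                raise
            sleep(delay)
```

Usage might look like `retry_with_backoff(lambda: client.get(url))`, where `client` is whatever HTTP client we already use. Some systems apply "full jitter" (a random wait between zero and the backoff value) instead of adding a fixed-range offset; the additive form here matches the schedule in the text.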
When NOT to Retry
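- Most 4xx errors (400, 401, 403, 404): The request itself is bad, so retrying won’t fix it. The usual exceptions are 408 Request Timeout and 429 Too Many Requests, which can succeed later.
- Non-idempotent operations: If we POST a payment and it times out, we don’t know whether the charge went through, so retrying might charge twice. Only retry operations that are safe to repeat.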
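These rules can be folded into a small predicate that a retry loop consults before trying again. This is a sketch under assumptions: the function name is hypothetical, the idempotent method list follows HTTP semantics (RFC 9110), and treating 408 and 429 as retryable is a common convention adopted for this example.

```python
# Methods defined as idempotent by HTTP semantics (RFC 9110).
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"})

# 4xx codes that signal a temporary condition (an assumption of this sketch).
RETRYABLE_4XX = frozenset({408, 429})


def should_retry(method, status):
    """Return True only when a retry is both safe and likely to help."""
    if method.upper() not in IDEMPOTENT_METHODS:
        return False  # repeating it could duplicate side effects
    if status in RETRYABLE_4XX:
        return True  # temporary client-side condition
    return 500 <= status <= 599  # server-side failure on a repeatable request
```

For example, `should_retry("POST", 503)` is false because the POST may already have taken effect, while `should_retry("GET", 503)` is true.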
Combining All Three
In practice, we use circuit breakers, bulkheads, and retries together:
- Bulkhead isolates the failure so it doesn’t spread
- Retry with backoff handles transient errors
- Circuit breaker stops us from hammering a dead service
Request
→ Bulkhead (uses dedicated thread pool)
→ Circuit Breaker (checks if service is healthy)
→ Retry with backoff (handles transient failures)
→ Actual service call
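The layering above can be sketched as three wrappers applied from the inside out. Everything here is illustrative: the wrapper names are hypothetical, the bulkhead is a semaphore that caps concurrent calls rather than a dedicated thread pool, and the breaker is deliberately simplified (consecutive failures, one trial call after the timeout, not thread-safe).

```python
import threading
import time


class BulkheadFull(Exception):
    pass


class CircuitOpen(Exception):
    pass


def with_bulkhead(func, max_concurrent=10):
    slots = threading.BoundedSemaphore(max_concurrent)

    def wrapper(*args, **kwargs):
        if not slots.acquire(blocking=False):
            raise BulkheadFull("no free slot for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            slots.release()
    return wrapper


def with_circuit_breaker(func, failure_threshold=5, open_seconds=30.0, clock=time.monotonic):
    state = {"failures": 0, "opened_at": None}

    def wrapper(*args, **kwargs):
        if state["opened_at"] is not None:
            if clock() - state["opened_at"] < open_seconds:
                raise CircuitOpen("failing fast")
            # Half-open: allow a trial call; one more failure reopens.
            state["opened_at"] = None
            state["failures"] = failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            state["failures"] += 1
            if state["failures"] >= failure_threshold:
                state["opened_at"] = clock()
            raise
        state["failures"] = 0
        return result
    return wrapper


def with_retry(func, max_retries=3, base=1.0, sleep=time.sleep):
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries + 1):
            try:
                return func(*args, **kwargs)
            except TimeoutError:
                if attempt == max_retries:
                    raise
                sleep(base * 2 ** attempt)  # exponential backoff, no jitter here
    return wrapper
```

Composing them as `with_bulkhead(with_circuit_breaker(with_retry(call_service)))` matches the order in the diagram. One consequence of this order is that the breaker only counts a failure after the retries for that request are exhausted.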
Libraries like Resilience4j (Java) and Polly (.NET) implement all three patterns. Hystrix (Java, now in maintenance mode but influential) popularized circuit breakers and thread pool bulkheads.
Key Takeaway
In short, circuit breakers stop us from calling a broken service, so we fail fast instead of waiting on timeouts. Bulkheads isolate failures so one broken dependency doesn’t sink everything. Retries with exponential backoff and jitter handle temporary hiccups without overwhelming the system. Together, these three patterns are how we build systems that degrade gracefully instead of collapsing entirely when things go wrong.