Circuit Breaker and Bulkhead Patterns


In a distributed system, one failing service can take down everything else. Imagine our payment service is slow. Every request to it hangs for 30 seconds. Our order service calls payment and waits. Its threads get consumed. Now the order service can’t handle any requests either. Soon, the whole system is frozen. This is a cascading failure, and it’s one of the scariest things in distributed systems.

Circuit breakers and bulkheads are patterns that prevent this.

Circuit Breaker Pattern

Think of it like an electrical circuit breaker in our house. When there’s a power surge, the breaker trips and cuts the circuit — protecting everything from frying. Same idea in software.

A circuit breaker wraps calls to an external service and monitors for failures. When failures cross a threshold, the breaker “trips” and stops making calls entirely. Instead of waiting 30 seconds for a timeout, we fail instantly. This gives the downstream service time to recover without being hammered by requests.

The Three States

Circuit Breaker State Machine

CLOSED: normal operation; requests flow through; failures are counted
  failure threshold exceeded → OPEN
OPEN: requests fail fast; no calls to the service; waiting for the timeout
  timeout expires → HALF-OPEN
HALF-OPEN: one request is let through
  success → CLOSED
  failure → OPEN

Closed: Everything is normal. Requests pass through to the downstream service, and the breaker counts failures. As long as failures stay below the threshold, nothing changes.

Open: Too many failures happened, so the breaker trips. Every request fails immediately and returns a fallback response. No calls are made to the downstream service at all. This state lasts for a configured timeout (e.g., 30 seconds).

Half-Open: After the timeout, the breaker lets a trial request through (one, in the simplest version). If it succeeds, the breaker closes and normal operation resumes; many implementations require several consecutive successes before closing. If it fails, the breaker opens again for another timeout period.
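To make the transitions concrete, here is a minimal sketch of the state machine in Python. It rests on some assumptions that are not from any particular library: the class and method names are hypothetical, it counts consecutive failures rather than failures in a time window, it closes after a single successful trial, and it is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Single-threaded sketch of the three-state breaker (hypothetical API)."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock            # injectable so tests can fake time
        self.state = State.CLOSED
        self.failures = 0             # consecutive failures seen so far
        self.opened_at = 0.0

    def allow_request(self):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_seconds:
                return False          # still cooling down: fail fast
            self.state = State.HALF_OPEN  # timeout expired: allow a trial call
        return True

    def record_success(self):
        # A success (including a half-open trial) closes the breaker.
        self.state = State.CLOSED
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        # A failed trial reopens immediately; otherwise trip at the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

The caller asks `allow_request()` before each call and reports the outcome afterwards. Passing a fake `clock` makes the timeout behavior easy to test without sleeping.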

What Happens When the Circuit Is Open?

We don’t just show errors. We use fallbacks:

  • Return cached data (slightly stale but better than nothing)
  • Return a default value (“0 items in cart” instead of crashing)
  • Queue the request for later processing
  • Show a degraded experience (“Recommendations unavailable”)
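
# Sketch of a circuit breaker with a cached fallback.
# circuit_breaker, recommendation_service, and cached_recommendations
# are placeholders for whatever our application provides.
def get_recommendations(user_id):
    if circuit_breaker.is_open():
        return cached_recommendations(user_id)  # fail fast: no remote call

    try:
        result = recommendation_service.get(user_id)
    except (TimeoutError, ConnectionError):
        # Timeouts and connection errors count toward the trip threshold.
        circuit_breaker.record_failure()
        return cached_recommendations(user_id)  # fallback
    circuit_breaker.record_success()
    return result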

Configuration

Typical circuit breaker settings:

  • Failure threshold: 5 failures in 60 seconds → trip
  • Open duration: 30 seconds before trying half-open
  • Half-open test: Allow 1 trial request through at a time
  • Success threshold: 3 consecutive successful trials in half-open → close (with a threshold of 1, a single success closes the breaker)

These numbers aren’t magic. We tune them based on our service’s behavior.
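A rule like "5 failures in 60 seconds" needs a sliding window rather than a plain counter. One way to sketch it, assuming a hypothetical `FailureWindow` helper that stores failure timestamps in a deque:

```python
import time
from collections import deque


class FailureWindow:
    """Tracks failures in a sliding time window (hypothetical helper)."""

    def __init__(self, threshold=5, window_seconds=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock
        self.timestamps = deque()

    def record_failure(self):
        """Record one failure; return True if the breaker should trip."""
        now = self.clock()
        self.timestamps.append(now)
        # Drop failures that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.threshold
```

Old failures expire on their own, so a service that fails once an hour never trips the breaker, while a burst of failures does.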

Bulkhead Pattern

The bulkhead pattern isolates different parts of our system so a failure in one doesn’t sink the rest. The name comes from ships — a ship’s hull is divided into watertight compartments (bulkheads). If one compartment floods, the others stay dry and the ship stays afloat.

In software, we create isolation boundaries:

Thread pool bulkhead: Each downstream service gets its own thread pool. If the payment service is slow and consumes all its threads, the order-processing thread pool is untouched.

┌─────────────────────────────────────┐
│         Application Server          │
│                                     │
│  ┌──────────────┐ ┌──────────────┐  │
│  │ Payment Pool │ │  Order Pool  │  │
│  │  10 threads  │ │  20 threads  │  │
│  │  ⚠️ 10/10    │ │  ✅ 5/20    │  │
│  │  (saturated) │ │  (healthy)   │  │
│  └──────────────┘ └──────────────┘  │
│                                     │
│  Payment is broken but orders       │
│  keep processing normally.          │
└─────────────────────────────────────┘
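The diagram above can be sketched with Python's standard `concurrent.futures` module. The pool names and sizes mirror the diagram; the `run_in_bulkhead` helper is a hypothetical name. Two caveats apply to this sketch: `ThreadPoolExecutor` queues excess work without limit, and the timeout only stops the caller from waiting, it does not cancel the running task.

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream dependency (sizes taken from the diagram).
pools = {
    "payment": ThreadPoolExecutor(max_workers=10, thread_name_prefix="payment"),
    "order": ThreadPoolExecutor(max_workers=20, thread_name_prefix="order"),
}


def run_in_bulkhead(name, fn, *args, timeout=2.0):
    """Run fn on the named dependency's pool, waiting at most `timeout` seconds.

    Raises a timeout error from Future.result() if the result is not ready
    in time; the task itself keeps running on its pool thread.
    """
    future = pools[name].submit(fn, *args)
    return future.result(timeout=timeout)
```

Even if every payment thread is stuck, `run_in_bulkhead("order", ...)` still finds free workers because the two pools share nothing.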

Connection pool bulkhead: Same idea but with database connections. One slow query pattern doesn’t exhaust connections needed by other queries.
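The same cap can be enforced without a dedicated pool by limiting how many callers may use a shared resource at once. The sketch below uses a semaphore for that; `SemaphoreBulkhead` and `BulkheadFullError` are hypothetical names, and rejecting immediately when full (rather than waiting) is a design choice we are assuming here.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when all slots for this kind of work are in use."""


class SemaphoreBulkhead:
    """Caps concurrent use of a shared resource (hypothetical helper)."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Reject immediately instead of queueing behind slow work.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot")
        try:
            yield
        finally:
            self._slots.release()
```

For example, wrapping slow reporting queries in `with reporting_bulkhead.slot():` (where `reporting_bulkhead = SemaphoreBulkhead(5)`) keeps them from ever holding more than five connections, leaving the rest for other queries.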

Process bulkhead: Run different features as separate services or containers. If the image processing service OOMs, the authentication service keeps running.

Retry with Exponential Backoff

When a request fails, our first instinct is to retry immediately. Bad idea. If the service is overloaded, 1,000 clients all retrying instantly makes it worse. That’s a retry storm.

Instead, we use exponential backoff:

Attempt 1: wait 1 second
Attempt 2: wait 2 seconds
Attempt 3: wait 4 seconds
Attempt 4: wait 8 seconds
(give up after max retries)

We also add jitter — a random offset so all clients don’t retry at the exact same time:

Attempt 1: wait 1s + random(0-500ms)
Attempt 2: wait 2s + random(0-500ms)
Attempt 3: wait 4s + random(0-500ms)
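The schedule above translates into a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, the jitter is a uniform random value up to 500 ms, and every exception is treated as retryable, which real code should narrow to transient errors.

```python
import random
import time


def backoff_delays(max_retries=4, base=1.0, factor=2.0, max_jitter=0.5,
                   rng=random.random):
    """Yield the wait before each retry: base * factor**n plus random jitter."""
    for attempt in range(max_retries):
        yield base * factor ** attempt + rng() * max_jitter


def call_with_retry(fn, max_retries=4, sleep=time.sleep, rng=random.random):
    """Call fn, retrying with exponential backoff; re-raise after max_retries."""
    for delay in backoff_delays(max_retries, rng=rng):
        try:
            return fn()
        except Exception:       # assumption: narrow this to transient errors
            sleep(delay)
    return fn()                 # final attempt: let the error propagate
```

The `sleep` and `rng` parameters exist only so the timing can be replaced in tests; production callers can leave the defaults.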

When NOT to Retry

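  • Most 4xx errors (400, 401, 403, 404): The request itself is bad, so retrying won’t fix it. The usual exceptions are 408 (Request Timeout) and 429 (Too Many Requests), which can succeed after a delay.
  • Non-idempotent operations: If we POST a payment and it times out, we can’t tell whether the charge went through, so retrying might charge twice. Only retry operations that are safe to repeat.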
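These rules fit in a small predicate. The sketch below assumes HTTP calls; the function name and the exact set of retryable status codes are our own choices for illustration, and teams tune that set to their services.

```python
# Methods that HTTP defines as idempotent (safe to repeat).
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}

# Assumption: statuses that usually signal a transient condition.
RETRYABLE_STATUS = {408, 429, 502, 503, 504}


def should_retry(method, status=None):
    """Decide whether a failed call may be retried.

    status=None means no response arrived (timeout or connection error).
    """
    if method.upper() not in IDEMPOTENT_METHODS:
        return False            # e.g. POST: a retry might repeat the side effect
    return status is None or status in RETRYABLE_STATUS
```

A timed-out POST is never retried here, while a GET that received a 503 is.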

Combining All Three

In practice, we use circuit breakers, bulkheads, and retries together:

  1. Bulkhead isolates the failure so it doesn’t spread
  2. Circuit breaker stops us from hammering a dead service
  3. Retry with backoff handles transient errors

A request passes through them in that order, from the outside in:

Request
  → Bulkhead (uses dedicated thread pool)
    → Circuit Breaker (checks if service is healthy)
      → Retry with backoff (handles transient failures)
        → Actual service call
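The nesting can be sketched as one wrapper function. Everything here is a simplified stand-in rather than a real library: the names are hypothetical, the bulkhead is a semaphore instead of a thread pool, the breaker counts consecutive failures, and all exceptions are treated as retryable.

```python
import random
import threading
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class BulkheadFullError(Exception):
    """Raised when the dependency's concurrency limit is reached."""


class SimpleBreaker:
    """Consecutive-failure breaker; a stand-in for a real library."""

    def __init__(self, threshold=5, open_seconds=30.0, clock=time.monotonic):
        self.threshold, self.open_seconds, self.clock = threshold, open_seconds, clock
        self.failures = 0
        self.opened_at = None   # None means the breaker is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the open period, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.open_seconds

    def record(self, success):
        if success:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


def retry_with_backoff(fn, retries=3, base=1.0, sleep=time.sleep):
    for attempt in range(retries):
        try:
            return fn()
        except Exception:       # assumption: narrow to transient errors
            sleep(base * 2 ** attempt + random.uniform(0, 0.5))
    return fn()                 # last attempt: let the error propagate


def resilient_call(fn, bulkhead, breaker, **retry_opts):
    # 1. Bulkhead: refuse immediately if this dependency is at capacity.
    if not bulkhead.acquire(blocking=False):
        raise BulkheadFullError()
    try:
        # 2. Circuit breaker: skip the call entirely while open.
        if not breaker.allow_request():
            raise CircuitOpenError()
        try:
            # 3. Retry with backoff around the actual service call.
            result = retry_with_backoff(fn, **retry_opts)
        except Exception:
            breaker.record(success=False)   # one failure per exhausted retry run
            raise
        breaker.record(success=True)
        return result
    finally:
        bulkhead.release()
```

Here `bulkhead` is a `threading.BoundedSemaphore` sized per dependency. Because retries sit inside the breaker, the breaker sees one failure per exhausted retry sequence, not one per attempt.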

Libraries like Resilience4j (Java) and Polly (.NET) implement all three patterns. Hystrix (Java, now in maintenance mode but influential) provides circuit breaking and bulkhead isolation, but has no built-in retry.

Key Takeaway

In simple language, circuit breakers stop us from calling a broken service (fail fast instead of waiting forever). Bulkheads isolate failures so one broken thing doesn’t sink everything. Retries with exponential backoff handle temporary hiccups without overwhelming the system. Together, these three patterns are how we build systems that degrade gracefully instead of collapsing entirely when things go wrong.