Graceful Shutdown


When Docker or Kubernetes wants to stop our app (for a deploy, a scale-down, or a node drain), it sends SIGTERM; PM2 does the same job with SIGINT by default. If our app ignores the signal, after a grace period (10 seconds by default for Docker, 30 for K8s) the orchestrator sends SIGKILL and we get killed mid-request.

That means: dropped HTTP requests, half-committed DB writes, lost jobs. In production, this is unacceptable.

Graceful shutdown is “react to SIGTERM, finish what we’re doing, then exit cleanly.”

The lifecycle

Graceful Shutdown Timeline
t=0 · Orchestrator sends SIGTERM
t=0+ · Health check starts returning 503 → LB stops sending traffic
t=0+ · Stop accepting new connections (server.close())
t=0..N · In-flight requests finish naturally
t=N · Close DB pool, Redis, message queue connections
t=N+ε · process.exit(0)
t=25s · Hard timeout: force exit if still alive (before SIGKILL arrives at 30s)

A minimal Express implementation

import express from 'express';
import { pool } from './db.js';

const app = express();
app.get('/', async (req, res) => {
  await new Promise((r) => setTimeout(r, 2000)); // slow handler
  res.send('hi');
});

const server = app.listen(3000, () => console.log('listening on 3000'));

let shuttingDown = false;

// Health check that flips on shutdown
app.get('/healthz', (req, res) => {
  if (shuttingDown) return res.status(503).send('shutting down');
  res.send('ok');
});

async function shutdown(signal) {
  if (shuttingDown) return;
  shuttingDown = true;
  console.log(`${signal} received, shutting down`);

  // 1. Arm the hard timeout first, so a hang in any later step still
  //    ends the process before SIGKILL hits
  setTimeout(() => {
    console.error('forced exit after 25s');
    process.exit(1);
  }, 25_000).unref();

  try {
    // 2. Stop accepting new connections and wait for in-flight requests.
    //    server.close() reports completion only through its callback,
    //    so wrap it in a promise to be able to await it.
    await new Promise((resolve, reject) => {
      server.close((err) => (err ? reject(err) : resolve()));
    });
    console.log('http server closed');

    // 3. Only now close downstream resources
    await pool.end();        // close pg pool
    // await redis.quit();   // close redis, etc.
    console.log('db closed');
    process.exit(0);
  } catch (err) {
    console.error('cleanup error', err);
    process.exit(1);
  }
}

process.on('SIGTERM', () => shutdown('SIGTERM'));
process.on('SIGINT', () => shutdown('SIGINT'));   // Ctrl+C in dev

A few things worth calling out:

  • server.close() doesn’t kill existing connections. It stops accept() for new ones and waits for the current ones to finish. Exactly what we want.
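  • Health check flips first. The load balancer needs a few seconds to notice we’re unhealthy and route traffic elsewhere. If we close the server immediately, the LB might send us one more request that hits a closed socket. The minimal code above flips the flag and calls server.close() in the same tick, so it relies on something else (such as the preStop sleep covered later) to provide that window.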
  • .unref() on the timeout. So the timer itself doesn’t keep the process alive if everything else finishes early.
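
To give the load balancer the window described in the health-check bullet, one option is to wait between flipping the flag and closing the server. The following is a minimal sketch, not a definitive implementation: delay, closeServer, drain, the state object, and the 5000 ms default are all hypothetical names and values chosen for illustration; only server.close() is real Node API.

```javascript
// Hypothetical helpers; only server.close() is real Node API.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Promisified server.close(): resolves once existing connections finish.
function closeServer(server) {
  return new Promise((resolve, reject) => {
    server.close((err) => (err ? reject(err) : resolve()));
  });
}

// state.shuttingDown is the flag the /healthz handler would read.
// lbNoticeMs is an assumed value; tune it to the LB's probe interval.
async function drain(server, state, lbNoticeMs = 5000) {
  state.shuttingDown = true;   // health check starts returning 503
  await delay(lbNoticeMs);     // keep serving while the LB reroutes
  await closeServer(server);   // then stop accepting and wait for the drain
}
```

The flag lives on an object here only so that the helper can be shared across modules; a module-level variable works just as well. Whatever delay is chosen has to fit inside the hard timeout.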

The “stop accepting + drain” dance

In simple language: we’re telling the world “no more orders please” while still cooking the orders we already accepted. Once the kitchen is clear, we close up shop.

For long-lived connections (WebSockets, SSE), server.close() waits forever because those connections never end on their own. We have to actively tell clients to disconnect:

// For WebSockets
for (const ws of wsServer.clients) {
  ws.close(1001, 'server restarting');
}
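
Some clients never complete the close handshake, so a fallback is useful. The sketch below assumes the ws package, where wsServer.clients is a Set and each socket exposes close(), terminate(), and a numeric readyState; closeWebSockets and the 5000 ms grace value are hypothetical.

```javascript
// Assumes the `ws` package: each client has close(code, reason),
// terminate(), and readyState (3 means CLOSED).
const WS_CLOSED = 3;

function closeWebSockets(wsServer, graceMs = 5000) {
  for (const ws of wsServer.clients) {
    ws.close(1001, 'server restarting'); // 1001 = "going away"
  }
  // Clients that have not finished closing after graceMs are cut off.
  const timer = setTimeout(() => {
    for (const ws of wsServer.clients) {
      if (ws.readyState !== WS_CLOSED) ws.terminate();
    }
  }, graceMs);
  timer.unref(); // do not keep the process alive just for this timer
  return timer;
}
```

Calling this before server.close() gives well-behaved clients a chance to reconnect elsewhere while still bounding how long the drain can take.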

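For HTTP keep-alive, idle connections can hang around and keep server.close() from completing. Use the http-terminator library or call server.closeIdleConnections() (Node 18.2+) to forcibly close idle keep-alive sockets. From Node 19 onward, server.close() closes idle connections itself.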
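
A rough sketch of handling keep-alive by hand, under stated assumptions: connectionCloseOnShutdown and closeIdle are hypothetical helper names, and the state object stands in for the shuttingDown flag. The only real APIs used are res.setHeader() and server.closeIdleConnections().

```javascript
// Express-style middleware (assumed wiring): once shutdown has begun,
// ask clients not to reuse the connection for further requests.
function connectionCloseOnShutdown(state) {
  return (req, res, next) => {
    if (state.shuttingDown) res.setHeader('Connection', 'close');
    next();
  };
}

// Close idle keep-alive sockets where the runtime supports it.
// Returns whether the call was made, so callers can log a fallback.
function closeIdle(server) {
  if (typeof server.closeIdleConnections !== 'function') return false;
  server.closeIdleConnections();
  return true;
}
```

The middleware would need to be registered before the routes, and closeIdle(server) would be called right after server.close(), since sockets that are busy at that moment only become idle later.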

Why Docker/Kubernetes need this

Docker sends SIGTERM to PID 1 in the container, waits for the stop timeout (--stop-timeout, default 10s), then sends SIGKILL.

Kubernetes sends SIGTERM, waits terminationGracePeriodSeconds (default 30s), then SIGKILL.

If our Node app is PID 1 (running directly via CMD ["node", "server.js"]), we receive the signal. Done.

But if we use a shell form (CMD node server.js), the shell becomes PID 1 and does not forward signals. Our Node process never gets SIGTERM, falls to SIGKILL, drops requests. Bad.

Fix: always use exec form in Dockerfile.

# BAD — shell form
CMD node server.js

# GOOD — exec form, Node is PID 1
CMD ["node", "server.js"]

Or use tini / dumb-init as PID 1 if we need signal forwarding (e.g. when running via npm).
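
When it is unclear whether signals will reach Node at all, logging the process ID at startup can help with diagnosis. This is only an illustrative sketch: describePid and its messages are hypothetical, while process.pid and process.ppid are standard Node properties.

```javascript
// Diagnostic only: report where this process sits in the process tree.
function describePid(pid, ppid) {
  if (pid === 1) return 'running as PID 1: signals arrive directly';
  return `running as PID ${pid} (parent ${ppid}): in a container, PID 1 must forward signals to us`;
}

console.log(describePid(process.pid, process.ppid));
```

Seeing a PID other than 1 inside a container is not a problem by itself (that is the expected state under tini or dumb-init), but it is a hint to check what PID 1 is and whether it forwards signals.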

Kubernetes preStop hook

K8s has a subtle race: when a pod is terminated, the SIGTERM is sent at roughly the same time the pod is removed from the Service endpoints list. For a few seconds, traffic might still hit a shutting-down pod.

The fix is a preStop hook that sleeps before the signal is sent:

lifecycle:
  preStop:
    exec:
      command: ["sleep", "5"]

5 seconds is usually enough for the endpoints update to propagate. Our app keeps serving normally during the sleep, then gets SIGTERM and shuts down cleanly.
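
One thing to budget for: Kubernetes starts the terminationGracePeriodSeconds clock before the preStop hook runs, so the sleep eats into the time the app has left after SIGTERM. A small sketch of that arithmetic follows; hardTimeoutMs and the 5-second margin are assumptions for illustration, not values from Kubernetes.

```javascript
// All inputs are in seconds; marginSeconds is an assumed safety buffer.
function hardTimeoutMs({ graceSeconds = 30, preStopSeconds = 0, marginSeconds = 5 } = {}) {
  const remaining = graceSeconds - preStopSeconds - marginSeconds;
  if (remaining <= 0) {
    throw new RangeError('preStop sleep and margin use up the whole grace period');
  }
  return remaining * 1000;
}

console.log(hardTimeoutMs({ preStopSeconds: 5 })); // 20000
```

With the default 30-second grace period and a 5-second preStop sleep, a 25-second hard timeout in the app would expire at the same moment as SIGKILL, so either the timeout should drop to about 20 seconds or terminationGracePeriodSeconds should be raised.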

Common mistakes

  • No timeout. A stuck DB connection hangs shutdown() forever, then SIGKILL kills us. Always have a hard timeout that beats the orchestrator’s.
  • Closing the DB pool before HTTP finishes. Now in-flight requests can’t query the DB and fail. Order matters: HTTP first, then resources.
  • Catching SIGTERM but doing nothing. Worse than not handling it — Node’s default is to exit, our handler overrides that.
  • PM2 cluster reload: same story. PM2 sends SIGINT to each worker and by default waits only 1.6 seconds (kill_timeout) before SIGKILL. If we don’t handle it, reload drops requests.
  • Running with nodemon or a shell wrapper in prod. They eat the signal. Use the runtime directly or tini.
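
As a complement to the global hard timeout, each cleanup step can get its own deadline so one stuck resource does not block the rest. This is a minimal sketch under stated assumptions: withTimeout, runCleanup, and the 5000 ms default are hypothetical, and the steps are whatever close functions the app already has.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run cleanup steps in order. A failure or timeout in one step is logged
// and does not prevent the later steps from running.
async function runCleanup(steps, perStepMs = 5000) {
  const failed = [];
  for (const [label, fn] of steps) {
    try {
      await withTimeout(fn(), perStepMs, label);
    } catch (err) {
      console.error(`cleanup step failed: ${label}`, err.message);
      failed.push(label);
    }
  }
  return failed; // labels of steps that failed or timed out
}
```

A call might look like runCleanup([['pg pool', () => pool.end()]]), with the HTTP server drained before it so the ordering rule above still holds. The per-step deadlines should add up to less than the hard timeout.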