Database Connection Pool Exhaustion — Incident RCA
March 31, 2026 · Prepared for: [Your Organization]
| Severity | Service outage | Peak error rate | Users impacted | Status |
|---|---|---|---|---|
| P2 | 45 min | 15% | 15% of requests | Resolved |
On April 12, 2025, at 16:22 UTC, our primary PostgreSQL database connection pool was exhausted, causing severe API degradation for 45 minutes. A combination of factors led to the incident: (1) a slow analytical query that held connections for 8+ minutes, (2) connection leaks in a recently deployed background job (order-sync-worker v1.2.0), and (3) a traffic spike from a marketing campaign that increased connection demand by 40%. With all 200 connections in use, new requests queued and eventually timed out. P99 API latency reached 30 seconds. Resolution required killing the long-running query, rolling back the leaky worker, and temporarily increasing the pool size.
The primary root cause was connection pool exhaustion from two concurrent issues:
Slow query: An ad-hoc analytical query performed a full table scan on the 50M-row orders table. It held 5 connections for 8+ minutes. The query lacked proper indexing and had no statement timeout.
Connection leak: The order-sync-worker v1.2.0 had a bug where connections were acquired but not released in an error path. Under load, the worker opened connections faster than they were closed, eventually consuming the entire pool.
A contributing factor was the traffic spike from the marketing campaign, which increased normal connection usage and reduced the buffer before exhaustion.
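The leak in the error path can be illustrated with a minimal Python sketch. The `Pool`, `connection`, and `sync_order_*` names below are illustrative stand-ins, not the actual order-sync-worker code; the point is the pattern: a release that only runs on the happy path leaks a slot on every failure, while a `finally`-based context manager releases in both cases.

```python
from contextlib import contextmanager
from queue import Queue, Empty


class Pool:
    """Toy stand-in for a database connection pool."""

    def __init__(self, size: int):
        self.size = size
        self._free = Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout: float = 1.0):
        try:
            return self._free.get(timeout=timeout)
        except Empty:
            raise TimeoutError("pool exhausted")

    def release(self, conn) -> None:
        self._free.put(conn)

    @property
    def in_use(self) -> int:
        return self.size - self._free.qsize()


@contextmanager
def connection(pool: Pool):
    """Release on success and failure alike -- the guard v1.2.0 lacked."""
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)


def process(conn, order) -> None:
    if order.get("bad"):
        raise ValueError("sync failed")


def sync_order_buggy(pool: Pool, order) -> None:
    conn = pool.acquire()
    process(conn, order)   # raises -> release() below never runs
    pool.release(conn)


def sync_order_fixed(pool: Pool, order) -> None:
    with connection(pool) as conn:
        process(conn, order)


pool_a, pool_b = Pool(5), Pool(5)
for _ in range(3):
    try:
        sync_order_buggy(pool_a, {"bad": True})
    except ValueError:
        pass
    try:
        sync_order_fixed(pool_b, {"bad": True})
    except ValueError:
        pass

print(pool_a.in_use)  # 3 -- buggy path leaked one slot per failure
print(pool_b.in_use)  # 0 -- fixed path released every slot
```

Under sustained error rates, the buggy shape drains the pool at the failure rate, which matches the observed climb to 187/200 before exhaustion.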
| Priority | Action | Owner | Due Date | Status |
|---|---|---|---|---|
| P1 | Add `statement_timeout` (5 min) to analytical queries | @backend | 2025-04-15 | Open |
| P1 | Fix connection leak in order-sync-worker; add integration test | @platform | 2025-04-14 | Open |
| P2 | Add index on orders (created_at, status) for the analytical query | @dba | 2025-04-16 | Open |
| P2 | Connection pool monitoring: alert at 80% utilization | @sre | 2025-04-17 | Open |
| P3 | Evaluate PgBouncer for read-replica connection pooling | @platform | 2025-04-30 | Open |
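The 80% utilization alert can be sketched as a simple threshold check. The function name and signature below are hypothetical, not a specific monitoring tool's API; in practice the `in_use` figure would come from the pool's own metrics or from PostgreSQL's `pg_stat_activity`.

```python
def pool_utilization_alert(in_use: int, pool_size: int,
                           threshold: float = 0.80) -> bool:
    """Return True once pool utilization reaches the alert threshold.

    With this incident's pool size of 200, the alert fires at 160
    connections -- well before the 187/200 the leaky worker reached.
    """
    return in_use / pool_size >= threshold


print(pool_utilization_alert(159, 200))  # False (79.5%)
print(pool_utilization_alert(187, 200))  # True (93.5%, the leak's level)
```

A threshold alert like this would have paged roughly at the midpoint of the ~7-minute climb to exhaustion, rather than at full saturation.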
Detection (16:22 UTC): Pool exhausted (200/200); API latency spiked and alerts fired. Roughly 7 minutes elapsed from the first slow query to full exhaustion.
Response (16:22–16:28 UTC): Incident declared. The DB team identified the long-running query and the leaky order-sync-worker, killed the query, and scaled the worker to 0.
Resolution (16:28–17:07 UTC): Connections freed. The worker was rolled back to v1.1.9 and an index added for the analytical query. The pool was stable by 16:42; all systems were operational at 17:07. MTTR: 45 minutes.
Key log excerpts:

```
[ERROR] PostgreSQL connection pool at 200/200. New connections queued.
SELECT * FROM orders WHERE status='pending' -- 8m 12s, 5 connections held. Full table scan.
[WARN] order-sync-worker: connection acquired but not released in error path. Pool usage: 187/200
```