Stripe API Timeout — Incident RCA
March 31, 2026 · Prepared for: [Your Organization]
| Severity | Outage duration | Peak error rate | Users impacted | Revenue impact | Status |
|---|---|---|---|---|---|
| P1 | 47 min | 100% | ~12,000 | ~$340K | Resolved |
On February 14, 2024, between 14:23 and 15:10 UTC, our payment processing pipeline experienced a complete outage affecting all checkout flows. The incident was triggered by Stripe's API returning elevated latency (P99 > 8s) and intermittent 503 responses. Our circuit breaker configuration did not trip quickly enough, causing connection pool exhaustion and cascading failures across the payment service. Service was restored after rolling back a recent deployment and increasing circuit breaker sensitivity. Approximately 12,000 users were unable to complete purchases during the 47-minute window.
The primary root cause was a misconfigured circuit breaker in our payment service. When Stripe's API began returning slow responses and 503s, our circuit breaker required 10 consecutive failures over 30 seconds before opening. During this window, each request held a connection for 8+ seconds (waiting on Stripe timeout), rapidly exhausting our 50-connection pool. Once exhausted, all payment requests failed with 503 regardless of circuit state.
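A back-of-the-envelope model shows why the pool emptied long before the breaker could open. The 50-connection pool, 8 s hold time, and 10-failure threshold are from the incident; the steady 20 req/s arrival rate is an assumption for illustration:

```python
# Sketch: a 50-connection pool exhausts before a 10-failure breaker can open.
# Pool size, hold time, and breaker threshold are from the incident report;
# the arrival rate is an assumed load, not a measured one.

POOL_SIZE = 50
HOLD_SECONDS = 8          # each request waits out the full Stripe timeout
ARRIVAL_RATE = 20         # req/s — hypothetical steady checkout traffic
FAILURES_TO_TRIP = 10

# Every connection is held for 8 s, so the pool fills in:
seconds_to_exhaust = POOL_SIZE / ARRIVAL_RATE   # 2.5 s

# The breaker counts *completed* failures, and the first failure is only
# observed once a request times out at t = 8 s — by which point the pool
# has already been empty for over five seconds.
earliest_failure_observed = HOLD_SECONDS

print(f"pool exhausted after ~{seconds_to_exhaust:.1f}s")
print(f"first timeout failure observed at t={earliest_failure_observed}s")
assert seconds_to_exhaust < earliest_failure_observed
```

Under these assumptions the breaker's failure counter is structurally too slow: it cannot accumulate any failures until the first timeout fires, regardless of the 10-in-30s threshold.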
A contributing factor was the recent deployment (v2.4.1) which increased the default Stripe client timeout from 5s to 15s. This extended the connection hold time and accelerated pool exhaustion.
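The effect of the timeout change can be quantified: in the worst case, where every request waits out the full client timeout, the pool's sustainable throughput is simply pool size divided by timeout. This is a simplified model, not our actual capacity math:

```python
# Simplified worst-case model: during an upstream outage, every request
# holds a connection for the full client timeout, so the pool can absorb
# at most pool_size / timeout requests per second.

POOL_SIZE = 50  # from the incident report

def sustainable_rps(timeout_s: float, pool: int = POOL_SIZE) -> float:
    """Max request rate the pool can absorb when every request
    times out (worst case while Stripe is degraded)."""
    return pool / timeout_s

old_capacity = sustainable_rps(5)    # v2.4.0 timeout → 10.0 req/s
new_capacity = sustainable_rps(15)   # v2.4.1 timeout → ~3.3 req/s

print(f"v2.4.0: {old_capacity:.1f} req/s, v2.4.1: {new_capacity:.1f} req/s")
```

Tripling the timeout cut the degraded-mode capacity of the pool to a third, which is why v2.4.1 accelerated exhaustion even though it changed no pool settings.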
| Priority | Action | Owner | Due Date | Status |
|---|---|---|---|---|
| P1 | Reduce circuit breaker threshold (10/30s → 5/10s) | @platform | 2024-02-18 | Open |
| P1 | Implement payment fallback — cached card tokens when Stripe degraded | @payments | 2024-02-25 | Open |
| P2 | Connection pool tuning — increase size, per-endpoint limits | @platform | 2024-02-20 | Open |
| P2 | Stripe status webhook — proactive alerting | @sre | 2024-02-22 | Open |
| P3 | Runbook update — rollback procedure, Stripe degradation playbook | @oncall | 2024-02-16 | Open |
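The P1 fallback item amounts to a degraded-mode branch around the breaker. A minimal sketch of the shape of that change — every name here (`CircuitOpenError`, `charge_via_stripe`, `queue_with_cached_token`) is hypothetical, since the real design is still owned by @payments:

```python
# Sketch of the planned fallback: when the breaker is open, persist the
# charge against the customer's stored card token and settle it
# asynchronously once Stripe recovers. All identifiers are illustrative.

class CircuitOpenError(Exception):
    """Raised by the breaker when Stripe is considered down."""

def charge_via_stripe(order_id: str) -> str:
    # Stand-in for the real Stripe call; in this sketch the breaker
    # is always open so the fallback path is exercised.
    raise CircuitOpenError

def queue_with_cached_token(order_id: str) -> str:
    # Record the charge for async settlement instead of failing checkout.
    return f"queued:{order_id}"

def charge(order_id: str) -> str:
    try:
        return charge_via_stripe(order_id)
    except CircuitOpenError:
        return queue_with_cached_token(order_id)

print(charge("order-123"))
```

The key property is that checkout degrades to "accepted, settling later" rather than a hard 503, decoupling user-facing success from Stripe availability.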
Detection (14:23 UTC): First alert: Stripe API P99 latency > 5s. On-call paged. MTTD: ~5 minutes from Stripe degradation to our detection.
Response (14:23–14:48 UTC): Payment service began failing. Circuit breaker did not trip. Connection pool exhausted by 14:32. War room established. Stripe status page confirmed upstream issue. Decision: rollback payment-service.
Resolution (14:48–15:10 UTC): Rollback to v2.4.0 completed at 14:55. Stripe resolved by 15:02. All systems operational at 15:10. MTTR: 47 minutes.
{"level":"error","msg":"Stripe API timeout","duration_ms":8000,"error":"context deadline exceeded","retry_count":3}
[ERROR] payment-service Connection pool exhausted (50/50). Rejecting new requests.
Stripe Status: Degraded Performance. API latency elevated. Incident STRIPE-2024-0214