ProdRescue AI · Incident RCA

Stripe API Timeout — Incident RCA

March 31, 2026 · Prepared for: [Your Organization]

  • Severity: P1
  • Service outage: 47 min
  • Peak error rate: 100%
  • Users impacted: ~12,000
  • Revenue impact: ~$340K
  • Status: Resolved

Confidential · Mar 31, 2026

Executive Summary

On February 14, 2024, between 14:23 UTC and 15:10 UTC, our payment processing pipeline experienced a complete outage affecting all checkout flows. The incident was triggered by Stripe's API returning elevated latency (P99 > 8s) and intermittent 503 responses. Our circuit breaker configuration did not trip quickly enough, causing connection pool exhaustion and cascading failures across the payment service. Service was restored after rolling back a recent deployment and increasing circuit breaker sensitivity. Approximately 12,000 users were unable to complete purchases during the 47-minute window.


Timeline

  • 14:23 UTC — First alerts: Stripe API P99 latency exceeds 5s. On-call engineer paged.
  • 14:28 UTC — Payment service begins returning 503. Circuit breaker still CLOSED.
  • 14:32 UTC — Connection pool exhaustion detected. All payment requests failing.
  • 14:35 UTC — Incident declared. War room established.
  • 14:42 UTC — Stripe status page confirms API degradation. We verify upstream issue.
  • 14:48 UTC — Decision: roll back payment-service v2.4.1 to v2.4.0.
  • 14:55 UTC — Rollback complete. Partial recovery observed.
  • 15:02 UTC — Stripe reports resolution. Our metrics normalize.
  • 15:10 UTC — All systems operational. Incident resolved.

Root Cause Analysis

The primary root cause was a misconfigured circuit breaker in our payment service. When Stripe's API began returning slow responses and 503s, our circuit breaker required 10 consecutive failures over 30 seconds before opening. During this window, each request held a connection for 8+ seconds (waiting on Stripe timeout), rapidly exhausting our 50-connection pool. Once exhausted, all payment requests failed with 503 regardless of circuit state.

A contributing factor was the recent deployment (v2.4.1) which increased the default Stripe client timeout from 5s to 15s. This extended the connection hold time and accelerated pool exhaustion.
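The exhaustion dynamics follow directly from the figures above. A minimal sketch, using the RCA's pool size and connection hold time; the sustained checkout request rate is an assumed figure for illustration, not stated in the report:

```go
package main

import "fmt"

func main() {
	// Figures from the RCA: 50-connection pool, each failing request
	// holds its connection ~8s waiting on the Stripe timeout.
	poolSize := 50.0
	holdSeconds := 8.0
	breakerWindowSeconds := 30.0 // breaker needed 10 failures over 30s to open

	// Assumed sustained checkout rate (hypothetical, for illustration only).
	requestsPerSecond := 20.0

	// Steady-state demand for connections = arrival rate x hold time
	// (Little's law). Anything above poolSize means exhaustion.
	inUse := requestsPerSecond * holdSeconds
	fmt.Printf("connections demanded: %.0f of %.0f\n", inUse, poolSize)

	// Time to drain the pool from a cold start vs. time for the breaker to trip.
	exhaustAfter := poolSize / requestsPerSecond
	fmt.Printf("pool exhausted after ~%.1fs; breaker trips after %.0fs\n",
		exhaustAfter, breakerWindowSeconds)
}
```

Under these assumptions the pool drains in a few seconds, an order of magnitude before the 30-second breaker window closes, which is why the breaker state was irrelevant once exhaustion set in.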


Impact

  • Duration: 47 minutes
  • Users affected: ~12,000 (checkout attempts during window)
  • Failed transactions: ~3,400
  • Estimated lost revenue: ~$340,000
  • Customer support tickets: 847
  • Stripe API dependency: Confirmed as single point of failure for payment flow

Action Items

Priority | Action | Owner | Due Date | Status
[ ] P1 | Reduce circuit breaker threshold (10/30s → 5/10s) | @platform | 2024-02-18 | Open
[ ] P1 | Implement payment fallback — cached card tokens when Stripe degraded | @payments | 2024-02-25 | Open
[ ] P2 | Connection pool tuning — increase size, per-endpoint limits | @platform | 2024-02-20 | Open
[ ] P2 | Stripe status webhook — proactive alerting | @sre | 2024-02-22 | Open
[ ] P3 | Runbook update — rollback procedure, Stripe degradation playbook | @oncall | 2024-02-16 | Open

Detection, Response & Resolution

Detection (14:23 UTC): First alert: Stripe API P99 latency > 5s. On-call paged. MTTD: ~5 minutes from Stripe degradation to our detection.

Response (14:23–14:48 UTC): Payment service began failing. Circuit breaker did not trip. Connection pool exhausted by 14:32. War room established. Stripe status page confirmed upstream issue. Decision: rollback payment-service.

Resolution (14:48–15:10 UTC): Rollback to v2.4.0 completed at 14:55. Stripe resolved by 15:02. All systems operational at 15:10. MTTR: 47 minutes.


5 Whys Analysis

  1. Why did checkout fail? → Payment service returned 503 for all requests.
  2. Why 503? → Connection pool exhausted — no connections available for Stripe API calls.
  3. Why exhausted? → Circuit breaker required 10 failures over 30s; each request held connection 8+ seconds.
  4. Why so long? → v2.4.1 increased Stripe timeout from 5s to 15s.
  5. Why wasn't this tested? → ROOT CAUSE: No Stripe degradation simulation in staging; circuit breaker thresholds not validated under load.

Prevention Checklist

  • Circuit breaker: 5 failures / 10 seconds (fail-fast)
  • Stripe status webhook for proactive alerting
  • Payment fallback: cached card tokens when Stripe degraded
  • Connection pool: per-endpoint limits, increase size
  • Runbook: Stripe degradation playbook, rollback procedure
  • Load test: simulate Stripe 503/latency in staging

Evidence & Log Samples

{"level":"error","msg":"Stripe API timeout","duration_ms":8000,"error":"context deadline exceeded","retry_count":3}
[ERROR] payment-service Connection pool exhausted (50/50). Rejecting new requests.
Stripe Status: Degraded Performance. API latency elevated. Incident STRIPE-2024-0214

Lessons Learned

  • Fail-fast is critical for upstream dependencies. Our circuit breaker was too tolerant.
  • Timeout increases have multiplicative effects. The 5s → 15s change dramatically increased pool exhaustion rate.
  • Upstream status pages should drive our alerting. A Stripe status webhook would have given us a 5–10 minute head start.
  • Payment flow needs redundancy. We will prioritize the cached-token fallback to reduce single-vendor dependency.
