Kubernetes Crash Loop — Incident RCA
March 31, 2026 · Prepared for: [Your Organization]
| Severity | Service outage | Peak error rate | Users impacted | Revenue impact | Status |
|---|---|---|---|---|---|
| P1 | 18 min | 92% | ~8,000 | Checkout abandonment | Resolved |
On February 7, 2024, the checkout-api service entered a crash loop shortly after the deployment of v2.15.0. Pods repeatedly crashed with `panic: nil pointer dereference` in `PaymentService.Process()`. The incident lasted 18 minutes and affected approximately 8,000 users attempting checkout. The root cause was a missing nil check in the new payment-flow code. Rolling back to v2.14.3 restored service within 6 minutes of the decision.
The primary root cause was a nil pointer dereference in `PaymentService.Process()`, introduced in checkout-api v2.15.0. The new code path did not validate that the Stripe client was initialized before use when the Redis session cache was unavailable. Under the cascading failure (Redis failover plus Stripe latency), requests hit the unvalidated path and panicked.
A contributing factor was the Redis cluster failure, which increased auth-service load and drove the cache-miss rate to 94%. This shifted load to the database and added latency that exposed the nil pointer path more frequently.
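The failure mode above can be sketched in Go. This is a minimal illustration, not the actual checkout-api code: the `Charger` interface and field names are assumptions standing in for the real Stripe client wiring.

```go
package main

import (
	"errors"
	"fmt"
)

// Charger stands in for the Stripe client; illustrative only.
type Charger interface {
	Charge(amountCents int) error
}

// PaymentService mirrors the bug: the client may be left nil when the
// Redis session cache is unavailable and initialization is skipped.
type PaymentService struct {
	stripe Charger // nil on the degraded path in v2.15.0-style code
}

// Process adds the nil check that was missing in v2.15.0; without it,
// calling through the nil interface panics and crash-loops the pod.
func (p *PaymentService) Process(amountCents int) error {
	if p.stripe == nil { // the missing validation
		return errors.New("payment backend not initialized")
	}
	return p.stripe.Charge(amountCents)
}

func main() {
	svc := &PaymentService{stripe: nil} // degraded path: no client
	if err := svc.Process(4200); err != nil {
		fmt.Println("handled gracefully:", err)
	}
}
```

Returning an error here lets the request fail cleanly (and surface a 5xx) instead of killing the process, which is what turned one bad code path into a fleet-wide crash loop.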
| Priority | Action | Owner | Due Date | Status |
|---|---|---|---|---|
| P1 | Add nil pointer validation to all error paths in PaymentService | @bob_dev | 2024-02-09 | In Progress |
| P1 | Add unit tests for Stripe API timeout scenarios in checkout-api | @bob_dev | 2024-02-09 | Open |
| P1 | Implement mandatory canary deployment phase for checkout-api | @alice_sre | 2024-02-14 | Open |
| P2 | Increase Redis cluster redundancy (6 → 9 nodes with 3 AZ spread) | @charli_e_db | 2024-02-21 | Open |
| P2 | Add automated rollback trigger on panic rate threshold (>5%) | @alice_sre | 2024-02-21 | Open |
| P3 | Add integration test suite for external API failure modes | @bob_dev | 2024-02-28 | Open |
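The P2 rollback-trigger item could be expressed as a Prometheus alerting rule. This is a sketch under assumptions: the metric names (`checkout_api_panics_total`, `checkout_api_requests_total`) and label values are hypothetical and depend on how checkout-api is actually instrumented.

```yaml
groups:
  - name: checkout-api-rollback
    rules:
      - alert: CheckoutApiPanicRateHigh
        # Assumed counters; real names depend on the service's instrumentation.
        expr: |
          sum(rate(checkout_api_panics_total[2m]))
            / sum(rate(checkout_api_requests_total[2m])) > 0.05
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "checkout-api panic rate above 5%; candidate for automated rollback"
```

An automation hook (for example, an Alertmanager webhook into the deploy system) would act on this alert to trigger the rollback; that wiring is outside the scope of this sketch.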
Detection (09:00:00–09:00:02 UTC): The incident was detected within 2 seconds when the first checkout-api pod panicked. PagerDuty INC-8842 was auto-created via our Slack integration. Mean Time to Detect (MTTD): 2 seconds.
Response (09:00:02–09:00:35 UTC): On-call engineer @alice_sre acknowledged within 3 seconds. War room identified Stripe timeouts and Redis cluster failure as contributing factors. Decision to rollback was made 30 seconds after confirming the crash loop. Rollback initiated at 09:00:35.
Resolution (09:00:35–09:01:15 UTC): Rollback to v2.14.3 completed. Redis failover finished at 09:00:48. First successful checkout at 09:00:59. All pods healthy by 09:01:15. Mean Time to Resolve (MTTR): 59 seconds.
```
2024-02-07T09:00:00.123Z [FATAL] checkout-api-pod-7x9m2 panic: runtime error: invalid memory address or nil pointer dereference
goroutine 42 [running]: github.com/shop/checkout.(*PaymentService).Process(...)
{"ts":1707318005,"level":"error","msg":"Stripe API timeout","service":"payment-gateway","duration_ms":30000,"error":"context deadline exceeded"}
Feb 07, 2024 09:00:07 UTC [ERROR] auth-service Redis cluster: MOVED 15389 - connection to node failed after 3 attempts. Falling back to DB, cache miss rate: 94%
```