War Story: How a Kubernetes 1.32 Node OOM Kill Cascaded Into a 2-Hour Outage for Our Video Streaming Service
On March 14, 2024, a single Kubernetes 1.32 node OOM kill triggered a cascading 2-hour outage that cost a video streaming service 92% of its 4.2 million concurrent viewers. The root cause was traced to the kubelet underreporting memory usage by 22% in sidecar containers under cgroups v2, leaving pods with insufficient memory headroom. Implementing pod-level memory limits with 15% headroom reduced OOM failures by 94% and avoided significant SLA penalties.
- A Kubernetes 1.32 node OOM kill led to a 2-hour outage affecting 92% of traffic for a video streaming service.
- The kubelet underreported RSS memory by 22% in high-throughput network workloads using cgroups v2.
- Sidecar containers like istio-proxy and linkerd-proxy were especially affected by the memory accounting bug.
- Implementing 15% memory headroom at the pod level reduced OOM-related node failures by 94% (see the sketch after this list).
- Kubernetes 1.33 plans to fix the memory accounting issue with a kubelet refactor.
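
To make the headroom remediation concrete, here is a minimal sketch (not the author's actual tooling) of how a pod-level memory limit with ~15% headroom could be derived from observed cgroup v2 usage. The container names, peak values, and helper functions are hypothetical; only the 15% headroom figure comes from the article, and `memory.current` is the standard cgroup v2 usage file.

```python
from pathlib import Path


def container_memory_current(cgroup_dir: Path) -> int:
    """Read a container's current memory usage in bytes from its cgroup v2
    directory (on a typical systemd node, somewhere under
    /sys/fs/cgroup/kubepods.slice/)."""
    return int((cgroup_dir / "memory.current").read_text().strip())


def limit_with_headroom(peak_bytes: int, headroom: float = 0.15) -> int:
    """Peak observed usage plus headroom, rounded up to a whole MiB."""
    mib = 1024 * 1024
    raw = int(peak_bytes * (1 + headroom))
    return ((raw + mib - 1) // mib) * mib


if __name__ == "__main__":
    # Hypothetical observed peaks (bytes) for an app container and its sidecar;
    # on a live node these would come from container_memory_current() samples.
    observed_peaks = {
        "app": 1800 * 1024 * 1024,
        "istio-proxy": 240 * 1024 * 1024,
    }
    pod_peak = sum(observed_peaks.values())
    limit = limit_with_headroom(pod_peak)
    print(f"pod peak {pod_peak // 2**20}Mi -> limit with 15% headroom {limit // 2**20}Mi")
```

The resulting figure would then be applied as the pod's memory limit (or split across per-container `resources.limits.memory` entries), which is roughly what the article's 15% headroom change amounts to.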
Opening excerpt (first ~120 words)
Ankush Choudhary Johal • Posted on May 2 • Originally published at johal.in • #story #kubernetes #node #kill

At 19:42 UTC on March 14, 2024, our video streaming service serving 4.2 million concurrent viewers lost 92% of traffic in 11 minutes, triggered by a single Kubernetes 1.32 node OOM kill that cascaded across 18 availability zones.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV Community.