War Story: How a Kubernetes 1.32 Node OOM Kill Cascaded Into a 2-Hour Outage for Our Video Streaming Service
On March 14, 2024, a single Kubernetes 1.32 node OOM kill triggered a cascading 2-hour outage that cost a video streaming service 92% of its 4.2 million concurrent viewers. The root cause was traced to the kubelet underreporting memory usage by 22% in sidecar containers under cgroups v2, leaving pods with insufficient memory headroom. Implementing pod-level memory limits with 15% headroom reduced OOM failures by 94% and avoided significant SLA penalties.
- A Kubernetes 1.32 node OOM kill led to a 2-hour outage affecting 92% of traffic for a video streaming service.
- The kubelet underreported RSS memory by 22% in high-throughput network workloads using cgroups v2.
- Sidecar containers like istio-proxy and linkerd-proxy were especially affected by the memory accounting bug.
- Implementing 15% memory headroom at the pod level reduced OOM-related node failures by 94% (see the sketch after this list).
- Kubernetes 1.33 plans to fix the memory accounting issue with a kubelet refactor.
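
To make the headroom remediation concrete, here is a minimal sketch (not the author's actual tooling) of how a pod-level memory limit with ~15% headroom could be derived from observed cgroup v2 usage. The container names, peak values, and helper functions are hypothetical; only the 15% headroom figure comes from the article, and `memory.current` is the standard cgroup v2 usage file.

```python
from pathlib import Path


def container_memory_current(cgroup_dir: Path) -> int:
    """Read a container's current memory usage in bytes from its cgroup v2
    directory (on a typical systemd node, somewhere under
    /sys/fs/cgroup/kubepods.slice/)."""
    return int((cgroup_dir / "memory.current").read_text().strip())


def limit_with_headroom(peak_bytes: int, headroom: float = 0.15) -> int:
    """Peak observed usage plus headroom, rounded up to a whole MiB."""
    mib = 1024 * 1024
    raw = int(peak_bytes * (1 + headroom))
    return ((raw + mib - 1) // mib) * mib


if __name__ == "__main__":
    # Hypothetical observed peaks (bytes) for an app container and its sidecar;
    # on a live node these would come from container_memory_current() samples.
    observed_peaks = {
        "app": 1800 * 1024 * 1024,
        "istio-proxy": 240 * 1024 * 1024,
    }
    pod_peak = sum(observed_peaks.values())
    limit = limit_with_headroom(pod_peak)
    print(f"pod peak {pod_peak // 2**20}Mi -> limit with 15% headroom {limit // 2**20}Mi")
```

The resulting figure would then be applied as the pod's memory limit (or split across per-container `resources.limits.memory` entries), which is roughly what the article's 15% headroom change amounts to.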
Opening excerpt (first ~120 words)
Ankush Choudhary Johal • Posted on May 2 • Originally published at johal.in • #story #kubernetes #node #kill

At 19:42 UTC on March 14, 2024, our video streaming service serving 4.2 million concurrent viewers lost 92% of traffic in 11 minutes, triggered by a single Kubernetes 1.32 node OOM kill that cascaded across 18 availability zones.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV Community.