When Retries Turn Hostile — How Control Logic Kills Production Systems

May 1, 2026 · 5:04 PM UTC · 5 min read

#reliability #devops #sre #programming #system design

⚡ TL;DR · AI summary

Retries in production systems, intended to handle failures, can exacerbate outages when not carefully designed, as seen in the 2012 Knight Capital incident that resulted in $440 million in losses. Patterns like dogpile effects, cascading failures, and long timeouts can create self-inflicted system damage during recovery. Safe retry strategies such as exponential backoff, jitter, and retry budgets are essential to prevent destructive collective behavior.
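As a rough illustration of the backoff and jitter ideas in the summary, here is a minimal Python sketch. It is not taken from the original article: the names `backoff_delay` and `call_with_retries`, the default values, and the choice of "full jitter" (a uniformly random delay between zero and the exponential ceiling) are all assumptions made for this example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Return a 'full jitter' delay in seconds for a 0-based attempt number.

    Hypothetical helper; base and cap are illustrative defaults.
    """
    # Exponential ceiling: base * 2^attempt, capped so delays stay bounded.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so clients do not retry in lockstep.
    return random.uniform(0.0, ceiling)


def call_with_retries(
    op: Callable[[], T],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op, retrying on any exception with jittered exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting. A real client would likely retry only on errors it knows to be transient, and only for operations that are safe to repeat.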

Key facts

▪ Knight Capital lost $440 million due to a retry-like feedback loop from legacy code activating during deployment.
▪ Retries can worsen outages by overwhelming already degraded services with repeated requests.
▪ Exponential backoff, jitter, and retry budgets are recommended strategies to prevent retry storms.
▪ Slow responses are more harmful than errors because they tie up system resources like threads and connections.
▪ Idempotency is a prerequisite for safe retries; non-idempotent endpoints can create unintended side effects when retried.
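The retry budget mentioned in the list above can be sketched as a simple token account. This is one possible shape, not the article's implementation: the class name `RetryBudget`, the 10 percent default, and the cap on banked retries are assumptions for illustration. Each first attempt earns a fraction of a retry, each retry spends a whole one, so total retries stay at roughly a fixed share of normal traffic even when every call is failing.

```python
class RetryBudget:
    """Allow retries only up to a percentage of first-attempt traffic.

    Hypothetical sketch: balances are kept in integer units
    (100 units = one retry) to avoid floating-point drift.
    """

    COST_PER_RETRY = 100

    def __init__(self, retry_percent: int = 10, max_banked_retries: int = 10) -> None:
        self.deposit = retry_percent  # units earned per first attempt
        self.cap = max_banked_retries * self.COST_PER_RETRY
        self.balance = 0

    def record_request(self) -> None:
        """Call once per first attempt; earns a fraction of a retry."""
        self.balance = min(self.cap, self.balance + self.deposit)

    def try_spend(self) -> bool:
        """Call before each retry; returns False when the budget is exhausted."""
        if self.balance >= self.COST_PER_RETRY:
            self.balance -= self.COST_PER_RETRY
            return True
        return False


budget = RetryBudget(retry_percent=10)
for _ in range(20):
    budget.record_request()  # 20 first attempts earn 2 retries
allowed = [budget.try_spend() for _ in range(3)]
print(allowed)  # [True, True, False]
```

When `try_spend()` returns False, the caller would fail fast instead of retrying, which is what keeps a degraded dependency from being hit with a multiple of its usual load. This sketch is not thread-safe; a shared budget in a concurrent client would need a lock or atomic counter.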

Original article

DEV.to (Top)

Opening excerpt (first ~120 words)

Ken Imoto · Posted on May 1

"Your retries are killing us."

A service team received this message from a downstream dependency during an outage. The upstream API was timing out, so naturally, the client retried. 3 times, 5 times, 10 times. The client thought it was doing the right thing.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to (Top).
