When Retries Turn Hostile — How Control Logic Kills Production Systems
Retries in production systems, intended to handle failures, can exacerbate outages when not carefully designed, as seen in the 2012 Knight Capital incident that resulted in $440 million in losses. Patterns like dogpile effects, cascading failures, and long timeouts can create self-inflicted system damage during recovery. Safe retry strategies such as exponential backoff, jitter, and retry budgets are essential to prevent destructive collective behavior.
- Knight Capital lost $440 million due to a retry-like feedback loop from legacy code activating during deployment.
- Retries can worsen outages by overwhelming already degraded services with repeated requests.
- Exponential backoff, jitter, and retry budgets are recommended strategies to prevent retry storms.
- Slow responses are more harmful than errors because they tie up system resources like threads and connections.
- Idempotency is a prerequisite for safe retries; non-idempotent endpoints can create unintended side effects when retried.
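The three strategies named above can be combined in a few lines. The following is a minimal sketch, not the article's own implementation: `RetryBudget`, `backoff_delay`, and `call_with_retries` are hypothetical names, and the 10% budget ratio, 0.1 s base delay, and 10 s cap are illustrative assumptions rather than recommended values. It uses "full jitter" (a uniformly random delay up to the exponential ceiling) so that many clients do not retry in lockstep.

```python
import random
import time


class RetryBudget:
    """Hypothetical budget: retries may not exceed a fraction of all requests."""

    def __init__(self, ratio=0.1, min_retries=3):
        self.ratio = ratio              # assumed: retries capped at 10% of requests
        self.min_retries = min_retries  # small floor so low-traffic clients can retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        """Consume one retry if the budget allows it; return False otherwise."""
        allowed = max(self.min_retries, self.ratio * self.requests)
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=10.0):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, budget, max_attempts=4, sleep=time.sleep):
    """Call `operation`, retrying on ConnectionError while attempts and budget remain.

    `operation` is assumed to be idempotent; retrying anything else can
    duplicate side effects.
    """
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            is_last = attempt == max_attempts - 1
            if is_last or not budget.try_spend():
                raise  # give up instead of adding load to a degraded service
            sleep(backoff_delay(attempt))
```

Sharing one `RetryBudget` across all calls from a client is what bounds the collective behavior: once the service is failing broadly, the budget runs out and further failures surface immediately instead of multiplying traffic.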
Opening excerpt (first ~120 words)
Ken Imoto, posted on May 1
#sre #devops #reliability #programming

"Your retries are killing us." A service team received this message from a downstream dependency during an outage. The upstream API was timing out, so naturally, the client retried. 3 times, 5 times, 10 times. The client thought it was doing the right thing.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.