The Operators Regret: How We Blew Up the Event Bus at 3 AM
The article describes a team's struggle to achieve exactly-once event delivery in a system built on Kafka and Redis. After several failed attempts to fix event loss and lag, the team simplified the architecture: they replaced Kafka Streams with a choreographed saga and introduced a dedicated service to handle event processing.
- The original system architecture involved Kafka, Kafka Streams, and Redis, but suffered from event loss and lag issues.
- Attempts to fix the problems included increasing Kafka Streams threads and using an outbox table in PostgreSQL, but these led to further complications.
- The team ultimately redesigned their architecture by removing Kafka Streams and implementing a dedicated service called TreasureSink to manage event processing.
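
The outbox table mentioned in the summary can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the team's implementation: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the table names (wallets, outbox), the functions spend_gold and relay_once, and the gold_spent event shape are all hypothetical.

```python
import json
import sqlite3


def open_db():
    # In-memory SQLite stands in for PostgreSQL (assumption for this sketch).
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE wallets (player_id TEXT PRIMARY KEY, gold INTEGER NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            sent INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def spend_gold(conn, player_id, amount):
    # The state change and the event row are written in one transaction,
    # so they either both commit or both roll back.
    with conn:
        cur = conn.execute(
            "UPDATE wallets SET gold = gold - ? WHERE player_id = ? AND gold >= ?",
            (amount, player_id, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("insufficient gold or unknown player")
        event = {"type": "gold_spent", "player_id": player_id, "amount": amount}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_once(conn, publish):
    # Publish every unsent event in insertion order, then mark it as sent.
    # `publish` is a hypothetical callback, e.g. a wrapper around a broker client.
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Note that a relay like this is at-least-once, not exactly-once: if it crashes after publishing but before marking the row as sent, the event is published again on the next pass, so downstream consumers would still need to deduplicate.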
Opening excerpt (first ~120 words)
Lillian Dube
Posted on May 27
#webdev #programming #architecture #systems

The Problem We Were Actually Solving

At 02:47 the Redis counters began to drift by as much as 18%. Players who had just spent 300 gold on a dig turned around and screamed on Discord that the server had stolen their loot. We had a classic symptom: event loss. Our original topology was Kafka → Kafka Streams → Redis.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.