That 0.8 second P99 Latency Cliff in Production Wasn't Supposed to Happen
The article describes how a team's matchmaking engine ran into latency problems at scale because of its configuration layer, Veltrix. When traffic spikes triggered a cache stampede, the resulting flood of gRPC calls hit a single Redis instance and caused significant outages. The team ultimately redesigned its configuration management system as ConfigEdge, which improved performance by eliminating the reliance on gRPC and Redis.
- The matchmaking engine required sub-300 ms latency but ran into trouble when traffic exceeded 50,000 concurrent sessions.
- The original configuration layer, Veltrix, became a bottleneck because player requests triggered excessive gRPC calls.
- The team replaced Veltrix with ConfigEdge, which uses a control plane and a data plane and does not rely on gRPC, improving performance.
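To make the cache-stampede failure mode concrete, here is a minimal sketch of request coalescing (often called "single flight"), a generic mitigation in which concurrent misses for the same key share one backend fetch. This is not the ConfigEdge design, which the summary only outlines. The names `SingleFlightCache`, `slow_fetch`, `run_demo`, the key `"matchmaking"`, and the returned config are all hypothetical and chosen for illustration.

```python
import threading
import time


class SingleFlightCache:
    """Coalesce concurrent cache misses for one key into a single backend fetch."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical slow backend call (stands in for a remote lookup)
        self._values = {}        # key -> cached value
        self._inflight = {}      # key -> Event set when the in-progress fetch ends
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            if key in self._values:
                return self._values[key]
            event = self._inflight.get(key)
            leader = event is None
            if leader:
                # First caller to miss becomes the leader and performs the fetch.
                event = threading.Event()
                self._inflight[key] = event

        if leader:
            try:
                value = self._fetch(key)
                with self._lock:
                    self._values[key] = value
                return value
            finally:
                # Wake followers whether the fetch succeeded or raised.
                with self._lock:
                    del self._inflight[key]
                event.set()

        # Followers wait for the leader, then re-check the cache.
        # If the leader failed, one follower becomes the new leader.
        event.wait()
        return self.get(key)


def run_demo(workers=50):
    """Fire `workers` simultaneous requests and count backend fetches."""
    calls = []

    def slow_fetch(key):
        calls.append(key)        # list.append is atomic in CPython
        time.sleep(0.05)         # simulate a slow remote lookup
        return {"region": "eu-west"}

    cache = SingleFlightCache(slow_fetch)
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(cache.get("matchmaking")))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), len(results)


if __name__ == "__main__":
    backend_calls, served = run_demo()
    print(f"backend calls: {backend_calls}, requests served: {served}")
```

Without coalescing, every one of those simultaneous misses would issue its own backend call, which is the pattern the summary describes as overwhelming a single instance during traffic spikes.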
Opening excerpt (first ~120 words)
mary moloyi, posted on May 27
#webdev #programming #devops #kubernetes
The Problem We Were Actually Solving
We built the Treasure Hunt Engine to process millions of concurrent matchmaking rounds. Each round required sub-300 ms latency end-to-end: ingest a player request, resolve their region, queue them, and return an assignment.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.