Horror Stories from a Former Azure Engineer
A former Azure engineer discusses the challenges and failures faced by the Azure team in managing virtual machine density. The push to increase VM capacity led to a significant rise in crashes and incidents, highlighting systemic issues within the infrastructure. The engineer's departure from the project underscores the difficulties in achieving a sustainable and secure cloud environment.
- The Azure team aimed to increase the number of VMs per node from 16 to 48, with a long-term goal of 64.
- This increase resulted in a 50% rise in crashes and incidents, indicating a failure in the scaling strategy.
- The instance metadata services on Azure posed security risks by running a web service on the host OS, increasing the attack surface for potential breaches.
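To make the density figures above concrete, here is a minimal sketch of the arithmetic behind the business case. Only the per-node counts (16, 48, 64) come from the summary; the fleet size and the helper names are hypothetical, chosen purely for illustration.

```python
def density_multiplier(old_vms_per_node: int, new_vms_per_node: int) -> float:
    """How many times more VMs a node hosts after the density change."""
    return new_vms_per_node / old_vms_per_node


def nodes_needed(total_vms: int, vms_per_node: int) -> int:
    """Nodes required to host total_vms, rounding up to whole nodes."""
    return -(-total_vms // vms_per_node)  # ceiling division


# Per-node densities quoted in the summary.
BASELINE, TARGET, LONG_TERM = 16, 48, 64

# Hypothetical fleet size, not taken from the article.
HYPOTHETICAL_FLEET_VMS = 9600

for density in (BASELINE, TARGET, LONG_TERM):
    multiplier = density_multiplier(BASELINE, density)
    nodes = nodes_needed(HYPOTHETICAL_FLEET_VMS, density)
    print(f"{density} VMs/node: {multiplier:.0f}x baseline, {nodes} nodes")
```

Under these assumptions, going from 16 to 48 VMs per node triples what each server hosts, so the same workload fits on a third of the nodes. The sketch deliberately ignores the cost side the article focuses on: the rise in crashes and incidents that accompanied the higher density.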
Opening excerpt (first ~120 words)
How Microsoft Vaporized a Trillion Dollars, Pt. 4
Inside the complacency and decisions that eroded trust in Azure, from a former Azure Core engineer.
Axel Rietschin
Apr 01, 2026
(Continued from Part 3)
Azure has operated under constant strain for as long as I can remember. Even during the periodic “quality pushes,” the backlog of issues never shrank; it only grew. In the spring and summer of 2024, a major push began to raise the number of VMs each node could host. The business case was straightforward: scaling up density on existing servers is far cheaper than building new data centers. On-premise Azure deployments had always been capped at 16 VMs per node.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Substack.