The Backup That Wasn't
On January 31, 2017, a GitLab engineer accidentally deleted about 300 GB of production data from the primary PostgreSQL database while intending to clean up a secondary instance. GitLab had five backup mechanisms in place, yet none of them turned out to be working when needed, which exposed critical gaps in monitoring and verification. The incident became a landmark case in operational reliability and a common argument for validating backups by actually restoring them.
- An engineer mistakenly ran 'rm -rf' on the primary database instead of the secondary; the two terminal sessions looked nearly identical and differed only in the hostname shown in the prompt.
- All five of GitLab's backup mechanisms failed: pg_dump produced empty files, alert emails were silently rejected, LVM snapshots were unscheduled, Azure disk snapshots were misconfigured, and the only usable backup was an ad-hoc snapshot taken…
- The only recoverable data came from an LVM snapshot that had been taken for staging purposes rather than as a backup, and that happened to be about six hours old.
- GitLab responded transparently by live-streaming recovery efforts, publishing internal logs, and releasing a detailed postmortem that became a standard reference in system administration.
- The incident underscored the importance of regularly testing backups by restoring them, rather than assuming they are valid because the backup job logged a successful run.
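The last point can be made concrete with a small check that refuses to call a backup good just because a job ran. The following is a minimal sketch in Python, not GitLab's tooling: the function name `check_backup`, the thresholds, and the `restore_test` hook are all hypothetical and chosen for illustration. It flags a dump that is missing, suspiciously small (as with the empty pg_dump files), or stale, and only then hands the file to a caller-supplied restore test.

```python
import os
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class BackupCheck:
    ok: bool
    problems: List[str] = field(default_factory=list)


def check_backup(
    path: str,
    min_bytes: int = 1024,                      # assumed threshold, tune per database
    max_age_seconds: float = 24 * 3600,         # assumed: one backup per day
    restore_test: Optional[Callable[[str], bool]] = None,  # hypothetical hook
    now: Optional[float] = None,
) -> BackupCheck:
    """Return the list of reasons a backup file should not be trusted."""
    if not os.path.isfile(path):
        return BackupCheck(False, [f"{path}: file is missing"])

    problems: List[str] = []
    current = time.time() if now is None else now

    # An empty or tiny dump usually means the backup job failed quietly.
    size = os.path.getsize(path)
    if size < min_bytes:
        problems.append(f"{path}: only {size} bytes, expected at least {min_bytes}")

    # A stale file means the job has stopped producing new backups.
    age = current - os.path.getmtime(path)
    if age > max_age_seconds:
        problems.append(f"{path}: last modified {age:.0f} seconds ago")

    # The decisive step: actually restore. Skipped if the cheap checks failed.
    if restore_test is not None and not problems:
        try:
            if not restore_test(path):
                problems.append(f"{path}: restore test reported failure")
        except Exception as exc:  # a crashing restore is also a failed backup
            problems.append(f"{path}: restore test raised {exc!r}")

    return BackupCheck(not problems, problems)
```

In practice the `restore_test` callable would load the dump into a scratch PostgreSQL instance and run a sanity query against it; that part depends on the environment and is deliberately left to the caller here. The result should be reported through a channel that is itself monitored, since in this incident the failure emails were silently rejected.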
Opening excerpt (first ~120 words)
Vivian Voss. Posted on Apr 30. Originally published at vivianvoss.net.
#postmortem #backup #postgres #sysadmin
Tales from the Bare Metal, Episode 01
« Thou shalt not trust a backup thou hast not restored! »
At half past eleven on the night of Tuesday, 31 January 2017, an engineer at GitLab.com typed rm -rf on what they believed was the secondary PostgreSQL database. The terminals on their screen were visually identical, save for the hostname in the prompt.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.