WeSearch

How well does S3 checkpointing hold up when running Airflow on spot?


This article explores what actually happens when Apache Airflow runs on spot instances, using real experiments to simulate node preemption across both control plane and worker nodes. It walks through how tasks recover using retries, how S3 enables checkpointing without rerunning previous steps, and how to handle partial outputs through validation strategies like success markers. It also highlights the limitations of this approach, particularly around the Airflow metadata database, and outlines the architectural patterns required to build a fault-tolerant Airflow system on interruptible infrastructure.
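The success-marker pattern mentioned above can be sketched in a few lines. This is an illustrative toy, not the article's code: the S3 prefix is stood in for by a plain dict, and `run_step` is a hypothetical helper (in a real DAG the existence check would be something like boto3's `head_object` on `s3://<bucket>/<step>/_SUCCESS`).

```python
# Minimal sketch of S3-style checkpointing with a _SUCCESS marker.
# `outputs` is a dict standing in for an S3 prefix; names are illustrative.

def run_step(step_name, outputs, compute):
    """Run a pipeline step only if its _SUCCESS marker is absent."""
    marker = f"{step_name}/_SUCCESS"
    if marker in outputs:
        # Checkpoint hit: a prior (possibly preempted-then-retried) run
        # finished and validated this step, so reuse its output.
        return outputs[f"{step_name}/part-0"]
    result = compute()                       # otherwise (re)do the work
    outputs[f"{step_name}/part-0"] = result  # write the data first...
    outputs[marker] = ""                     # ...and the marker last, so a
    return result                            # partial write is never trusted

store = {}
first = run_step("extract", store, lambda: "rows-v1")
# A retry after preemption sees the marker and skips recomputation:
second = run_step("extract", store, lambda: "rows-v2")
print(first, second)  # both "rows-v1"
```

Writing the marker only after the data is the key detail: an interrupted task leaves partial output without a marker, so the retry treats the step as not done and reruns it instead of consuming a half-written result.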

Original article: Rackspace

Opening excerpt (first ~120 words)

Introduction

Airflow can be deployed on spot instances to significantly reduce infrastructure costs. Based on current Rackspace Spot pricing, two spot instances could cost around $1.44 per month, while a single on-demand instance with similar CPU and memory specifications comes to roughly $21 per month. That difference largely comes from how the Rackspace Spot auction-based market works. Pricing is driven by competitive bids, which allows users to access unused capacity at much lower prices, in some cases as low as $0.001 per hour. You can find more context in this article on spot instance history and market dynamics. This auction-based market maintains preemption rates below 1%, meaning interruptions tend to be infrequent.

Excerpt limited to ~120 words for fair-use compliance. The full article is at Rackspace.
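The pricing comparison in the excerpt works out as a simple back-of-envelope calculation, assuming the quoted $0.001/hour auction floor and a 30-day month (both figures from the excerpt; the exact rate varies with bidding):

```python
# Back-of-envelope check of the spot vs on-demand pricing claim.
HOURS_PER_MONTH = 24 * 30              # 720 hours in a 30-day month

spot_rate = 0.001                      # $/hour, auction floor quoted above
spot_monthly_two = spot_rate * HOURS_PER_MONTH * 2   # two spot instances
on_demand_monthly = 21.0               # $ for one comparable on-demand instance

print(f"two spot instances: ${spot_monthly_two:.2f}/month")   # $1.44/month
print(f"one on-demand:      ${on_demand_monthly:.2f}/month")
print(f"cost ratio:         ~{on_demand_monthly / spot_monthly_two:.0f}x")
```

So the article's $1.44 figure is just the floor rate compounded over a month for two machines, roughly a 15x saving over the on-demand comparison.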

