WeSearch

Decoupled DiLoCo: Resilient, Distributed AI Training at Scale

Arthur Douillard and the DiLoCo team · 4 min read

Google’s new distributed architecture keeps AI training runs on track across distant data centers, with exceptional efficiency – even when hardware fails.

Original article
Google DeepMind · Arthur Douillard and the DiLoCo team

April 23, 2026 · Research
Decoupled DiLoCo: A new frontier for resilient, distributed AI training
Arthur Douillard and the DiLoCo team

Our new distributed architecture helps to train LLMs across distant data centers, with lower bandwidth and more hardware resiliency.

Training a frontier AI model traditionally depends on a large, tightly coupled system in which identical chips must stay in near-perfect synchronization. This approach is highly effective for today’s state-of-the-art models, but as we look toward future generations of scale, maintaining this level of synchronization across thousands of chips becomes a significant logistical challenge.

Today, in a new paper, we are excited to share a new approach to this problem, called Decoupled DiLoCo (Distributed Low-Communication). By dividing large training runs across decoupled “islands” of compute, with asynchronous data flowing between them, this architecture isolates local disruptions so that other parts of the system can keep learning efficiently.

The result is a more resilient and flexible way to train advanced models across globally distributed data centers. And crucially, Decoupled DiLoCo does not suffer the communication delays that made previous distributed methods like Data-Parallel impractical at global scale.

As frontier models continue to grow in scale and complexity, we’re exploring diverse approaches to train models across more compute, locations and varied hardware.

Figure 1: Decoupling training runs into separate “islands” of compute (learner units) allows largely uninterrupted training despite the same level of hardware failures, because the effects of those failures are isolated.

Developing more fault-tolerant asynchronous training at scale

Decoupled DiLoCo builds on two earlier advances: Pathways, which introduced a distributed AI system based on asynchronous data flow, and DiLoCo, which dramatically reduced the bandwidth required between distributed data centers, making it practical to train large language models across distant locations.

Decoupled DiLoCo brings those ideas together to train AI models more flexibly at scale. Built on top of Pathways, it enables asynchronous training across separate islands of compute (known as learner units), so that a chip failure in one area doesn’t interrupt the progress of the others.

This infrastructure is also self-healing. In testing, we used a method called…
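To make the "islands of compute" idea concrete, here is a minimal toy sketch of DiLoCo-style training dynamics: each island takes many local optimizer steps, then only a small parameter delta is exchanged once per round, and an island that fails simply misses a round while the others keep learning. This is an illustration under assumptions, not DeepMind's actual implementation or API: the names (Island, INNER_STEPS, OUTER_LR), the failure model, and the use of plain SGD with simple delta averaging are all stand-ins for the real inner and outer optimizers and the Pathways runtime.

```python
# Toy sketch of DiLoCo-style training (illustrative only, not the real system):
# islands run many local steps, then communicate one small delta per round.
# A failed island is skipped for that round; the rest keep making progress.
import random

INNER_STEPS = 50    # local steps between communications (low bandwidth: 1 sync per round)
ROUNDS = 20
NUM_ISLANDS = 4
OUTER_LR = 0.7      # step size applied to the averaged outer delta
FAIL_PROB = 0.2     # chance an island drops out of a given round

def grad(w):
    """Toy loss gradient for f(w) = (w - 3)^2, standing in for a real model."""
    return 2.0 * (w - 3.0)

class Island:
    """One decoupled 'learner unit' holding its own local copy of the parameters."""
    def __init__(self, global_w):
        self.w = global_w

    def inner_train(self, global_w, lr=0.01):
        """Run INNER_STEPS of local SGD from the shared parameters,
        then return the outer delta (how far this island moved)."""
        self.w = global_w
        for _ in range(INNER_STEPS):
            self.w -= lr * grad(self.w)
        return self.w - global_w

global_w = 10.0
islands = [Island(global_w) for _ in range(NUM_ISLANDS)]

for r in range(ROUNDS):
    deltas = []
    for isl in islands:
        if random.random() < FAIL_PROB:
            continue                     # hardware failure: this island skips the round
        deltas.append(isl.inner_train(global_w))
    if deltas:                           # outer update: average the surviving deltas
        global_w += OUTER_LR * sum(deltas) / len(deltas)
    print(f"round {r:2d}: global_w = {global_w:.4f}  (surviving islands: {len(deltas)})")
```

Running the sketch shows the shared parameters still drifting toward the optimum even when one or more islands drop out of a round, which is the resilience property the article describes; the real system additionally makes this asynchronous and self-healing, which the toy loop does not model.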

This excerpt is published under fair use for community discussion. Read the full article at Google DeepMind.

