Porting a Scratch-Built 500M LLM Training Pipeline to ROCm on Strix Halo
The README describes a port of a scratch-built 500M-parameter LLM training pipeline to AMD's ROCm platform, targeting the Strix Halo APU. The port needed only minimal code changes thanks to PyTorch's ROCm support, but training on Strix Halo remains slow, taking approximately three weeks. The pipeline covers data preprocessing, training, and fine-tuning, and is containerized for easier deployment.
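One reason such a port can need so few changes: ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` namespace that CUDA builds use, and set `torch.version.hip` to identify themselves. The sketch below is not taken from the repository; `describe_backend` is a hypothetical helper that only shows how a script might report which backend it is running on.

```python
def describe_backend() -> str:
    """Report which PyTorch backend is active (hypothetical helper, not from the repo)."""
    try:
        import torch
    except ImportError:
        return "torch not installed"

    # On ROCm builds, torch.cuda.is_available() is True when an AMD GPU is usable,
    # so device="cuda" code paths generally run unchanged.
    if not torch.cuda.is_available():
        return "cpu"

    # torch.version.hip is a version string on ROCm builds and None on CUDA builds.
    hip_version = getattr(torch.version, "hip", None)
    if hip_version:
        return f"rocm {hip_version}"
    return f"cuda {torch.version.cuda}"


if __name__ == "__main__":
    print(describe_backend())
```

On a working ROCm install this should print a string beginning with `rocm`; the exact version depends on the installed build.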
- The 1386.ai repository has been ported to ROCm, enabling compatibility with AMD's Strix Halo APU and other ROCm-supported hardware.
- Training a 500M-parameter model on the 128 GB Strix Halo APU takes about three weeks, achieving roughly 4,750 tokens per second.
- Minimal PyTorch code modifications were required, highlighting the maturity of PyTorch's ROCm backend for LLM training.
- Custom optimizations like fused operators or kernel-level changes may be needed to significantly improve performance beyond current levels.
- A Dockerfile and helper script are included to simplify environment setup and avoid host system disruptions from ROCm installations.
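To put the throughput and duration figures above in perspective, the arithmetic can be sketched as follows. This assumes the roughly 4,750 tokens per second is sustained for a full 21 days, which is an approximation; the summary does not state the actual token budget, and the function names are illustrative only.

```python
SECONDS_PER_DAY = 86_400


def tokens_processed(tokens_per_second: int, days: int) -> int:
    """Total tokens seen if the given throughput is sustained for `days` days."""
    return tokens_per_second * days * SECONDS_PER_DAY


def days_to_train(total_tokens: int, tokens_per_second: int) -> float:
    """Wall-clock days needed to process `total_tokens` at a fixed throughput."""
    return total_tokens / (tokens_per_second * SECONDS_PER_DAY)


if __name__ == "__main__":
    total = tokens_processed(4_750, 21)
    print(f"{total:,} tokens")  # 8,618,400,000 tokens
    print(f"{days_to_train(total, 4_750):.1f} days")
```

Under that assumption, three weeks corresponds to about 8.6 billion tokens, or roughly 17 tokens per parameter for a 500M-parameter model.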
Opening excerpt (first ~120 words)
1386.ai.rocm
This is a fork of 1386.ai ported to ROCm, specifically targeting the AMD Strix Halo APU but compatible with any ROCm-supported hardware. I found this repo through a Reddit post where the author (@eb1386) nonchalantly announced it after training a 235M-parameter model. Unlike most toy LLM implementations, this one is end-to-end: data prep, training, and fine-tuning are all included. The code is clean and accessible, making it an excellent reference for small-model training. Sadly, the author has deleted their original post and comments, but you can see others' feedback here. Regarding ROCm support on Strix Halo, there's good news and bad news.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at GitHub.