Porting a Scratch-Built 500M LLM Training Pipeline to ROCm on Strix Halo

Apr 28, 2026 · 5:39 AM UTC

A lightweight transformer language model built from scratch in PyTorch, trained on a single consumer GPU with a full pipeline for data processing, pretraining, and instruction tuning. - epscylonb/1...


1386.ai.rocm

This is a fork of 1386.ai ported to ROCm, targeting specifically the AMD Strix Halo APU but compatible with any ROCm-supported hardware.

I found this repo through a Reddit post where the author (@eb1386) nonchalantly announced it after training a 235M-parameter model. Unlike most toy LLM implementations, this one is end-to-end: data prep, training, and fine-tuning are all included. The code is clean and accessible, making it an excellent reference for small-model training. Sadly, the author has deleted their original post and comments, but you can see others' feedback here.

Regarding ROCm support on Strix Halo, there's good news and bad news.

The good news: despite ROCm's reputation for lagging behind CUDA, virtually no PyTorch-specific code changes were needed to train a 500M-parameter model here. PyTorch's ROCm backend is genuinely solid.

The bad news: training a 500M-parameter model on the 128 GB Strix Halo APU (in a GMKTec Evo X2 mini PC) will take roughly three weeks. I'm seeing ~4,750 tokens/s, and there's likely not much low-hanging fruit left without writing custom GPU kernels or doing deeper fused-operator optimizations.
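As a rough sanity check on the three-week estimate, the token budget implied by the measured throughput can be computed directly. This is a back-of-the-envelope sketch: it assumes the ~4,750 tokens/s rate holds for the entire run and that "three weeks" means 21 full days of uninterrupted training. The function names are illustrative and do not come from the repository.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tokens_processed(tokens_per_second, days):
    """Total tokens seen at a constant throughput over `days` of wall-clock time."""
    return tokens_per_second * days * SECONDS_PER_DAY


def days_needed(total_tokens, tokens_per_second):
    """Wall-clock days required to process `total_tokens` at a constant throughput."""
    return total_tokens / (tokens_per_second * SECONDS_PER_DAY)


if __name__ == "__main__":
    # Assumed inputs: the throughput reported above and a 21-day run.
    total = tokens_processed(4750, 21)
    print(f"{total / 1e9:.2f}B tokens in 21 days")
```

Under those assumptions the run covers about 8.6 billion tokens. The actual token budget of the training configuration is not stated in this excerpt, so treat the figure as an order-of-magnitude check only.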
Summary of Changes

- dataset.py
  - The original author omitted the ShardDataset and StreamingShardDataset classes, so I have naively implemented them.
  - Random shuffling of training data has been added to ensure that the model isn't trained on previously seen data when resuming training from a checkpoint.
- torch.compile
  - Added to increase training performance.
- Training workers changed from 2 to 0 (running on the main thread)
  - Couldn't get training to start using workers.
- Added a Dockerfile and run-docker.sh helper script
  - ROCm drivers and libraries are notoriously difficult to install, configure, and maintain.
  - Using a container avoids breaking the host with bad installs and config.
  - Using the latest image from https://hub.docker.com/r/rocm/pytorch/tags

Quick Start on Strix Halo

Consider editing the ENV vars in the run-docker.sh script to match your hardware and Hugging Face config.

    # Build the image (base is > 6 GB)
    docker build -t 1386-rocm .

    # Run an interactive session:
    bash run-docker.sh

Inside the container, follow the original instructions to download data and begin training.

What follows is the original readme from the forked repo.

1386.ai

A lightweight transformer language model built from scratch in PyTorch, trained on a single consumer GPU with a full pipeline for data processing, pretraining, and instruction tuning.

No pretrained weights, no HuggingFace model downloads. Every weight is learned from raw text on a single RTX 5080 using bf16 mixed precision with gradient checkpointing. The training infrastructure handles everything from data download through evaluation.

The current release is Plasma 1.0 (235M parameters). Plasma 1.1 (500M parameters, multi-turn conversation support, upgraded data pipeline) is in development.

Architecture

The model follows the LLaMA architecture with modern training techniques throughout. Attention uses Grouped-Query Attention (GQA) with query heads mapped to fewer key-value heads, reducing memory bandwidth during inference while maintaining quality.
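The head mapping behind GQA can be illustrated without any framework. The sketch below assumes query heads are split into equal, contiguous groups that each share one key-value head, which is the usual LLaMA-style layout. The head counts, layer count, and dimensions are made-up values for illustration and are not the Plasma configuration; the function names are hypothetical as well.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the key-value head shared by the given query head."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV-cache size for one sequence; the leading 2 counts keys plus values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Illustrative sizes only: 8 query heads sharing 2 key-value heads.
    n_q, n_kv = 8, 2
    print([kv_head_for(h, n_q, n_kv) for h in range(n_q)])

    # Compare against a layout where every query head has its own KV head.
    full = kv_cache_bytes(n_layers=12, n_kv_heads=n_q, head_dim=64, seq_len=1024)
    grouped = kv_cache_bytes(n_layers=12, n_kv_heads=n_kv, head_dim=64, seq_len=1024)
    print(f"cache is {full // grouped}x smaller with grouped heads")
```

With these example numbers, four query heads read each cached key-value head, so the cache that must be streamed from memory at every decoding step shrinks by the same factor of four. That is the bandwidth saving the paragraph above refers to.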
All positional information comes from Rotary Positional Embeddings (RoPE), encoding position directly into the attention computation rather than through learned position embeddings. KV caching is supported for fast autoregressive generation. Feed-forward layers use SwiGLU, a gated activation function that replaces the traditional ReLU MLP. SwiGLU uses three…
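Returning to the rotary embeddings described above, the following is a minimal pure-Python sketch of the idea. It assumes the interleaved-pair convention and a frequency base of 10000 from the original RoPE formulation; the repository's implementation may pair dimensions differently or use another base, and the function names here are illustrative.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (v0, v1), (v2, v3), ... by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even head dimension")
    out = []
    for i in range(0, dim, 2):
        # Lower-index pairs rotate faster; higher-index pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 0.5, -0.3, 2.0]
    k = [0.2, -1.0, 0.7, 0.1]
    # Both pairs of positions are 2 apart, so the scores should agree.
    near = dot(rope(q, 5), rope(k, 3))
    far = dot(rope(q, 12), rope(k, 10))
    print(abs(near - far) < 1e-9)
```

Because each pair is rotated by an angle proportional to its position, the score between a rotated query and a rotated key depends only on the distance between their positions. This is how position enters the attention computation without a learned embedding table.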

This excerpt is published under fair use for community discussion. Read the full article at GitHub.
