The Apple Neural Engine Inference Book
The Apple Neural Engine Inference Book is a practitioner's guide to production inference on the Apple Neural Engine (ANE). It covers CoreML, Swift runtimes, ANE-only residency checks, and validated model manifests, with chapters on empirical ANE rules, a GGUF-to-CoreML porting recipe, quantization, shard sizing, the stateful KV cache, and more.
- The book provides a detailed overview of production inference on the Apple Neural Engine.
- It includes chapters on topics such as quantization, shard sizing, and stateful KV cache.
- The source code and additional resources are available in the ane-book repository.
Opening excerpt (first ~120 words)
The Apple Neural Engine Inference Book
A practitioner’s guide to production inference on the Apple Neural Engine with CoreML, Swift runtimes, ANE-only residency checks, and validated model manifests.
By Alvaro Videla - @old_sound

Chapters

| Chapter | Topic |
| --- | --- |
| 00 - Modern Inference | Tokens, prefill/decode, KV cache, ANE vs GPU vs CPU, the Conv2d trick |
| 01 - ANE Laws | Empirical rules: shard limits, quantization, residency |
| 02 - Porting Recipe | GGUF to CoreML, step by step |
| 03 - Quantization | INT8 production, INT4 tradeoffs, the silent CPU fallback |
| 04 - Shard Sizing | Layer count vs size, 250 MB limit, LM-head splits |
| 05 - Stateful KV Cache | MLState, Swift daemon design, decode loop |
| 06 - RangeDim + Speculative | Variable T, n-gram acceptance |
| 07 - MoE on ANE | Soft routing, per-expert dispatch, ZAYA and Privacy… |
Excerpt limited to ~120 words for fair-use compliance. The full article is at Alvaro-videla.