Xiaomi open-sources MiMo-V2.5: 311B A15B 1M-context omnimodal model
Xiaomi has open-sourced MiMo-V2.5, a 311-billion-parameter omnimodal AI model with 15 billion activated parameters and support for up to 1 million tokens of context. The model integrates text, image, video, and audio understanding within a unified architecture and features hybrid attention, multi-token prediction, and efficient FP8 training. It is designed for strong performance in multimodal reasoning, long-context tasks, and agentic workflows.
- MiMo-V2.5 is a sparse Mixture of Experts (MoE) model with 311 billion total parameters and 15 billion activated parameters.
- The model supports a context length of up to 1 million tokens and uses a hybrid attention mechanism to reduce KV-cache storage.
- It includes a 729M-parameter Vision Transformer and a dedicated audio encoder for native multimodal understanding.
- MiMo-V2.5 was trained on approximately 48 trillion tokens using FP8 mixed precision.
- The model incorporates Multi-Token Prediction and agentic reinforcement learning for improved inference and task performance.
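
To give a feel for why sparse activation and hybrid attention matter at a 1-million-token context, here is a rough back-of-the-envelope sketch in Python. Only the 311B total, 15B activated, and 1M-token figures come from the summary above. The layer count, the ratio of full-attention to sliding-window layers, the window size, the KV head layout, and the cache precision are all hypothetical placeholders, and treating "hybrid attention" as a mix of full and sliding-window layers is itself an assumption rather than something stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical layer layout. None of these values come from the model card."""
    num_layers: int = 48        # total transformer layers (assumed)
    global_every: int = 6       # one full-attention layer per this many layers (assumed)
    window: int = 128           # sliding-window size in tokens (assumed)
    num_kv_heads: int = 8       # KV heads per layer (assumed)
    head_dim: int = 128         # dimension per head (assumed)
    bytes_per_value: int = 1    # 1 byte for an FP8 cache, 2 for BF16 (assumed)


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the parameters that are activated for a single token in an MoE model."""
    return active_params / total_params


def kv_cache_bytes(cfg: AttentionConfig, context_len: int, hybrid: bool = True) -> int:
    """Estimate KV-cache size for one sequence of `context_len` tokens.

    Full-attention layers cache every token. Sliding-window layers only
    need to keep the most recent `cfg.window` tokens.
    """
    # Bytes for one token in one layer: keys plus values.
    per_token = 2 * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_value
    if not hybrid:
        return cfg.num_layers * context_len * per_token
    global_layers = cfg.num_layers // cfg.global_every
    local_layers = cfg.num_layers - global_layers
    local_tokens = min(cfg.window, context_len)
    return (global_layers * context_len + local_layers * local_tokens) * per_token


if __name__ == "__main__":
    cfg = AttentionConfig()
    context = 1_000_000  # context length from the summary above

    share = active_fraction(311e9, 15e9)  # parameter counts from the summary above
    full = kv_cache_bytes(cfg, context, hybrid=False)
    mixed = kv_cache_bytes(cfg, context, hybrid=True)

    print(f"activated share of parameters: {share:.1%}")
    print(f"full-attention KV cache:   {full / 1e9:.1f} GB")
    print(f"hybrid-attention KV cache: {mixed / 1e9:.1f} GB")
    print(f"reduction factor:          {full / mixed:.1f}x")
```

Under these placeholder settings, almost all of the saving comes from the sliding-window layers, whose cache stops growing once the context exceeds the window. The real reduction depends on the actual layer layout, which this summary does not specify.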
The full model card is available on Hugging Face (XiaomiMiMo/MiMo-V2.5, MIT license).