Xiaomi open-sources MiMo-V2.5: 311B A15B 1M-context omnimodal model
# MiMo-V2.5

| 🤗 HuggingFace | 📰 Blog | 🎨 Xiaomi MiMo API Platform | 🗨️ Xiaomi MiMo Studio | Community WeChat Group | Discord | Telegram | Reddit

License: MIT

**Contents:** 1. Introduction (Model Summary) · 2. Downloads · 3. Evaluation Results (Multimodal, Coding & Agent, Long Context Benchmarks) · 4. Model Architecture (LLM Backbone, Vision Encoder, Audio Encoder) · 5. Training Process · 6. Deployment (SGLang, vLLM) · Citation · Contact

## 1. Introduction

MiMo-V2.5 is a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture. Built upon the MiMo-V2-Flash backbone and extended with dedicated vision and audio encoders, it delivers robust performance across multimodal perception, long-context reasoning, and agentic workflows. Key features include:

- **Hybrid Attention Architecture:** Inherits the hybrid design from MiMo-V2-Flash, interleaving Sliding Window Attention (SWA) and Global Attention (GA) at a 5:1 ratio with a 128-token sliding window. This reduces KV-cache storage by nearly 6× while maintaining long-context performance via a learnable attention sink bias (a back-of-the-envelope check appears at the end of this excerpt).
- **Native Omnimodal Encoders:** Equipped with a 729M-param Vision Transformer (ViT) featuring hybrid window attention and a dedicated audio encoder initialized from the weights of MiMo-Audio, enabling high-quality image, video, and audio understanding.
- **Multi-Token Prediction (MTP):** Three lightweight MTP modules with dense FFNs accelerate inference via speculative decoding and improve RL training efficiency (a toy decoding loop appears at the end of this excerpt).
- **Efficient Pre-Training:** Trained on a total of ~48T tokens using FP8 mixed precision. The context window supports up to 1M tokens.
- **Agentic Capabilities:** Post-training incorporates SFT, large-scale agentic RL, and Multi-Teacher On-Policy Distillation (MOPD), achieving strong performance on agentic tasks and multimodal understanding benchmarks.

### Model Summary

- **Architecture:** Sparse MoE (Mixture of Experts), 310B total / 15B activated parameters
- **Context Length:** Up to 1M tokens
- **Modalities:** Text, Image, Video, Audio
- **Vision Encoder:** 729M-param ViT (28 layers: 24 SWA + 4 Full)
- **Audio Encoder:** 261M-param Audio Transformer (24 layers: 12 SWA + 12 Full)
- **Multi-Token Prediction (MTP):** 329M parameters, 3 layers

## 2. Downloads

| Model | Context Length | Download |
| --- | --- | --- |
| MiMo-V2.5-Base | 256K | 🤗 HuggingFace · 🤖 ModelScope |
| MiMo-V2.5 | 1M | 🤗 HuggingFace · 🤖 ModelScope |

## 3. Evaluation Results

### Multimodal Benchmarks

### Coding & Agent Benchmarks

### Long Context Benchmarks

## 4. Model Architecture

### LLM Backbone

MiMo-V2.5's core language backbone inherits from the MiMo-V2-Flash architecture, a sparse MoE model with hybrid sliding window attention.

| Component | MiMo-V2.5-Pro | MiMo-V2.5 |
| --- | --- | --- |
| Total Parameters | 1.02T | 310B |
| Activated Parameters | 42B | 15B |
| Hidden Size | 6144 | 4096 |
| Num Layers | 70 (1 dense + 69 MoE) | 48 (1 dense + 47 MoE) |
| Full Attention Layers | 10 | 9 |
| SWA Layers | 60 | 39 |
| Num Attention Heads | 128 | 64 |
| Num KV Heads | 8 (GQA) | 8 (GA) / 4 (SWA) |
| Head Dim (QK / V) | 192 / 128 | 192 / 128 |
| Routed Experts | 384 | 256 |
| Experts per Token | 8 | 8 |
| MoE Intermediate Size | 2048 | 2048 |
| Dense Intermediate Size | 16384 (layer 0 only) | 16384 (layer 0 only) |
| SWA Window Size | 128 | 128 |
| Max Context Length | 1M | 1M |
| MTP Layers | 3 | 3 |

### Vision Encoder

We train a dedicated MiMo ViT that adopts…
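As a back-of-the-envelope check on the "nearly 6×" KV-cache claim in §1, here is a minimal sketch using the MiMo-V2.5 column of the architecture table above (9 GA layers with 8 KV heads, 39 SWA layers with 4 KV heads, 128-token window, 192/128 QK/V head dims). The all-GA baseline used for comparison is an assumption for illustration, not something the card spells out.

```python
# Back-of-the-envelope KV-cache arithmetic for the hybrid SWA/GA design.
# Numbers come from the MiMo-V2.5 column of the architecture table; the
# all-global baseline (every layer GA with 8 KV heads) is an assumed
# comparison point, not stated in the card.

def kv_cache_elements(ga_layers, swa_layers, ga_kv_heads, swa_kv_heads,
                      seq_len, window=128, k_dim=192, v_dim=128):
    """Total cached K+V elements for one sequence of length seq_len."""
    per_token = k_dim + v_dim
    ga = ga_layers * ga_kv_heads * seq_len * per_token                  # GA layers cache every position
    swa = swa_layers * swa_kv_heads * min(seq_len, window) * per_token  # SWA layers keep only the window
    return ga + swa

seq_len = 1_000_000  # 1M-token context
hybrid = kv_cache_elements(ga_layers=9, swa_layers=39,
                           ga_kv_heads=8, swa_kv_heads=4, seq_len=seq_len)
all_global = kv_cache_elements(ga_layers=48, swa_layers=0,
                               ga_kv_heads=8, swa_kv_heads=0, seq_len=seq_len)
print(f"reduction: {all_global / hybrid:.1f}x")  # ~5.3x, i.e. "nearly 6x"
```

At a 1M-token context the SWA layers' contribution is negligible, so the ratio is dominated by 48 global layers versus 9, roughly 5.3×, consistent with the card's "nearly 6×" figure.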
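The MTP bullet in §1 describes speculative decoding only at a high level. The toy loop below illustrates the draft-and-verify mechanism that three MTP draft tokens enable: the cheap heads propose a few tokens, the backbone verifies them in one pass, and the accepted prefix plus one bonus token are emitted. The deterministic `target_next`/`draft_next` functions are stand-ins for illustration, not MiMo's actual heads.

```python
# Minimal sketch of greedy draft-and-verify speculative decoding with a
# 3-token draft, matching one token per MTP module. Toy models only.

DRAFT_LEN = 3  # one draft token per MTP module

def target_next(ctx):  # stand-in for the backbone's greedy next token
    return (ctx[-1] * 31 + 7) % 100

def draft_next(ctx):   # stand-in for an MTP head (imperfect on purpose)
    t = target_next(ctx)
    return t if ctx[-1] % 5 else (t + 1) % 100

def speculative_step(ctx):
    # 1) Draft: MTP heads propose DRAFT_LEN tokens autoregressively (cheap).
    draft = []
    for _ in range(DRAFT_LEN):
        draft.append(draft_next(ctx + draft))
    # 2) Verify: the backbone scores all draft positions in a single pass
    #    and accepts the longest prefix matching its own greedy choices.
    accepted = []
    for i, tok in enumerate(draft):
        if target_next(ctx + draft[:i]) == tok:
            accepted.append(tok)
        else:
            break
    # 3) The verify pass always yields one correct "bonus" token for free.
    accepted.append(target_next(ctx + accepted))
    return accepted

ctx = [1]
while len(ctx) < 30:
    ctx += speculative_step(ctx)
print(ctx)  # identical to pure greedy decoding, fewer backbone passes
```

The output is bit-identical to plain greedy decoding with the target model; the speedup comes from emitting up to four tokens per backbone verify pass when the drafts are accepted.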
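For the Downloads table in §2, a minimal fetch sketch with `huggingface_hub`. The repo id `XiaomiMiMo/MiMo-V2.5` matches the page header; the `-Base` repo id and `local_dir` are assumptions.

```python
# Fetch the weights from the Hugging Face Hub. The ModelScope mirrors listed
# in the Downloads table are an alternative.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="XiaomiMiMo/MiMo-V2.5",         # 1M-context model (from page header)
    # repo_id="XiaomiMiMo/MiMo-V2.5-Base",  # 256K base model (assumed repo id)
    local_dir="MiMo-V2.5",                  # arbitrary local path
)
```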