Step 3.7 Flash – Open-source multimodal model for speed and agents
The Step 3.7 Flash model has been introduced as a high-efficiency open-source multimodal model designed for real-world agents. It features native multimodal understanding, enhanced web and visual search capabilities, and compatibility with various agent ecosystems. This model aims to improve agent efficiency and reduce integration costs for developers.
- ▪Step 3.7 Flash is capable of understanding images and writing code or calling tools based on visual input.
- ▪The model enhances web search by accessing more sources and recognizing long-tail entities.
- ▪It is compatible with mainstream agent harnesses, which lowers integration costs and simplifies workflows.
Opening excerpt (first ~120 words) tap to expand
2026-05-29 Step 3.7 Flash The new frontier is agent efficiency. A high-efficiency Flash model for real-world agents. Multimodal Understanding & Action|Web & Visual Search Enhancement|Reliable Tool Use & Orchestration|Agent Ecosystem Compatibility GitHub HuggingFace ModelScope Key Features Native Multimodal Understanding & Acting Understands images across the full range — product UIs, documents, charts, and natural scenes — then writes code or calls tools to act on what it sees. Web & Visual Search Enhancement Web search reaches further — more sources, deeper follow-up. Visual search recognizes what other systems don't — long-tail entities, freshly emerged concepts.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Stepfun.