Model Release · Industry

Xiaomi ships MiMo-V2.5-Pro with text, image, audio, and video in a single multimodal model

By Harsh Desai

TL;DR

Xiaomi released MiMo-V2.5-Pro on April 22, 2026 alongside the base MiMo-V2.5. The Pro variant combines text, image, audio, and video in a single multimodal model, matching frontier benchmark quality at significantly lower token cost. It is available in public beta, five weeks after MiMo-V2-Pro launched.

What shipped

Xiaomi launched two models on April 22, 2026:

  1. MiMo-V2.5: the base multimodal model.
  2. MiMo-V2.5-Pro: the flagship, combining text, image, audio, and video in a single unified architecture.

The Pro variant is the interesting one: rather than stitching together a text model with a vision adapter and an audio encoder (the common approach), MiMo-V2.5-Pro handles all four modalities natively in one model.

Benchmark positioning

Xiaomi's claim: MiMo-V2.5-Pro matches the benchmarks of frontier closed-model competitors (GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro) while running at a significantly lower token cost.

Key specs:

  • Vision: image understanding at 1024x1024 base resolution, comparable to Gemini 3.1 Pro.
  • Audio: native speech-to-text plus audio understanding (noise recognition, speaker identification).
  • Video: frame-level understanding plus temporal reasoning.
  • Text: strong coding and reasoning benchmarks; usable as a general-purpose text model.

Pricing edge

"Significantly lower token cost" is Xiaomi's specific framing. Exact pricing comparison varies by use case, but early third-party analysis suggests MiMo-V2.5-Pro delivers roughly 60-70% the capability of Gemini 3.1 Pro at 30-40% of the token cost. For high-volume workloads, this is a material difference.
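The capability-versus-cost tradeoff is easier to see as cost paid per unit of capability. A minimal sketch, using the midpoints of the third-party ranges cited above (the figures are illustrative, not verified pricing):

```python
def cost_per_capability(relative_capability: float, relative_cost: float) -> float:
    """Token cost paid per unit of capability, normalized to a baseline of 1.0."""
    return relative_cost / relative_capability

# Baseline: Gemini 3.1 Pro at 1.0 capability for 1.0 cost.
baseline = cost_per_capability(1.0, 1.0)

# MiMo-V2.5-Pro at the midpoints of the claimed ranges:
# ~65% of the capability for ~35% of the token cost.
mimo = cost_per_capability(0.65, 0.35)

print(f"baseline: {baseline:.2f}, MiMo: {mimo:.2f}")
# If the claims hold, MiMo pays roughly half the token cost per
# unit of capability, which is why high-volume workloads care.
```

The ratio only matters for workloads where 60-70% of frontier capability is enough; below that threshold, cost efficiency is irrelevant.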

Xiaomi's AI strategy

MiMo-V2 launched five weeks before this release (March 2026). That pace (major model iterations on a roughly five-week cycle) matches the cadence of the Chinese frontier labs (Qwen, DeepSeek, Moonshot) and outpaces most Western labs on raw iteration speed.

MiMo is the AI substrate Xiaomi is building into its consumer products: phones, cars, smart home. Releasing the model publicly is partly distribution strategy (developers adopt it, integrate it into products) and partly benchmark signalling (prove Xiaomi is not just a hardware company).

Public beta availability

MiMo-V2.5-Pro is in public beta as of the April 22 release. Access is via Xiaomi's AI portal and requires a free account. Production-ready API pricing will follow beta graduation.

Who this matters for

  • Vibe Builder: Multimodal agent that understands screenshots plus voice plus video natively, at lower token cost than Gemini 3.1 Pro. Test on your actual workflow during the beta.
  • Basic User: Access via Xiaomi's AI portal with a free account. If you are in China, integration into phones and cars is coming soon.
  • Developer: Unified multimodal architecture means one model instead of four stitched together. Benchmark comparison vs Gemini 3.1 Pro at 30-40% of token cost is the deployment-economics signal to verify.
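The "one model instead of four stitched together" point is concrete at the request level. A hypothetical sketch of what a unified request shape could look like (the endpoint, model identifier, and field names are assumptions for illustration; Xiaomi had not published an API spec at beta launch):

```python
import json

# Hypothetical unified multimodal request: all four modalities travel in
# ONE request to ONE model. Field names ("modality", "content") and the
# model id are illustrative, not Xiaomi's actual API.
payload = {
    "model": "mimo-v2.5-pro",  # assumed identifier
    "inputs": [
        {"modality": "text",  "content": "What is happening in this clip?"},
        {"modality": "image", "content": "<base64 screenshot>"},
        {"modality": "audio", "content": "<base64 voice note>"},
        {"modality": "video", "content": "<base64 clip>"},
    ],
}

# The stitched-together alternative would be four separate calls (vision
# adapter, ASR, video pipeline, text model) plus hand-written glue to
# merge their outputs before the text model can reason across them.
print(json.dumps(payload, indent=2))
```

The practical difference for agent builders is that cross-modal context (the voice note referring to something on screen in the video) stays inside one forward pass instead of being reconstructed from four independent transcriptions.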

What to watch next

The unified multimodal architecture is the part that most coverage will underplay. Handling text, image, audio, and video in a single model rather than four stitched-together components means cross-modal reasoning is a native capability, not something approximated by gluing separate models' outputs together. For anyone building agents that process screenshots plus voice plus text plus video, this changes what is practical.

For vibe builders, the relevant question is whether MiMo-V2.5-Pro's real-world quality matches its benchmark claims. Xiaomi is five weeks from its previous release, moving fast, and the benchmark positioning is aggressive. The 5-week cadence means these models improve fast, but individual releases sometimes ship with rough edges. Public beta is the right stage to evaluate without committing.

Pricing is the adoption lever. If "30-40% of Gemini 3.1 Pro's cost at 60-70% of the capability" holds in independent benchmarking, that is a material advantage for high-volume workloads where absolute frontier capability is not required. Agent workflows that run at scale (thousands of tool calls per day per user) care about cost per inference far more than about top-benchmark position.

Xiaomi's distribution angle is interesting. Most Chinese frontier labs are standalone AI companies (Moonshot, MiniMax, Z.AI, DeepSeek). Xiaomi is primarily a hardware and consumer-electronics company that happens to have a frontier AI lab. That gives MiMo access to real consumer deployment (phones, cars, smart home) that pure-AI labs do not have. Whether that turns into a durable advantage depends on whether consumer integration data flows back into model training as a signal.

Public beta plus "frontier benchmarks at lower cost" is the right positioning for agency and indie-developer adoption. Budget-constrained teams that would not choose Gemini 3.1 Pro on price will try MiMo-V2.5-Pro. If it delivers, Xiaomi wins the bottom-up adoption pattern.


Source: marktechpost.com
