Models

Microsoft Releases Phi-4-Reasoning-Vision-15B: A Small Model That Knows When to Think

Michael Ouroumis · 2 min read

Microsoft has released Phi-4-reasoning-vision-15B, a compact multimodal AI model that introduces a novel capability most competitors lack: the ability to decide for itself when deep reasoning is worth the effort.

The model, available as open weights on Hugging Face and Microsoft Foundry, represents a significant step forward in making powerful AI reasoning accessible without requiring massive infrastructure.

A Model That Chooses When to Think

Most reasoning models apply chain-of-thought processing to every query, regardless of complexity. Microsoft's research team recognized that this is often counterproductive: for straightforward tasks like image captioning or reading a receipt, extended reasoning can actually degrade performance.

Phi-4-reasoning-vision ships as what Microsoft calls a "mixed reasoning and non-reasoning model." It activates deep chain-of-thought processing for complex math and science problems while suppressing it for simpler visual tasks. This selective approach yields better results across a wider range of use cases.
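The idea of a mixed reasoning and non-reasoning model can be pictured as a gate that routes each query either through a chain-of-thought path or a direct-answer path. The sketch below is purely illustrative: the keyword heuristic and function names are invented here, while the real model learns this decision internally during generation.

```python
# Toy illustration of "mixed reasoning": trigger a deep chain-of-thought
# path only when a query looks complex. The keyword heuristic below is a
# stand-in for the model's learned gate, not Microsoft's actual mechanism.

COMPLEX_HINTS = ("prove", "solve", "derive", "how many", "integral")

def needs_reasoning(query: str) -> bool:
    """Crude complexity check standing in for the model's internal gate."""
    q = query.lower()
    return any(hint in q for hint in COMPLEX_HINTS)

def answer(query: str) -> str:
    # Complex queries get an explicit reasoning trace; simple visual
    # queries (captioning, receipt reading) are answered directly.
    if needs_reasoning(query):
        return f"<think>step-by-step analysis of: {query}</think> final answer"
    return f"direct answer to: {query}"

print(needs_reasoning("Solve for x: 3x + 5 = 20"))    # True
print(needs_reasoning("What does this receipt say?"))  # False
```

The practical payoff of such a gate is that inference cost scales with task difficulty rather than being uniformly high across all queries.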

Punching Above Its Weight

At 15 billion parameters, the model is a fraction of the size of leading alternatives. Yet its benchmark results tell a compelling story. Phi-4-reasoning-vision scores 84.8 on AI2D, 83.3 on ChartQA, 75.2 on MathVista, and 88.2 on ScreenSpot v2 — competitive with similarly sized systems and not far behind models with twice the parameter count.

Perhaps more impressive is the training efficiency. Microsoft trained the entire system on roughly 200 billion tokens of multimodal data using just 240 NVIDIA B200 GPUs over four days. That is approximately one-fifth of the training data consumed by comparable models from Alibaba's Qwen family or Google's Gemma series.
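Taken at face value, the reported figures imply a rough per-GPU training throughput. This back-of-envelope arithmetic is ours, derived only from the numbers above, not from any figure Microsoft has published:

```python
# Implied throughput from the reported numbers: 200B tokens,
# 240 GPUs, four days of training. Purely back-of-envelope arithmetic.
tokens = 200e9
gpus = 240
seconds = 4 * 24 * 3600  # four days in seconds

tokens_per_gpu_per_sec = tokens / (gpus * seconds)
print(f"{tokens_per_gpu_per_sec:,.0f} tokens/GPU/s")  # prints "2,411 tokens/GPU/s"
```

That order of magnitude is plausible for multimodal training on current datacenter accelerators, which lends some internal consistency to the reported schedule.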

Architecture and Design

Under the hood, Phi-4-reasoning-vision uses a mid-fusion architecture pairing a SigLIP-2 vision encoder with the Phi-4-Reasoning language backbone. This design allows the model to process visual and textual information in an integrated pipeline while maintaining efficiency.
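Mid-fusion means the vision encoder's features are injected partway through the language model's layer stack, rather than only at the input. The toy sketch below illustrates that data flow with random matrices; the dimensions, layer count, and fusion point are invented for illustration and do not reflect Phi-4's actual architecture.

```python
import numpy as np

# Toy mid-fusion pipeline: image-patch features from a vision encoder are
# projected into the language model's hidden size and concatenated into
# the token sequence partway up the decoder stack. All shapes and the
# fusion layer are illustrative, not Phi-4's real dimensions.

rng = np.random.default_rng(0)
D_VIS, D_LM, N_LAYERS, FUSE_AT = 64, 128, 8, 4

def block(h):
    """Stand-in for a transformer layer: random linear map + ReLU."""
    W = rng.standard_normal((h.shape[-1], h.shape[-1])) * 0.05
    return np.maximum(h @ W, 0.0)

text_h = rng.standard_normal((10, D_LM))         # 10 text-token states
vis_h = rng.standard_normal((4, D_VIS))          # 4 image-patch features
proj = rng.standard_normal((D_VIS, D_LM)) * 0.1  # vision-to-LM projection

h = text_h
for i in range(N_LAYERS):
    if i == FUSE_AT:
        # Mid-fusion point: projected vision tokens join the sequence here.
        h = np.concatenate([vis_h @ proj, h], axis=0)
    h = block(h)

print(h.shape)  # (14, 128): 4 fused vision tokens + 10 text tokens
```

Fusing mid-stack lets early layers specialize per modality while later layers attend jointly over text and image tokens, which is one common rationale for this design over input-level concatenation.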

The model handles a broad array of tasks: interpreting scientific charts, solving multi-step math problems, navigating graphical user interfaces, reading documents, and performing everyday visual recognition.

Implications for the Industry

The release continues a trend toward capable small models that can run on more modest hardware. For enterprises evaluating AI deployment, Phi-4-reasoning-vision offers a compelling trade-off between performance and computational cost.

The selective reasoning approach also points toward a broader shift in model design philosophy. Rather than building ever-larger models that apply maximum compute to every query, the field is moving toward systems that allocate resources intelligently based on task complexity — a pattern that could reshape how AI inference costs scale in production environments.


More in Models

xAI Launches Grok Voice Think Fast 1.0, Tops τ-Voice Bench and Powers Starlink Support

xAI's new voice model scored 67.3% on the τ-Voice Bench, well ahead of Gemini 3.1 Flash Live and GPT Realtime, and is now powering Starlink's phone sales and support with a 70% autonomous resolution rate.

2 days ago · 2 min read
Tencent Drops Hy3 Preview: 295B Open-Source MoE Model Kicks DeepSeek Out of Yuanbao

Tencent has open-sourced Hy3 Preview, a 295B/21B-activated mixture-of-experts model built in under three months. The Yuanbao chatbot is switching its primary engine from DeepSeek to the new in-house model.

4 days ago · 2 min read
DeepSeek V4 Preview Lands: 1.6T-Parameter Open Model With 1M Context, Flash Pricing at $0.14/M

DeepSeek on April 24 released preview versions of V4-Pro and V4-Flash, an open-weight MoE family with a 1M-token context window and pricing that undercuts Western frontier labs.

4 days ago · 2 min read