On March 2, 2026, Alibaba open-sourced Qwen3.5. The 9B version scores 81.7 on GPQA Diamond, surpassing OpenAI's GPT-OSS-120B (71.5). Despite a roughly 13x parameter gap, the smaller model wins.
Released under the Apache 2.0 license, both code and weights are available for commercial use. It runs with a single Ollama command and deploys on standard laptops.
Figure 1: Qwen3.5 Small Model Performance Comparison (Source: GitHub README)
Qwen3.5 Small Model Series
On March 2, 2026, the Alibaba Qwen team open-sourced four small Qwen3.5 models: Qwen3.5-0.8B, 2B, 4B, and 9B.
These are not "shrunk-down" versions: the series uses native multimodal training on the latest model architecture.
Figure 2: Qwen3.5 Middle Size Model Performance (Source: GitHub README)
Model Positioning
| Model | Positioning | Features | Use Cases |
|---|---|---|---|
| 0.8B/2B | Edge-first | Tiny size, ultra-fast inference | Mobile devices, IoT, real-time interaction |
| 4B | Lightweight Agent | Multimodal base | Agent core |
| 9B | Compact flagship | Compact size, exceptional performance; competes with 120B-class models | Server-side, memory-constrained deployments |
0.8B and 2B suit mobile devices and IoT edge deployment; 4B is ideal for lightweight agents; 9B targets server-side deployment with an excellent cost-performance ratio.
9B vs 120B: Benchmark Data
GPQA Diamond benchmark results:
| Model | GPQA Diamond | Parameters | Type |
|---|---|---|---|
| Qwen3.5-9B | 81.7 | 9B | End-to-End |
| GPT-OSS-120B | 71.5 | 120B | End-to-End |
9B outperforms 120B by 10.2 points.
VentureBeat's headline was direct: "Alibaba's small, open source Qwen3.5-9B beats OpenAI's gpt-oss-120B and can run on standard laptops".
What does "can run on standard laptops" mean in practice? With 4-bit quantization, the 9B model's weights fit in roughly 4-5GB of memory, so an RTX 3090, an A10, or even a high-end laptop GPU can run it. No A100- or H100-class datacenter GPU is needed.
Running a 120B model used to require at least eight A100s. The 9B model needs a single GPU. The cost difference is orders of magnitude.
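As a back-of-the-envelope check, weight memory is just parameter count times bytes per parameter. A sketch (weights only; real deployments add KV-cache and activation overhead):

```python
def model_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Rough weight footprint: parameters x bytes per parameter, in decimal GB."""
    bytes_total = params_billions * 1e9 * (bits_per_param / 8)
    return bytes_total / 1e9

# 9B model at different precisions
print(model_memory_gb(9, 16))  # bfloat16: 18.0 GB
print(model_memory_gb(9, 8))   # int8:      9.0 GB
print(model_memory_gb(9, 4))   # int4:      4.5 GB -> the "laptop" figure

# 120B model in bfloat16: 240.0 GB, hence multiple 80GB A100s
print(model_memory_gb(120, 16))
```

The 4-bit figure is what makes single-GPU and laptop deployment realistic for the 9B model.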
Technical Highlights: Why Can a Small Model Win?
Qwen3.5 is not a distilled or pruned variant. Several technical breakthroughs are behind it:
1. Unified Vision-Language Foundation
Early-fusion training on trillions of multimodal tokens. Qwen3.5 surpasses the Qwen3-VL models in reasoning, coding, agent capabilities, and multimodal understanding.
Figure 3: Qwen3.5 Flagship Model Performance Comparison (Source: GitHub README)
2. Efficient Hybrid Architecture
Gated Delta Networks combined with sparse Mixture-of-Experts (MoE) layers deliver high-throughput, low-latency inference.
Qwen3.5-397B-A17B has 397B total parameters but activates only 17B per forward pass. The 9B model's MoE configuration is not disclosed, but it inherits the same architectural philosophy.
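The sparse-MoE idea can be illustrated with a toy top-k router. A minimal sketch (the expert count and routing details here are invented for illustration, not Qwen3.5's actual configuration):

```python
import math

def top_k_route(logits, k):
    """Pick the k experts with the highest router scores and
    renormalize their weights with a softmax over just those k."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}

# Toy example: 8 experts, only 2 active per token.
router_logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.7]
weights = top_k_route(router_logits, k=2)
print(weights)                 # only experts 1 and 4 receive this token
print(sum(weights.values()))   # selected weights renormalize to 1.0
```

Because each token touches only k experts, compute per forward pass scales with the active parameters, not the total: that is how a 397B-parameter model can run with a 17B-parameter cost.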
3. Scalable RL Generalization
Reinforcement learning scaled across millions of agent environments. The goal is real-world adaptability, not optimization for specific benchmarks.
4. Global Language Coverage
Language coverage expanded from 119 to 201 languages, and the vocabulary grew from 150k to 250k tokens, improving encoding/decoding efficiency by 10-60%.
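The efficiency gain follows from tokens-per-text: a larger vocabulary encodes the same text into fewer tokens, and autoregressive decoding cost scales roughly linearly with token count. A hedged arithmetic sketch (the token counts below are invented for illustration, not measured):

```python
def decode_speedup(tokens_old: int, tokens_new: int) -> float:
    """If the same text needs fewer tokens, generation needs
    proportionally fewer decode steps."""
    return tokens_old / tokens_new

# Hypothetical: a passage that took 160 tokens under the 150k vocab
# but 128 tokens under the 250k vocab -> 1.25x fewer decode steps.
print(decode_speedup(160, 128))
```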
Deployment: One Command
How simple is deploying Qwen3.5-9B? Ollama one command:
```shell
ollama run qwen3.5:9b
```
That's it.
Using transformers:
```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load the model in bfloat16 and let accelerate place it on available devices.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-9B")
```
Memory Usage (9B weights, rough estimates)
- bfloat16 precision: ~18GB
- int8 quantization: ~9-10GB
- int4 quantization: ~4-5GB
Inference Speed (Single RTX 3090)
- Generation speed: ~30-50 tokens/second
- First token latency: <100ms
Comparison with 120B model:
- Memory usage: ~240GB (bfloat16)
- Requires: 8 A100s (80GB each)
- Inference speed: ~5-10 tokens/second
The difference is clear.
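To make the throughput gap concrete, here is time-to-generate at the midpoint of each range above (a rough sketch; real latency also depends on prompt length and batching):

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Generation wall time at a steady decode rate."""
    return tokens / tokens_per_second

# A 1,000-token response at the midpoint of each range:
print(seconds_for(1000, 40))   # 9B on a single RTX 3090: 25.0 s
print(seconds_for(1000, 7.5))  # 120B on 8x A100: ~133 s
```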
Selection Guide: How to Choose 0.8B/2B/4B/9B?
| Requirement | Recommended Model | Reason |
|---|---|---|
| Mobile deployment | 0.8B/2B | Tiny size, ultra-fast inference |
| IoT edge devices | 0.8B/2B | Low resource consumption |
| Lightweight Agent | 4B | Balance performance and resources |
| Server general use | 9B | Best cost-performance |
| Memory <4GB | 0.8B/2B | Minimum resource requirements |
| Memory 4-8GB | 4B/9B | Medium resource requirements |
| Pursue maximum performance | 9B | Close to 120B-class performance |
Recommendations
- Ample memory (≥8GB): go straight for the 9B (quantized if you have less than ~18GB)
- Mobile development: Choose 2B
- Agent development: 4B is the sweet spot
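The guide above can be condensed into a small helper. This is a sketch of the article's recommendations, not an official tool; the function name and thresholds are my own, following the table:

```python
def pick_qwen35(memory_gb: float, use_case: str = "general") -> str:
    """Map the selection guide to a model choice."""
    if use_case in ("mobile", "iot"):
        return "Qwen3.5-2B" if memory_gb >= 2 else "Qwen3.5-0.8B"
    if use_case == "agent":
        return "Qwen3.5-4B"
    # General/server use: take the largest model that fits.
    if memory_gb >= 8:
        return "Qwen3.5-9B"
    if memory_gb >= 4:
        return "Qwen3.5-4B"
    return "Qwen3.5-2B"

print(pick_qwen35(16))            # Qwen3.5-9B
print(pick_qwen35(3, "mobile"))   # Qwen3.5-2B
print(pick_qwen35(6, "agent"))    # Qwen3.5-4B
```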
Conclusion: The Era of Small Models
The open-sourcing of Qwen3.5-9B marks a new trend: small models are no longer a "compromise" but a "choice".
The old assumption was performance = parameters. A 9B model surpassing a 120B model tells us that architecture optimization beats parameter stacking.
This is good news for developers. What once required cloud API calls can now be deployed locally. Workloads that once raised data-privacy concerns can now run completely offline. What was once too expensive now fits on a single GPU.
Resources
- GitHub: https://github.com/QwenLM/Qwen3.5
- ModelScope: https://modelscope.cn/collections/Qwen/Qwen35
- HuggingFace: https://huggingface.co/collections/Qwen/qwen35
- Official Blog: https://qwen.ai/blog?id=qwen3.5
- Qwen Chat: https://chat.qwen.ai
Data Sources
- GitHub README (QwenLM/Qwen3.5)
- VentureBeat coverage (2026-03-02)
- Alibaba official blog (qwen.ai)
- GPQA Diamond official leaderboard