Xiaohongshu has open-sourced FireRed-OCR, a 2B parameter model that achieves 92.94% on OmniDocBench v1.5. To put this in perspective: it outperforms Qwen3.5-397B (90.80%) and Gemini-3.0 Pro (90.33%). Released under Apache 2.0 license, both code and weights are available for commercial use.
The "Structural Hallucination" Problem in Document Parsing
General-purpose large models have a common weakness when reading PDFs: they recognize text accurately but struggle with structure.
Typical issues include:
- Table rows and columns scrambled, data mismatched
- Mathematical formulas with "creative" additions - extra symbols appearing out of nowhere
- Multi-column documents with jumbled reading order, cross-column confusion
This isn't an occasional bug. General VLMs are trained to generate semantically coherent text but lack precise spatial constraints for document pixel-level structures.
FireRed-OCR's approach is straightforward: transform a general VLM into a "structural engineer" with a systematic training framework that enforces format and syntax constraints.
Technical Solution: Three-Stage Training + Format-Constrained GRPO
FireRed-OCR isn't a simple fine-tune - it's a complete training pipeline:
Stage 1: Multi-Task Pre-Alignment
Build "spatial foundations" at the visual perception level. The model learns object detection, region recognition, and layout-to-Markdown mapping, establishing the groundwork for spatial localization.
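The layout-to-Markdown mapping can be pictured as sorting detected regions into reading order and emitting Markdown per region type. The sketch below is an illustrative reconstruction with an invented region schema, not FireRed-OCR's actual code:

```python
# Hypothetical sketch of layout-to-Markdown mapping (invented schema,
# not FireRed-OCR's actual implementation).

def layout_to_markdown(regions):
    """Sort regions top-to-bottom, left-to-right, then emit Markdown per type."""
    ordered = sorted(regions, key=lambda r: (r["bbox"][1], r["bbox"][0]))
    renderers = {
        "title":   lambda r: f"# {r['text']}",
        "heading": lambda r: f"## {r['text']}",
        "text":    lambda r: r["text"],
        "formula": lambda r: f"$${r['text']}$$",
    }
    return "\n\n".join(renderers[r["type"]](r) for r in ordered)

regions = [
    {"type": "text",  "bbox": (0, 120), "text": "Body paragraph."},
    {"type": "title", "bbox": (0, 0),   "text": "A Document"},
]
print(layout_to_markdown(regions))
```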
Stage 2: Specialized Supervised Fine-Tuning (SFT)
Precision-tune on high-quality, standardized Markdown datasets to ensure logical consistency and hierarchical expression capabilities.
Stage 3: Format-Constrained Reinforcement Learning
The core innovation lies here. GRPO (Group Relative Policy Optimization) is a reinforcement learning method. FireRed-OCR introduces dedicated format reward signals across four dimensions:
| Dimension | Reward Signal |
|---|---|
| Formula Syntax Correctness | LaTeX validity |
| Table Structure Integrity | Tag closure |
| Hierarchical Tag Closure | Markdown nesting correctness |
| Text Accuracy | Character-level recognition precision |
Every output is scored across these four dimensions, with feedback sent to the model for self-correction.
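What such format rewards might look like can be sketched with simple syntactic checks (brace balance as a crude proxy for LaTeX validity, paired tags for table integrity). This is an illustrative guess; the actual reward functions are not published in this form:

```python
import re

# Hypothetical per-dimension format rewards (illustrative only;
# not FireRed-OCR's actual reward implementation).

def latex_reward(formula: str) -> float:
    """1.0 if braces are balanced, else 0.0 (a crude validity proxy)."""
    depth = 0
    for ch in formula:
        depth += ch == "{"
        depth -= ch == "}"
        if depth < 0:
            return 0.0
    return 1.0 if depth == 0 else 0.0

def table_reward(html: str) -> float:
    """1.0 if every <tr>/<td> tag has a matching closing tag."""
    for tag in ("tr", "td"):
        if len(re.findall(f"<{tag}[ >]", html)) != html.count(f"</{tag}>"):
            return 0.0
    return 1.0

def format_score(formula: str, table: str) -> float:
    """Average the per-dimension rewards into one scalar training signal."""
    return (latex_reward(formula) + table_reward(table)) / 2

print(format_score(r"\frac{a}{b}", "<tr><td>1</td></tr>"))  # 1.0
```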
Performance Benchmarks: What 92.94% Means
FireRed-OCR-2B's performance on OmniDocBench v1.5:
| Model | Overall Score | Parameters | Type |
|---|---|---|---|
| FireRed-OCR-2B | 92.94% | 2B | End-to-End |
| Qwen3.5-397B | 90.80% | 397B | End-to-End |
| Gemini-3.0 Pro | 90.33% | - | End-to-End |
| DeepSeek-OCR 2 | 91.09% | - | End-to-End |
| GLM-OCR | 94.60% | - | Pipeline |
| PaddleOCR-VL-1.5 | 94.50% | 1.5B | Pipeline |
Among end-to-end single models, FireRed-OCR scores highest. PaddleOCR-VL-1.5 and GLM-OCR are pipeline approaches (multiple specialized models chained together); they score higher but are more complex to deploy.
For text recognition alone (OCRBench TextRec), FireRed-OCR-2B scores 93.5, ranking first among all models, surpassing GPT-5.2 (93.0) and Gemini-3.0 Pro (91.9).
In FireRedBench (the team's custom "stress test" benchmark featuring real-world non-standard documents), FireRed-OCR-2B scores 74.62, taking first place among end-to-end solutions, surpassing pipeline GLM-OCR (74.33), and trailing only PaddleOCR-VL-1.5 (76.47).
The base model, Qwen3-VL-2B-Instruct, scores only 65.58 on the same benchmark, so the training framework alone accounts for a gain of roughly 9 points.
Deployment: Ready in Minutes
With 2B parameters and bfloat16 precision, memory usage is approximately 4-5GB. A single RTX 3090 / A10 GPU is sufficient for smooth inference.
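The 4-5GB figure follows from parameter count times bytes per weight, plus activation and KV-cache overhead. A back-of-envelope check:

```python
# Back-of-envelope VRAM estimate for a 2B-parameter model in bfloat16.
params = 2e9
bytes_per_param = 2  # bfloat16 = 16 bits
weights_gb = params * bytes_per_param / 1024**3
# Activations and KV cache add roughly another 1-2 GB at typical document
# resolutions (rough assumption; varies with image size and max_new_tokens).
print(f"weights alone: {weights_gb:.1f} GB")
```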
Installation:
```bash
pip install transformers qwen-vl-utils
git clone https://github.com/FireRedTeam/FireRed-OCR.git
cd FireRed-OCR
```
Inference Example:
```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from conv_for_infer import generate_conv  # helper from the FireRed-OCR repo

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "FireRedTeam/FireRed-OCR",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("FireRedTeam/FireRed-OCR")

image_path = "./examples/complex_table.png"
messages = generate_conv(image_path)
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=8192)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
print(output_text[0])  # Standard Markdown format
```
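Because the output is standard Markdown, it can be post-processed with ordinary text tooling. For example, display formulas can be pulled out for downstream validation or indexing (the sample string below is hypothetical, not real model output):

```python
import re

# Hypothetical sample of the model's Markdown output.
markdown = "## Results\n\nEnergy is given by:\n\n$$E = mc^2$$\n\nSee table below."

# Extract display formulas ($$...$$) for downstream validation or indexing.
formulas = re.findall(r"\$\$(.+?)\$\$", markdown, flags=re.DOTALL)
print(formulas)  # ['E = mc^2']
```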
Performance Optimization Tips:
- Enable `flash_attention_2` for significantly reduced peak memory and improved throughput
- `max_new_tokens` defaults to 8192; for dense academic papers, maintain or increase this value
- Image quality matters significantly; provide images ≥150 DPI for best results
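The 150 DPI guideline translates into concrete pixel dimensions. For an A4 page (210 × 297 mm):

```python
# Minimum pixel dimensions for an A4 page (210 x 297 mm) at the
# recommended 150 DPI; images below this tend to parse worse structurally.
dpi = 150
mm_per_inch = 25.4
width_px = round(210 / mm_per_inch * dpi)   # 1240
height_px = round(297 / mm_per_inch * dpi)  # 1754
print(width_px, height_px)
```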
Ideal Use Cases
FireRed-OCR excels at documents requiring structural integrity:
- Academic papers (with formulas)
- Financial reports and tables
- Technical documentation
- Multi-column book scans
For downstream tasks where "tables must not break" and "formulas must not err" are critical requirements, it's currently the most reliable end-to-end choice.
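When tables must not break, the Markdown output can be consumed directly as structured rows. A minimal parser sketch (the table string here is a hypothetical sample, not real model output):

```python
def parse_markdown_table(md: str):
    """Parse a pipe-delimited Markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in md.strip().splitlines()]
    split = lambda ln: [c.strip() for c in ln.strip("|").split("|")]
    header = split(lines[0])
    rows = [split(ln) for ln in lines[2:]]  # skip the |---| separator line
    return [dict(zip(header, row)) for row in rows]

# Hypothetical sample table in the model's output format.
table = """
| Quarter | Revenue |
|---|---|
| Q1 | 120 |
| Q2 | 135 |
"""
print(parse_markdown_table(table))
```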
When Not to Use
- For extreme precision requirements with engineering resources to maintain multi-model systems → Choose PaddleOCR-VL-1.5 or GLM-OCR
- For very poor quality scans (<100 DPI) → Performance degrades significantly
Conclusion
FireRed-OCR demonstrates the power of specialized optimization: through a carefully designed training framework, a 2B model beats general-purpose models nearly 200× its size, such as the 397B Qwen3.5, on a vertical task.
For specialized tasks, targeted training is more efficient than scaling model size.
Resources
- GitHub: https://github.com/FireRedTeam/FireRed-OCR
- ModelScope: https://modelscope.cn/models/FireRedTeam/FireRed-OCR
- Technical Report: https://arxiv.org/abs/2603.01840
- Demo: https://huggingface.co/spaces/FireRedTeam/FireRed-OCR