FireRed-OCR: 2B Model Beats 397B - Document Parsing SOTA (2026 Complete Guide)

Xiaohongshu's open-source FireRed-OCR achieves 92.94% on OmniDocBench v1.5, surpassing Qwen3.5-397B and Gemini Pro. Apache 2.0 licensed for commercial use.

FireRed-OCR Logo

Xiaohongshu has open-sourced FireRed-OCR, a 2B parameter model that achieves 92.94% on OmniDocBench v1.5. To put this in perspective: it outperforms Qwen3.5-397B (90.80%) and Gemini-3.0 Pro (90.33%). Released under Apache 2.0 license, both code and weights are available for commercial use.

FireRed-OCR Teaser

The "Structural Hallucination" Problem in Document Parsing

General-purpose large models have a common weakness when reading PDFs: they recognize text accurately but struggle with structure.

Typical issues include:

  • Table rows and columns scrambled, data mismatched
  • Mathematical formulas with "creative" additions - extra symbols appearing out of nowhere
  • Multi-column documents with jumbled reading order, cross-column confusion

This isn't an occasional bug. General VLMs are trained to generate semantically coherent text, but that objective imposes no precise spatial constraints on a document's pixel-level structure.

FireRed-OCR's approach is straightforward: transform a general VLM into a "structural engineer" with a systematic training framework that enforces format and syntax constraints.

Technical Solution: Three-Stage Training + Format-Constrained GRPO

FireRed-OCR isn't a simple fine-tune - it's a complete training pipeline:

Model Architecture

Stage 1: Multi-Task Pre-Alignment

Build "spatial foundations" at the visual perception level. The model learns object detection, region recognition, and layout-to-Markdown mapping, establishing the groundwork for spatial localization.

Stage 2: Specialized Supervised Fine-Tuning (SFT)

Precision-tune on high-quality, standardized Markdown datasets to ensure logical consistency and hierarchical expression capabilities.

Stage 3: Format-Constrained Reinforcement Learning

The core innovation lies here. GRPO (Group Relative Policy Optimization) is a reinforcement learning method. FireRed-OCR introduces dedicated format reward signals across four dimensions:

| Dimension | Reward Signal |
| --- | --- |
| Formula Syntax Correctness | LaTeX validity |
| Table Structure Integrity | Tag closure |
| Hierarchical Tag Closure | Markdown nesting correctness |
| Text Accuracy | Character-level recognition precision |

Every output is scored across these four dimensions, with feedback sent to the model for self-correction.
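To make the mechanism concrete, here is a minimal sketch of what such a composite format reward could look like. The specific checks, weights, and function name are illustrative assumptions for this article, not the team's published implementation:

import re

def format_reward(output_md: str, reference_text: str) -> float:
    """Toy composite format reward over four dimensions (weights are illustrative)."""
    score = 0.0

    # 1. Formula syntax: inline math delimiters $...$ must be balanced
    if output_md.count("$") % 2 == 0:
        score += 0.25

    # 2. Table structure: every opening <table> tag must be closed
    if output_md.count("<table>") == output_md.count("</table>"):
        score += 0.25

    # 3. Hierarchical tags: Markdown heading levels should not skip (e.g. # straight to ###)
    levels = [len(h) for h in re.findall(r"^(#+)\s", output_md, flags=re.M)]
    if all(b - a <= 1 for a, b in zip(levels, levels[1:])):
        score += 0.25

    # 4. Text accuracy: crude character-overlap proxy against a reference transcription
    ref_chars, out_chars = set(reference_text), set(output_md)
    score += 0.25 * (len(ref_chars & out_chars) / max(len(ref_chars), 1))

    return score  # in [0, 1]

GRPO samples a group of candidate outputs per document, scores each candidate with rewards like this, and pushes the policy toward candidates that score above the group average.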

Performance Benchmarks: What 92.94% Means

FireRed-OCR-2B's performance on OmniDocBench v1.5:

| Model | Overall Score | Parameters | Type |
| --- | --- | --- | --- |
| FireRed-OCR-2B | 92.94% | 2B | End-to-End |
| Qwen3.5-397B | 90.80% | 397B | End-to-End |
| Gemini-3.0 Pro | 90.33% | - | End-to-End |
| DeepSeek-OCR 2 | 91.09% | - | End-to-End |
| GLM-OCR | 94.60% | - | Pipeline |
| PaddleOCR-VL-1.5 | 94.50% | 1.5B | Pipeline |

FireRed-OCR is the top-scoring end-to-end single-model solution. PaddleOCR-VL-1.5 and GLM-OCR score higher, but they are pipeline approaches (multiple specialized models chained together) and are more complex to deploy.

For text recognition alone (OCRBench TextRec), FireRed-OCR-2B scores 93.5, ranking first among all models, surpassing GPT-5.2 (93.0) and Gemini-3.0 Pro (91.9).

In FireRedBench (the team's custom "stress test" benchmark featuring real-world non-standard documents), FireRed-OCR-2B scores 74.62, taking first place among end-to-end solutions, surpassing pipeline GLM-OCR (74.33), and trailing only PaddleOCR-VL-1.5 (76.47).

The base model, Qwen3-VL-2B-Instruct, scores only 65.58 on the same benchmark, which shows how much the training framework adds.

Deployment: Ready in Minutes

With 2B parameters and bfloat16 precision, memory usage is approximately 4-5GB (2 billion parameters × 2 bytes per parameter ≈ 4 GB for weights, plus activations and KV cache). A single RTX 3090 / A10 GPU is sufficient for smooth inference.

Installation:

pip install transformers qwen-vl-utils
git clone https://github.com/FireRedTeam/FireRed-OCR.git
cd FireRed-OCR

Inference Example:

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from conv_for_infer import generate_conv  # helper script from the FireRed-OCR repo

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "FireRedTeam/FireRed-OCR",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("FireRedTeam/FireRed-OCR")

image_path = "./examples/complex_table.png"
messages = generate_conv(image_path)  # builds the chat-format conversation for the image

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=8192)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
print(output_text)  # Standard Markdown format
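For multi-page PDFs, a common pattern (not specific to FireRed-OCR) is to rasterize each page and run the same loop per page. The sketch below assumes the pdf2image package (and its poppler dependency) is installed, uses a placeholder file name, and reuses model, processor, and generate_conv from the example above:

from pdf2image import convert_from_path  # assumed extra dependency; needs poppler installed

pages = convert_from_path("report.pdf", dpi=200)  # placeholder file; stay above the ~150 DPI guideline
markdown_pages = []
for i, page in enumerate(pages):
    page_path = f"page_{i:03d}.png"
    page.save(page_path)
    messages = generate_conv(page_path)
    inputs = processor.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_dict=True, return_tensors="pt"
    ).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=8192)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    markdown_pages.append(processor.batch_decode(trimmed, skip_special_tokens=True)[0])

full_markdown = "\n\n".join(markdown_pages)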

Performance Optimization Tips:

  • Enable flash_attention_2 to significantly reduce peak memory and improve throughput (see the snippet after this list)
  • max_new_tokens is set to 8192 in the example above; for dense academic papers, keep or increase this value
  • Image quality matters significantly; provide images ≥150 DPI for best results
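As referenced above, a hedged sketch of enabling FlashAttention-2 at load time; attn_implementation is a standard transformers argument, and the flash-attn package (plus a GPU that supports it) is assumed to be installed:

import torch
from transformers import Qwen3VLForConditionalGeneration

# Requires: pip install flash-attn, and a GPU architecture that supports it
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "FireRedTeam/FireRed-OCR",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # replaces the default attention kernels
)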

Ideal Use Cases

FireRed-OCR excels at documents requiring structural integrity:

  • Academic papers (with formulas)
  • Financial reports and tables
  • Technical documentation
  • Multi-column book scans

For downstream tasks where "tables must not break" and "formulas must not err" are critical requirements, it's currently the most reliable end-to-end choice.

When Not to Use

  • If you need maximum precision and have the engineering resources to maintain a multi-model system → choose PaddleOCR-VL-1.5 or GLM-OCR
  • For very poor quality scans (<100 DPI) → Performance degrades significantly

Conclusion

FireRed-OCR demonstrates the power of "specialized optimization": through a carefully designed training framework, a 2B model beats 397B general-purpose models on a vertical task.

For specialized tasks, targeted training is more efficient than scaling model size.

Resources