How We Built a Custom Vision LLM to Improve Document Processing at Grab
Grab developed a specialized Vision LLM pipeline to improve OCR and key information extraction for Southeast Asian documents. The team evaluated Qwen2-VL 2B, generated synthetic OCR data, built an auto-labeling and preprocessing platform (Documint), and experimented with both LoRA and full fine-tuning. They then built a custom ~1B model by combining a strong vision encoder with a compact decoder. The custom model comes close to the accuracy of the 2B model at much lower latency, with improved performance on Thai and Vietnamese scripts.
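
To make the trade-off between LoRA and full fine-tuning concrete, here is a minimal sketch of the parameter arithmetic for a single linear layer. LoRA freezes the original weight matrix and trains two low-rank factors instead, so the trainable count drops from `d_out * d_in` to `rank * (d_out + d_in)`. The layer size and rank below are illustrative assumptions, not values from Grab's model, and the function names are hypothetical.

```python
def full_trainable_params(d_out: int, d_in: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is fine-tuned."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def lora_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of the layer's weights that LoRA actually trains."""
    return lora_trainable_params(d_out, d_in, rank) / full_trainable_params(d_out, d_in)


if __name__ == "__main__":
    # Assumed example layer: 1024 x 1024 with rank 8 (hypothetical values).
    d_out, d_in, rank = 1024, 1024, 8
    print("full fine-tuning:", full_trainable_params(d_out, d_in))
    print("LoRA:", lora_trainable_params(d_out, d_in, rank))
    print("fraction trained:", lora_fraction(d_out, d_in, rank))
```

The small trainable fraction is what makes LoRA cheap to run, and it also limits how far the model can move from its pretrained weights, which is one reason a team might compare it against full fine-tuning.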