When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR
This paper examines how Vision-Language Models (VLMs) often over-correct errors when transcribing multi-line handwritten math solutions, compromising their use in educational assessment. The authors introduce PINK, a new evaluation metric that penalizes such over-correction via LLM-based rubric grading. Unlike traditional lexical metrics such as BLEU, PINK aligns more closely with human judgment and reveals significant performance differences among state-of-the-art VLMs. On the FERMAT dataset, models such as GPT-4o are heavily penalized for fixing student errors, while Gemini 2.5 Flash ranks highest for transcription fidelity.
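The core idea of penalizing over-correction can be sketched with a toy token-level check. This is an illustration only, not the paper's PINK metric (which, per the summary, relies on LLM-based rubric grading); the function name and the three-way comparison against a corrected reference are assumptions for the example.

```python
def over_correction_penalty(prediction: str, student_truth: str, clean_solution: str) -> int:
    """Toy illustration (not PINK itself): count positions where the model's
    transcription 'fixes' a student error, i.e. where the student's writing
    differs from the corrected solution but the prediction matches the
    correction instead of what the student actually wrote."""
    penalties = 0
    for p, t, c in zip(prediction.split(), student_truth.split(), clean_solution.split()):
        if t != c and p == c:  # student erred here, but the model output the fix
            penalties += 1
    return penalties

# Student wrote "2 + 2 = 5"; the mathematically correct line is "2 + 2 = 4".
# A faithful transcription keeps the 5; an over-correcting model outputs 4.
print(over_correction_penalty("2 + 2 = 4", "2 + 2 = 5", "2 + 2 = 4"))  # 1
print(over_correction_penalty("2 + 2 = 5", "2 + 2 = 5", "2 + 2 = 4"))  # 0
```

A lexical metric scored against the corrected solution would reward the first model; scoring against the student's actual writing, as FERMAT-style evaluation requires, penalizes it.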
Opening excerpt (first ~120 words)
Computer Science > Computers and Society
arXiv:2604.22774 (cs) [Submitted on 1 Apr 2026]
Title: When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR
Authors: Jin Seong, Wencke Liermann, Minho Kim, Jong-hun Shin, Soojong Lim
Abstract: Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.