When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR
This paper examines how Vision-Language Models (VLMs) often over-correct errors when transcribing multi-line handwritten math solutions, compromising their use in educational assessment. The authors introduce PINK, a new evaluation metric that penalizes such over-correction via LLM-based rubric grading. Unlike traditional lexical metrics such as BLEU, PINK aligns more closely with human judgment and reveals significant performance differences among state-of-the-art VLMs. On the FERMAT dataset, models such as GPT-4o are heavily penalized for fixing student errors, while Gemini 2.5 Flash ranks highest for transcription fidelity.
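The core idea of penalizing over-correction can be sketched with a toy token-level check. This is an illustration only, not the paper's PINK metric (which, per the summary, relies on LLM-based rubric grading); the function name and the three-way comparison against a corrected reference are assumptions for the example.

```python
def over_correction_penalty(prediction: str, student_truth: str, clean_solution: str) -> int:
    """Toy illustration (not PINK itself): count positions where the model's
    transcription 'fixes' a student error, i.e. where the student's writing
    differs from the corrected solution but the prediction matches the
    correction instead of what the student actually wrote."""
    penalties = 0
    for p, t, c in zip(prediction.split(), student_truth.split(), clean_solution.split()):
        if t != c and p == c:  # student erred here, but the model output the fix
            penalties += 1
    return penalties

# Student wrote "2 + 2 = 5"; the mathematically correct line is "2 + 2 = 4".
# A faithful transcription keeps the 5; an over-correcting model outputs 4.
print(over_correction_penalty("2 + 2 = 4", "2 + 2 = 5", "2 + 2 = 4"))  # 1
print(over_correction_penalty("2 + 2 = 5", "2 + 2 = 5", "2 + 2 = 4"))  # 0
```

A lexical metric scored against the corrected solution would reward the first model; scoring against the student's actual writing, as FERMAT-style evaluation requires, penalizes it.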
Opening excerpt (first ~120 words)
Computer Science > Computers and Society
arXiv:2604.22774 (cs) [Submitted on 1 Apr 2026]
Title: When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR
Authors: Jin Seong, Wencke Liermann, Minho Kim, Jong-hun Shin, Soojong Lim
Abstract: Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.