WeSearch

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

#handwritten math ocr#vision-language models#educational ai#evaluation metrics#over-correction
⚡ TL;DR · AI summary

This paper examines how Vision-Language Models (VLMs) often over-correct errors when transcribing multi-line handwritten math solutions, compromising their use in educational assessment. The authors introduce PINK, a new evaluation metric that penalizes such over-correction by using LLM-based rubric grading. Unlike traditional lexical metrics like BLEU, PINK better aligns with human judgment and reveals significant performance differences among state-of-the-art VLMs. On the FERMAT dataset, models like GPT-4o are heavily penalized for fixing student errors, while Gemini 2.5 Flash ranks highest for transcription fidelity.
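The paper's PINK metric relies on LLM-based rubric grading, whose details are not reproduced here. Purely as an illustrative toy (not the paper's method, and all names hypothetical), the core idea of penalizing over-correction can be sketched as a token-level rate: given the faithful transcription of a student's (erroneous) work and the mathematically corrected version, count how often the model's output "fixes" a token instead of transcribing it:

```python
def over_correction_rate(faithful, corrected, hypothesis):
    """Toy over-correction rate over position-aligned token lists.

    faithful:   ground-truth transcription of the student's work (errors kept)
    corrected:  the mathematically correct version of the same solution
    hypothesis: the OCR model's output

    Real systems need edit-distance alignment; this assumes equal-length,
    position-aligned token sequences for clarity.
    """
    over = total = 0
    for f, c, h in zip(faithful, corrected, hypothesis):
        if f != c:          # the student made an error at this token
            total += 1
            if h == c:      # the model silently "fixed" the error
                over += 1
    return over / total if total else 0.0

# Student wrote "2 + 2 = 5"; a faithful OCR should transcribe the 5.
faithful = "2 + 2 = 5".split()
corrected = "2 + 2 = 4".split()
print(over_correction_rate(faithful, corrected, "2 + 2 = 4".split()))  # 1.0 (over-corrected)
print(over_correction_rate(faithful, corrected, "2 + 2 = 5".split()))  # 0.0 (faithful)
```

A purely lexical metric like BLEU would score both hypotheses nearly identically against a corrected reference, which is exactly the failure mode the authors argue against.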

Original article: arXiv cs.AI
Opening excerpt (first ~120 words)

Computer Science > Computers and Society · arXiv:2604.22774 (cs) · Submitted on 1 Apr 2026
Title: When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR
Authors: Jin Seong, Wencke Liermann, Minho Kim, Jong-hun Shin, Soojong Lim
Abstract: Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly.

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.

