From Pixels to Prescriptions: Engineering OCR Pipelines for Medical Report Simplification Using MongoDB
The article details the development of ocr.py, the OCR service layer for MediSimplify, a system that converts complex medical reports into patient-friendly language. The pipeline handles diverse document types by first attempting to extract embedded text from PDFs and resorting to OCR only when necessary, which improves both speed and accuracy. It reports a clear error when Tesseract is unavailable, normalises whitespace, and renders scanned pages at a DPI suited to OCR so that text extraction stays reliable in real-world conditions. The processed text is then simplified using an AI model and stored in MongoDB.
Kotha Deepak Reddy
Posted on Apr 28

#ai #ocr #medical #mongodb

Team Members
@k_sidharthareddy_15 | @k-deepak-544 | @nupur_madhrey_07 | @avika_kashyap | @dheerajkumar08 | @chanda_rajkumar

Introduction

So here's the thing: when we started working on MediSimplify, a project that takes medical reports and converts them into patient-friendly language, we thought the hard part would be the NLP simplification. Turns out, just getting the text out of the document was already a mini-nightmare.

Medical reports come as everything: clean PDFs, scanned images, ancient faxed documents that someone scanned and emailed. OCR tools are finicky. Tesseract might not be installed on the deployment machine. A "PDF" might be a text-selectable document or a rasterized scan, and you can't tell which until you open it. We needed something that handled all of this gracefully, without crashing or silently returning garbage.

This post walks through how we built ocr.py, the dedicated OCR service layer inside MediSimplify, and the specific decisions that made it actually reliable in a messy real-world setting.

The Problem

Medical documents are inconsistent by nature. They arrive in formats that no single extraction strategy can handle cleanly. Images (JPG/PNG) always need OCR. PDFs might have selectable text embedded, or they might be 300 DPI scans of printed pages; you don't know until you try. And raw OCR output is noisy: double spaces, broken newlines, garbled characters everywhere. That noise degrades everything downstream, especially the simplification model.
On top of that, Tesseract isn't guaranteed to be installed wherever the backend runs. If you just call pytesseract.image_to_string() directly, any deployment where Tesseract hasn't been configured will surface a cryptic Python exception to the user, which is not useful at all.

Always try the cheap path first. Embedded PDF text extraction is near-instant and returns the exact text stored in the document. OCR is slow and error-prone. Only call Tesseract when you have to, and when you do, render pages at a proper DPI so the results are actually good.

Our Solution

Rather than scattering OCR logic across the codebase, we built a single ocr.py service that all file upload endpoints call through. It has one job: accept raw bytes, return clean text. Here's what it does:

- Automatically resolves Tesseract's path from config or system PATH
- Raises a clear, user-readable error when OCR is unavailable
- Tries embedded PDF text first (fast path via PyMuPDF)
- Falls back to Tesseract OCR only for scanned pages
- Normalises whitespace so downstream NLP gets clean input

Tech Stack

- Frontend: Next.js (App Router) + TypeScript + Tailwind CSS
- Backend: FastAPI + PyMongo
- Database: MongoDB
- OCR: pytesseract + PyMuPDF
- AI Simplification: flan-t5-small transformer with medical-term fallback
- Auth: JWT (login/signup/logout)

Key Features

1. Resolve Tesseract's path from config or system PATH

Tesseract can be installed anywhere: a system binary, a virtualenv, a custom path set by ops. Rather than hardcoding where to look, the service checks your config first, then falls back to a system path search.…
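The behaviours listed under Our Solution can be sketched roughly as follows. This is a minimal illustration, not the actual ocr.py: the names `OCRUnavailableError`, `resolve_tesseract_cmd`, `normalize_whitespace`, `ocr_image_bytes`, and `extract_pdf_text` are hypothetical, the 300 DPI default is an assumption, and the exact whitespace rules the authors use are not shown in the excerpt. PyMuPDF, pytesseract, and Pillow are imported lazily so that the helpers which do not need them still work on a machine without OCR installed.

```python
import io
import os
import re
import shutil


class OCRUnavailableError(RuntimeError):
    """Hypothetical error type: Tesseract could not be located."""


def resolve_tesseract_cmd(configured_path=None):
    """Return a Tesseract executable path: config value first, then PATH."""
    if configured_path and os.path.isfile(configured_path):
        return configured_path
    found = shutil.which("tesseract")
    if found:
        return found
    raise OCRUnavailableError(
        "OCR is unavailable: Tesseract was not found. "
        "Install it or set its path in the server configuration."
    )


def normalize_whitespace(text):
    """Collapse repeated spaces/tabs and runs of blank lines (assumed rules)."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()


def ocr_image_bytes(data, configured_path=None):
    """Run Tesseract on raw image bytes (JPG/PNG)."""
    import pytesseract
    from PIL import Image

    # Resolve the binary up front so a missing install fails with a clear error.
    pytesseract.pytesseract.tesseract_cmd = resolve_tesseract_cmd(configured_path)
    return pytesseract.image_to_string(Image.open(io.BytesIO(data)))


def extract_pdf_text(data, configured_path=None, dpi=300):
    """Use embedded text where a page has it; OCR only pages that have none."""
    import fitz  # PyMuPDF

    pages = []
    with fitz.open(stream=data, filetype="pdf") as doc:
        for page in doc:
            text = page.get_text().strip()
            if not text:
                # Scanned page: render it to PNG at the requested DPI, then OCR.
                png = page.get_pixmap(dpi=dpi).tobytes("png")
                text = ocr_image_bytes(png, configured_path)
            pages.append(text)
    return normalize_whitespace("\n\n".join(pages))
```

In this sketch a PDF whose pages all carry embedded text never touches Tesseract, so the clear "OCR is unavailable" error only appears when a scanned page or an image actually requires it.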
This excerpt is published under fair use for community discussion. Read the full article at DEV Community.