Extract Text from Documents and Images with Datalab Marker and OCR


Datalab's state-of-the-art document parsing and text extraction models, Marker and OCR, are now available on Replicate. Marker converts PDF, DOCX, and PPTX files, as well as images, into markdown or JSON. It formats tables, math, and code, extracts embedded images, and can pull specific fields when given a JSON schema. OCR detects text in 90 languages in images and documents and returns the reading order and table grids.

The Marker model is based on the popular open-source Marker project, which has 29k stars on GitHub, while OCR is based on the Surya project, which has 19k stars. Both models are fast and accurate, and both outperform established tools such as Tesseract. Marker processes a single page in about 0.18 seconds and reaches about 120 pages per second when pages are batched.
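To put those two speed figures side by side, here is a minimal sketch that turns them into rough processing-time estimates. The function name and the `batched` flag are illustrative only, and the estimate ignores upload, queueing, and start-up time, none of which the article quantifies.

```python
# Figures quoted in the article; everything else here is an illustrative assumption.
SECONDS_PER_PAGE = 0.18          # one page processed on its own
BATCHED_PAGES_PER_SECOND = 120   # throughput when pages are batched


def estimate_seconds(pages: int, batched: bool = False) -> float:
    """Rough processing time in seconds for a document of `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if batched:
        return pages / BATCHED_PAGES_PER_SECOND
    return pages * SECONDS_PER_PAGE


if __name__ == "__main__":
    for n in (1, 100, 1000):
        one_by_one = estimate_seconds(n)
        in_batch = estimate_seconds(n, batched=True)
        print(f"{n} pages: {one_by_one:.2f} s one by one, {in_batch:.2f} s batched")
```

By these numbers, batching is worth it for anything beyond a handful of pages, since the per-page cost drops from 0.18 seconds to under a hundredth of a second.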

One particularly powerful feature of Marker is structured extraction: given a JSON schema, it can return specific fields from a document, such as the number, date, and total of an invoice.

Marker was evaluated on olmOCR-Bench, a benchmark of 1,403 PDF files and 7,010 test cases that measures how accurately OCR systems convert PDF documents to markdown while preserving critical textual and structural information.
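To make the structured-extraction idea concrete, here is a sketch of what an invoice schema might look like, plus a small check on a returned record. The field names are hypothetical. The article does not say how the schema is passed to Marker, so that step is left out. The checker is a deliberately minimal stand-in, not a full JSON Schema validator.

```python
import json

# Hypothetical invoice schema; the field names are illustrative, not taken from Marker.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "vendor_name": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}

# Only the two JSON types used above are mapped to Python types.
_JSON_TYPES = {"string": str, "number": (int, float)}


def missing_or_mistyped(record: dict, schema: dict) -> list[str]:
    """List required fields that are absent, then present fields with the wrong type.

    Handles flat string/number fields only; it is not a JSON Schema validator.
    """
    problems = [name for name in schema.get("required", []) if name not in record]
    for name, spec in schema.get("properties", {}).items():
        if name not in record:
            continue
        expected = _JSON_TYPES.get(spec.get("type"))
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected is not None and (isinstance(value, bool) or not isinstance(value, expected)):
            problems.append(name)
    return problems


# The schema serialized as a JSON string, the form in which it would be sent along.
schema_json = json.dumps(INVOICE_SCHEMA, indent=2)

if __name__ == "__main__":
    print(schema_json)
    # A made-up record standing in for an extraction result.
    print(missing_or_mistyped({"invoice_number": "INV-001"}, INVOICE_SCHEMA))
```

Checking the returned record before using it is a cheap safeguard, because a field the model could not find in the document may come back missing or empty.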

On olmOCR-Bench, Marker outperformed all tested models, including GPT-4o, Deepseek OCR, and Mistral OCR.

Marker costs $4 per 1,000 pages in fast and balanced modes, and $6 per 1,000 pages in accurate mode or for structured extraction. OCR costs $2 per 1,000 pages.
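The per-1,000-page prices above translate into job costs as in the following sketch. The tier labels are made-up dictionary keys rather than official mode names, and the calculation assumes simple pro-rata billing with no minimums or rounding, which the article does not confirm.

```python
# Prices in USD per 1,000 pages, as quoted in the article.
# The keys are illustrative labels, not official identifiers.
PRICE_PER_1000_PAGES_USD = {
    "marker_fast": 4,
    "marker_balanced": 4,
    "marker_accurate": 6,
    "marker_structured": 6,
    "ocr": 2,
}


def estimate_cost_usd(pages: int, tier: str) -> float:
    """Pro-rata cost estimate; assumes no minimum charge and no rounding."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if tier not in PRICE_PER_1000_PAGES_USD:
        raise KeyError(f"unknown tier: {tier!r}")
    return pages * PRICE_PER_1000_PAGES_USD[tier] / 1000


if __name__ == "__main__":
    for tier in PRICE_PER_1000_PAGES_USD:
        print(f"{tier}: ${estimate_cost_usd(2500, tier):.2f} for 2,500 pages")
```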
