Extract Text from Documents and Images with Datalab Marker and OCR
Datalab's state-of-the-art document parsing and text extraction models, Marker and OCR, are now available on Replicate. Marker converts PDF, DOCX, and PPTX files, as well as images, into markdown or JSON. It formats tables, math, and code, extracts images, and can pull specific fields when given a JSON schema. OCR detects text in 90 languages from images and documents and returns reading order and table grids.
Marker is based on a popular open-source project with 29k stars on GitHub, and OCR is based on the Surya project, which has 19k stars. Both models are fast and accurate, and they outperform established tools such as Tesseract. Marker processes a page in about 0.18 seconds and can handle 120 pages per second when batched.
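To put the two speed figures in perspective, the short sketch below turns them into rough wall-clock estimates for a job of a given size. It only does arithmetic on the numbers quoted above. The function names are illustrative, and actual throughput will depend on hardware and document complexity.

```python
# Speed figures quoted in the article; treat them as approximate.
SECONDS_PER_PAGE = 0.18          # latency for a single page
BATCHED_PAGES_PER_SECOND = 120   # throughput when pages are batched


def sequential_seconds(pages: int) -> float:
    """Estimated time when pages are processed one at a time."""
    return pages * SECONDS_PER_PAGE


def batched_seconds(pages: int) -> float:
    """Estimated time when pages are processed at the batched rate."""
    return pages / BATCHED_PAGES_PER_SECOND


if __name__ == "__main__":
    pages = 10_000
    print(f"one at a time: {sequential_seconds(pages):.0f} s")
    print(f"batched:       {batched_seconds(pages):.1f} s")
```

For 10,000 pages this works out to about 30 minutes page by page versus under a minute and a half at the batched rate.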
One particularly powerful feature of Marker is structured extraction: given a JSON schema, it can pull specific fields out of a document such as an invoice. Marker’s performance was evaluated on olmOCR-Bench, a benchmark of 1,403 PDF files and 7,010 test cases that measures how accurately OCR systems convert PDF documents to markdown while preserving critical textual and structural information.
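For the invoice case mentioned above, a schema might look like the following sketch. The field names and the sample result are hypothetical, chosen only for illustration. The article does not say how the schema is passed to the model, so check the model's input documentation on Replicate for the actual parameter name and format.

```python
import json

# Hypothetical invoice schema; field names are illustrative and not
# taken from Datalab's or Replicate's documentation.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string", "description": "Issue date as printed"},
        "vendor_name": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}


def missing_required(schema: dict, extracted: dict) -> list[str]:
    """Return the required top-level fields absent from an extraction result."""
    return [key for key in schema.get("required", []) if key not in extracted]


# Serialized form, in case the schema has to be sent as a JSON string.
schema_json = json.dumps(INVOICE_SCHEMA)

# Made-up extraction result, used only to exercise the check above.
sample_result = {"invoice_number": "INV-001", "vendor_name": "Acme"}
print(missing_required(INVOICE_SCHEMA, sample_result))  # ['total_amount']
```

Checking the returned fields against the schema's `required` list is a cheap way to catch incomplete extractions before they reach downstream systems.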
Marker outperformed all tested models on this benchmark, including GPT-4o, DeepSeek OCR, and Mistral OCR. Marker costs $4 per 1,000 pages in fast and balanced modes and $6 per 1,000 pages in accurate mode or for structured extraction. OCR costs $2 per 1,000 pages.
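The prices above can be turned into a quick budget estimate. The sketch below assumes that billing scales linearly with page count, which the article does not confirm, and the mode labels are illustrative names rather than official identifiers.

```python
# Prices in USD per 1,000 pages, as quoted in the article.
# The dictionary keys are illustrative labels, not official mode names.
PRICE_PER_1000_PAGES = {
    "marker_fast": 4.0,
    "marker_balanced": 4.0,
    "marker_accurate": 6.0,
    "marker_structured": 6.0,
    "ocr": 2.0,
}


def estimate_cost(pages: int, mode: str) -> float:
    """Estimate cost in USD, assuming price scales linearly with pages."""
    if mode not in PRICE_PER_1000_PAGES:
        raise ValueError(f"unknown mode: {mode!r}")
    return pages / 1000 * PRICE_PER_1000_PAGES[mode]


if __name__ == "__main__":
    for mode in PRICE_PER_1000_PAGES:
        print(f"{mode:18s} ${estimate_cost(2500, mode):.2f}")
```

Under this assumption, a 2,500-page batch would cost $10 in balanced mode, $15 in accurate mode or with structured extraction, and $5 with OCR.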