opendatalab/MinerU

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

MinerU is an open-source Python tool from OpenDataLab that converts complex PDFs and Office documents into clean markdown or JSON for LLM and RAG workflows. It preserves tables, formulas, and document hierarchy using layout analysis and OCR, producing output that LLMs can process accurately.

64,186 stars · 5,424 forks · Python · Updated May 2026
✅ Reviewed by My AI Guide, vetted for vibe builders

Our Review

OpenDataLab released MinerU in 2024 to solve a parsing problem that every team building RAG pipelines eventually hits: standard PDF text extraction produces poor output for complex documents. A research paper with equations, multi-column layouts, figures, and tables becomes a disorganized block of text when processed with naive PDF parsers -- the structure that makes the information useful is destroyed at extraction time. MinerU applies layout analysis before extraction, treating the document as a structured object: equations become LaTeX, tables become markdown tables, figures are identified with their captions, and headings preserve the document hierarchy.

Key capabilities

  • Layout-aware PDF parsing: identifies columns, headings, figures, and tables before extraction -- not a raw text dump but a structured reconstruction
  • Formula extraction: mathematical equations are converted to LaTeX notation rather than garbled character sequences
  • Table parsing: complex multi-row, multi-column tables become clean markdown tables or JSON arrays preserving cell relationships
  • OCR for scanned documents: multi-language optical character recognition handles documents without embedded text layers
  • Office format support: processes DOCX, PPTX, and XLSX in addition to PDF, all to consistent markdown or JSON output
  • LLM-ready output: clean markdown and structured JSON designed for direct ingestion into RAG pipelines, vector databases, and LLM context
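To make the "LLM-ready output" point concrete, here is a minimal sketch of how heading-preserving markdown can be split into retrieval chunks that carry their section path. The function name `chunk_by_heading` and the sample text are illustrative, not part of MinerU, and the sketch assumes ATX-style `#` headings in the converted markdown.

```python
import re

# Matches ATX headings such as "## Table 1" (assumed heading style).
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split markdown into chunks, each tagged with its heading path."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    body: list[str] = []              # lines under the current heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Pop headings at the same or a deeper level before pushing.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


sample = """# Results
Overview text.
## Table 1
| a | b |
|---|---|
| 1 | 2 |
# Methods
Details.
"""
chunks = chunk_by_heading(sample)
for chunk in chunks:
    print(" > ".join(chunk["path"]), "|", len(chunk["text"]), "chars")
```

Keeping the heading path with each chunk lets a retriever show where a table or paragraph came from. The sketch does not skip `#` lines inside fenced code blocks, which a production splitter would need to handle.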

Getting started

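Install with pip install mineru. Convert a PDF from the command line with mineru -p document.pdf -o output/, where -p is the input path and -o is the output directory, as documented in the project README. MinerU can also be called from Python; the entry points have changed between releases, so check the repository for the API that matches your installed version. Output options include markdown, JSON, and a structured extraction with figure references and page metadata.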
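For converting a folder of PDFs, a thin wrapper around the command line is often enough. The sketch below assumes the `mineru -p <input> -o <output>` command form from the project README; if your installed version uses a different syntax, change `build_command`. The helper names `build_command` and `convert_all` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf: Path, out_dir: Path) -> list[str]:
    # Assumption: the CLI takes -p for the input path and -o for the output dir.
    return ["mineru", "-p", str(pdf), "-o", str(out_dir)]


def convert_all(src_dir: Path, out_dir: Path, dry_run: bool = False) -> list[list[str]]:
    """Run one conversion per PDF in src_dir; return the commands used."""
    commands = [build_command(pdf, out_dir) for pdf in sorted(src_dir.glob("*.pdf"))]
    if dry_run:
        return commands
    if shutil.which("mineru") is None:
        raise RuntimeError("mineru executable not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    for command in commands:
        # check=True stops the batch on the first failed conversion.
        subprocess.run(command, check=True)
    return commands


example = build_command(Path("document.pdf"), Path("output"))
print(" ".join(example))
```

Passing `dry_run=True` returns the commands without executing them, which is a cheap way to confirm paths before a long batch starts.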

Limitation

Heavy model dependencies -- MinerU uses deep learning models for layout analysis and formula extraction that require several GB of downloads on first run. Processing speed is slower than naive text extraction, typically 2-10 seconds per page depending on document complexity and hardware. Very long documents (500+ pages) may require batching. GPU acceleration is optional but significantly improves processing speed for large document batches.
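The batching mentioned above for very long documents comes down to splitting the page count into fixed-size ranges. A minimal sketch follows; `page_batches` is an illustrative helper, and how each range is handed to the parser (page-range options or pre-splitting the PDF) depends on the installed version.

```python
def page_batches(total_pages: int, batch_size: int) -> list[tuple[int, int]]:
    """Return inclusive, zero-based (start, end) page ranges covering the document."""
    if total_pages < 0 or batch_size <= 0:
        raise ValueError("total_pages must be >= 0 and batch_size must be > 0")
    return [
        (start, min(start + batch_size, total_pages) - 1)
        for start in range(0, total_pages, batch_size)
    ]


batches = page_batches(1200, 500)
print(batches)  # [(0, 499), (500, 999), (1000, 1199)]
```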

Our Verdict

MinerU's more than 64,000 GitHub stars reflect a genuine infrastructure gap in the AI ecosystem. Every RAG pipeline needs a document ingestion stage, and the quality of that stage determines the quality of retrieval. Naive PDF parsing that loses table structure, corrupts equations, and discards document hierarchy produces retrieval chunks that are syntactically present but semantically degraded. MinerU's layout-analysis-first approach produces output that better represents the original document's information structure.

The formula and table extraction capabilities are where MinerU pulls ahead of simpler tools most clearly. For technical domains -- scientific papers, financial reports, technical documentation -- these are the content types where naive extraction fails most visibly. Converting a multi-column academic paper with equations and tables to usable markdown is a task that previously required manual cleanup or expensive commercial APIs; MinerU handles it reliably.

The practical constraints are compute and speed. The deep learning models for layout analysis are not lightweight -- initial setup requires several gigabytes of model downloads, and processing is measured in seconds per page rather than milliseconds. For teams processing thousands of documents, GPU access and batching strategies are necessary for practical throughput. For teams with moderate document volumes and high quality requirements, the tradeoff favors MinerU: slower processing in exchange for output that keeps the document's structure.

Frequently Asked Questions

What is MinerU and why is it better than standard PDF text extraction?

MinerU is an open-source Python tool that converts complex PDFs and Office documents into clean markdown or JSON for LLM and RAG workflows. Standard PDF text extraction reads embedded character data and produces disorganized text that loses table structure, corrupts equations, and discards document hierarchy. MinerU applies layout analysis first -- identifying columns, headings, tables, and figures -- and then reconstructs the document as structured markdown that preserves the original information organization. The difference is most visible on academic papers, financial reports, and technical documentation.

Which document formats does MinerU support?

MinerU supports PDF (including scanned PDFs via OCR), DOCX, PPTX, and XLSX. All formats produce consistent markdown or structured JSON output. PDF processing includes layout analysis, formula extraction, table parsing, and figure identification. For DOCX and PPTX, it preserves heading structure, table formatting, and embedded content. XLSX conversion produces structured table representations. MinerU is actively developed, and format support continues to expand across releases.

How does MinerU handle mathematical equations and formulas?

MinerU uses a dedicated formula detection model that identifies equation regions in PDFs and converts them to LaTeX notation. This covers both inline equations embedded in text and display equations set on their own lines. The LaTeX output can be rendered by downstream systems or passed directly to LLMs that understand LaTeX notation. For PDFs where equations are rendered as images rather than text (common in older papers), MinerU applies OCR-based formula recognition to reconstruct the LaTeX.
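Downstream code sometimes needs to pull the formulas back out of the converted markdown, for example to index them separately. The sketch below assumes inline formulas are wrapped in `$...$` and display formulas in `$$...$$`; that delimiter convention is an assumption to verify against your own output, and `extract_formulas` is an illustrative helper.

```python
import re

# Display math is matched first so "$$...$$" is not read as two inline spans.
DISPLAY = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# Inline math: single-line, delimited by unescaped dollar signs.
INLINE = re.compile(r"(?<!\\)\$([^$\n]+?)(?<!\\)\$")


def extract_formulas(markdown: str) -> dict[str, list[str]]:
    """Collect LaTeX sources for display and inline formulas."""
    display = [body.strip() for body in DISPLAY.findall(markdown)]
    # Blank out display blocks so their delimiters cannot match as inline math.
    remainder = DISPLAY.sub(" ", markdown)
    inline = [body.strip() for body in INLINE.findall(remainder)]
    return {"display": display, "inline": inline}


sample = r"""Energy is $E = mc^2$ as shown.

$$
\int_0^1 x^2\,dx = \frac{1}{3}
$$
"""
formulas = extract_formulas(sample)
print(formulas["inline"])
print(formulas["display"])
```

A regex pass like this is fragile around currency amounts such as "$5 and $10", so text with literal dollar signs needs an extra filter.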

Does MinerU require a GPU?

A GPU is optional but significantly improves performance. On CPU, MinerU takes 2-10 seconds per page depending on document complexity. With a CUDA-capable GPU, processing is 3-8x faster on typical documents, which matters at scale. For occasional document processing, CPU is sufficient. For processing thousands of documents in batch (building a large RAG corpus), GPU access on cloud infrastructure becomes practically necessary for reasonable throughput.
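The figures above translate into a rough planning estimate. This sketch takes the 2-10 seconds per page and 3-8x speedup ranges at face value and pairs them into best and worst cases; the 100,000-page corpus size is a hypothetical input, and real throughput will vary with hardware and document mix.

```python
def estimate_hours(pages: int, sec_per_page: float, speedup: float = 1.0) -> float:
    """Wall-clock hours for a single worker processing pages sequentially."""
    return pages * sec_per_page / speedup / 3600


pages = 100_000  # hypothetical corpus size

cpu_low = estimate_hours(pages, 2)                # fastest CPU case
cpu_high = estimate_hours(pages, 10)              # slowest CPU case
gpu_best = estimate_hours(pages, 2, speedup=8)    # simple pages, largest speedup
gpu_worst = estimate_hours(pages, 10, speedup=3)  # complex pages, smallest speedup

print(f"CPU: {cpu_low:.1f} to {cpu_high:.1f} hours")
print(f"GPU: {gpu_best:.1f} to {gpu_worst:.1f} hours")
```

Under these assumptions a 100,000-page corpus takes roughly 56 to 278 hours on CPU and roughly 7 to 93 hours on a single GPU, which is why large batches push toward GPU access.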

How does MinerU compare to commercial PDF extraction APIs?

Commercial PDF extraction APIs (AWS Textract, Azure Document Intelligence, Google Document AI) provide managed infrastructure, SLA guarantees, and simpler setup at the cost of per-page fees (typically $0.001-0.015 per page). MinerU is free to run but requires initial model downloads, operational maintenance, and, for high-volume work, GPU infrastructure. For teams processing millions of pages, MinerU's compute cost is far lower than commercial API fees. For teams processing occasional documents without GPU infrastructure, commercial APIs may be faster to operationalize.
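A back-of-the-envelope comparison shows where the crossover comes from. The sketch below uses the per-page fee range quoted above and derives GPU seconds per page from the 2-10 second CPU range and 3-8x speedup given earlier in this FAQ. The $1.00 per GPU-hour price is a hypothetical placeholder, and engineering and maintenance time are not included.

```python
def api_cost(pages: int, price_per_page: float) -> float:
    """Total fee for a pay-per-page API."""
    return pages * price_per_page


def self_hosted_cost(pages: int, gpu_sec_per_page: float, gpu_hourly_usd: float) -> float:
    """Compute-only cost of running the conversion on a rented GPU."""
    return pages * gpu_sec_per_page / 3600 * gpu_hourly_usd


pages = 1_000_000
gpu_hourly_usd = 1.00  # hypothetical rental price, not a quoted rate

api_low = api_cost(pages, 0.001)
api_high = api_cost(pages, 0.015)
self_low = self_hosted_cost(pages, 2 / 8, gpu_hourly_usd)    # 0.25 s per page
self_high = self_hosted_cost(pages, 10 / 3, gpu_hourly_usd)  # about 3.33 s per page

print(f"API:         ${api_low:,.0f} to ${api_high:,.0f}")
print(f"Self-hosted: ${self_low:,.0f} to ${self_high:,.0f}")
```

With these inputs, one million pages cost about $1,000 to $15,000 through an API versus about $69 to $926 in GPU time. For small volumes the setup effort can outweigh that saving.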

How do I install MinerU?

Visit the GitHub repository at https://github.com/opendatalab/MinerU for installation instructions.

What license does MinerU use?

MinerU uses the AGPL-3.0 license.

What are alternatives to MinerU?

Explore related tools and alternatives on My AI Guide.

Open source & community-verified

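AGPL-3.0 licensed: free to use, modify, and self-host. AGPL-3.0 is a copyleft license, so if you distribute a modified version or offer one as a network service, you must make its source available under the same license. 64,186 developers have starred the repository, a sign of broad community interest.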

Reviewed by My AI Guide for relevance, quality, and active maintenance before listing.

Topics

pdf, ocr, layout-analysis, document-analysis, pdf-parser, pdf-extractor-llm, pdf-extractor-rag, python, docx, pptx, xlsx
