Parsing PDFs has slowly driven me insane over the last year. Here are 8 weird edge cases to show you why PDF parsing isn't an easy problem. 🧵
PDFs have a font map (the ToUnicode CMap) that tells you which actual character corresponds to each rendered glyph, so you can copy/paste. Unfortunately, these maps can lie, so the character you copy is not the one you see. If you're unlucky, it's total gibberish.
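Here's a minimal sketch of how to look at that map yourself, assuming pypdf is installed and a local file named sample.pdf; the ToUnicode CMap it dumps is exactly the mapping that copy/paste relies on.

```python
# Sketch (assumes pypdf and a local sample.pdf): dump each font's ToUnicode
# CMap. Its beginbfchar/beginbfrange entries map character codes to Unicode -
# if they're wrong, extracted text won't match what's rendered on the page.
from pypdf import PdfReader

reader = PdfReader("sample.pdf")
fonts = reader.pages[0]["/Resources"]["/Font"]
for name, font in fonts.items():
    font = font.get_object()
    to_unicode = font.get("/ToUnicode")
    if to_unicode is None:
        print(name, "has no ToUnicode map - extractors have to guess")
        continue
    print(name)
    print(to_unicode.get_object().get_data().decode("latin-1"))
```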
PDFs can have invisible text that only shows up when you try to extract it. "Measurement in your home" is only here once...or is it?
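A sketch of how that text gets there, using reportlab (file name and strings are made up): text render mode 3 draws nothing on screen but still lands in the content stream, so extractors pick it up.

```python
# Sketch (assumes reportlab): render mode 3 means "no fill, no stroke", so the
# second copy is invisible on the page but still present for text extraction.
from reportlab.pdfgen import canvas

c = canvas.Canvas("invisible.pdf")
c.drawString(72, 720, "Measurement in your home")   # the visible copy
hidden = c.beginText(72, 700)
hidden.setTextRenderMode(3)                         # invisible text
hidden.textLine("Measurement in your home")         # the hidden duplicate
c.drawText(hidden)
c.save()
```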
Math is a whole can of worms. Remember the font map problem? Math fonts almost always map their glyphs to effectively random characters - here we get some strange Tamil/Amharic combo.
Math bounding boxes are always fun - see how each formula is broken up into lots of tiny sections? Putting them together is a great time!
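One way to stitch them back together (a rough sketch, not marker's actual logic) is to greedily merge boxes that overlap or nearly touch:

```python
# Sketch: merge fragment bounding boxes (x0, y0, x1, y1) that overlap or sit
# within `gap` points of each other. A single greedy pass for brevity - a real
# implementation would repeat until no more merges happen.
def merge_boxes(boxes, gap=2.0):
    merged = []
    for box in sorted(boxes):
        for i, other in enumerate(merged):
            if (box[0] <= other[2] + gap and other[0] <= box[2] + gap and
                    box[1] <= other[3] + gap and other[1] <= box[3] + gap):
                merged[i] = (min(box[0], other[0]), min(box[1], other[1]),
                             max(box[2], other[2]), max(box[3], other[3]))
                break
        else:
            merged.append(box)
    return merged

print(merge_boxes([(10, 10, 20, 20), (21, 10, 30, 20), (100, 10, 110, 20)]))
# the two touching fragments become one box; the distant one stays separate
```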
Once upon a time, someone decided that their favorite letters should be connected together into one character - like ffi or fl. Unfortunately, PDFs are inconsistent with this, and sometimes will totally skip ligatures - very ecient of them.
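When the glyph does map to a real Unicode ligature code point, you can expand it back with plain NFKC normalization; when the map drops it entirely, the letters are just gone. A tiny stdlib sketch:

```python
# Sketch: U+FB03 is the "ffi" ligature. NFKC normalization expands it back to
# three letters - but only if the ToUnicode map produced it in the first place.
import unicodedata

print(unicodedata.normalize("NFKC", "e\ufb03cient"))  # -> "efficient"
print("ecient")  # what you get when the ligature glyph is skipped outright
```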
Not all text in a PDF is correct. Some PDFs are digital, and the text was added on creation. But others have had invisible OCR text added, sometimes based on pretty bad text detection. That's when you get this mess:
Overlapping text elements can get crazy - see how the watermark overlaps all the other text? Forget about finding good reading order here.
I've been showing you somewhat nice line bounding boxes. But PDFs just have character positions inside - you have to postprocess to join them into lines. In tables, this can get tricky, since it's hard to know when a new cell starts:
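A minimal sketch of the naive version, assuming pdfplumber and a local sample.pdf: bucket characters by their (rounded) vertical position and sort each bucket left to right. In a table, this happily glues neighbouring cells into one string.

```python
# Sketch (assumes pdfplumber): group characters into "lines" by rounded top
# coordinate, then sort each group left to right. Works for simple prose,
# falls apart in tables where cell boundaries aren't marked in the text.
from collections import defaultdict
import pdfplumber

with pdfplumber.open("sample.pdf") as pdf:
    rows = defaultdict(list)
    for ch in pdf.pages[0].chars:
        rows[round(ch["top"])].append(ch)

for top in sorted(rows):
    line = "".join(c["text"] for c in sorted(rows[top], key=lambda c: c["x0"]))
    print(line)
```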
You might be wondering why you should even bother with the text inside PDFs. The answer is that a lot of PDFs have good text, and it's faster and more accurate to just pull it out.
Let's explore the internals of the PDF format to figure out how Adobe did this to us.
I created this amazing sample PDF with reportlab.
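Something along these lines (the exact content of the sample isn't shown, so this is a stand-in):

```python
# Sketch: a one-line sample PDF with reportlab, enough to poke at with strings.
from reportlab.pdfgen import canvas

c = canvas.Canvas("sample.pdf")
c.drawString(72, 720, "Hello PDF")
c.save()
```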
We can use `strings sample.pdf | less` to extract the ASCII text from the PDF binary.
This is the whole PDF! Lines like `8 0 obj` define the individual objects (you can see them refer to each other). The xref table stores the byte offset of each object, so a reader can jump straight to it.
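You can follow that pointer yourself with a few lines of Python (a sketch that assumes a classic xref table rather than a cross-reference stream):

```python
# Sketch: the trailer ends with "startxref" and a byte offset. Seek there and
# you land on the xref table, which lists the byte offset of every object.
data = open("sample.pdf", "rb").read()
start = data.rindex(b"startxref") + len(b"startxref")
offset = int(data[start:].split()[0])
print(data[offset:offset + 200].decode("latin-1"))  # start of the xref table
```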
High quality math is the secret sauce for reasoning models.
The best math data is in old papers. But OCRing that math is full of insane edge cases.
Let's talk about how to solve this, and how you can get better math data than many frontier labs 🧵
Many papers, including deepseek-math and nvidia nemotron, have concluded that math pretraining is critical for general LLM reasoning:
Quality matters! The SmolLM paper from Hugging Face found that filtering the math tokens from 34B down to 10B (keeping only the highest-quality tokens) and training for a fixed number of steps improved reasoning performance (FineMath 4+ vs 3+).
Throughput was measured on 1x H100 with Nvidia MPS enabled, 10 workers, and chunking. We're finalizing a vLLM config for better performance (the arch is mostly llama/qwen, but with some non-standard pieces).
We've improved marker (PDF -> markdown) a lot in 3 months - accuracy and speed now beat llamaparse, mathpix, and docling.
We shipped:
- llm mode that augments marker with models like gemini flash
- improved math, w/inline math
- links and references
- better tables and forms
Find marker at
Benchmarking markdown conversion isn't easy - different services produce different formats. We use both a heuristic text-matching method and LLM-as-a-judge.
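The heuristic side looks roughly like this (a sketch, not our actual benchmark code): align the converted output against reference text and score the overlap.

```python
# Sketch: a crude 0-1 score for how much of the reference text survives in the
# converted markdown, using a plain word-level sequence alignment.
from difflib import SequenceMatcher

def alignment_score(reference: str, converted: str) -> float:
    return SequenceMatcher(None, reference.split(), converted.split()).ratio()

print(alignment_score("a small table with two columns",
                      "a table with two columns"))
```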
Marker v1 is out! This is a complete rewrite - 2x faster, much more accurate, easy to extend, with markdown + JSON chunk output.
Just run `pip install -U marker-pdf`.
Find it at .
Marker v1 does layout and order in one step, which turns three model calls into one. The layout model handles more block types, like code and lists, that were tricky before. github.com/VikParuchuri/m…
The code is modular, with a consistent internal schema, and easy to extend with your own logic. Data comes in via providers, processors operate on individual blocks, and output is generated through renderers. You can override any part of the system.
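As a rough illustration of that pattern (made-up class names, not marker's actual API):

```python
# Sketch of the provider -> processor -> renderer split: each stage owns one
# concern and shares a common block schema, so any stage can be swapped out.
from dataclasses import dataclass, field

@dataclass
class Block:
    kind: str            # e.g. "text", "table", "equation"
    text: str
    meta: dict = field(default_factory=dict)

class TextProvider:
    """Provider: turns an input source into blocks with a shared schema."""
    def load(self, source: str) -> list[Block]:
        return [Block("text", line) for line in source.splitlines()]

class UppercaseProcessor:
    """Processor: operates on individual blocks."""
    def __call__(self, block: Block) -> Block:
        return Block(block.kind, block.text.upper(), block.meta)

class MarkdownRenderer:
    """Renderer: produces the final output from processed blocks."""
    def render(self, blocks: list[Block]) -> str:
        return "\n\n".join(b.text for b in blocks)

blocks = TextProvider().load("hello\nworld")
blocks = [UppercaseProcessor()(b) for b in blocks]
print(MarkdownRenderer().render(blocks))
```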
Table extraction is a task frontier LLMs struggle with; this is Gemini Flash extracting the first table. Columns are added, mixed up, and values hallucinated.