Vik Paruchuri
Aug 12 · 11 tweets
Parsing PDFs has slowly driven me insane over the last year. Here are 8 weird edge cases to show you why PDF parsing isn't an easy problem. 🧵
PDFs have a font map that tells you what actual character is connected to each rendered character, so you can copy/paste. Unfortunately, these maps can lie, so the character you copy is not what you see. If you're unlucky, it's total gibberish.
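Under the hood this is the font's ToUnicode CMap. A minimal sketch (toy glyph IDs and toy mappings, not a real PDF structure) of why what you extract can differ from what renders:

```python
# Toy model of a PDF font map: the page stores glyph IDs, and the font's
# ToUnicode CMap says which Unicode character each glyph "means".
rendered_glyphs = [3, 15, 22, 22, 27]  # glyphs that visually draw "Hello"

honest_cmap = {3: "H", 15: "e", 22: "l", 27: "o"}
# A lying map: the same glyphs, but extraction yields unrelated characters.
lying_cmap = {3: "\u0b95", 15: "\u0bb2", 22: "\u0bae", 27: "\u0b9f"}

def extract_text(glyphs, cmap):
    # What copy/paste effectively does: look each glyph up in the map,
    # falling back to U+FFFD when the glyph is unmapped.
    return "".join(cmap.get(g, "\ufffd") for g in glyphs)

print(extract_text(rendered_glyphs, honest_cmap))  # Hello
print(extract_text(rendered_glyphs, lying_cmap))   # gibberish
```

The rendered page is identical in both cases; only the map (and therefore the extracted text) differs.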
PDFs can have invisible text that only shows up when you try to extract it. "Measurement in your home" is only here once...or is it?
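One common way this happens is text render mode 3 (invisible), often used for OCR overlay layers. A hedged sketch with hypothetical char records (shaped loosely like what pdfminer-style extractors hand you):

```python
# Each char is drawn with a text render mode; mode 3 means "draw nothing".
chars = [
    {"text": "M", "render_mode": 0},
    {"text": "e", "render_mode": 0},
    {"text": "M", "render_mode": 3},  # invisible duplicate layer
    {"text": "e", "render_mode": 3},
]

# Naive extraction grabs everything, so the text appears "twice".
everything = "".join(c["text"] for c in chars)
# Filtering on render mode keeps only what a reader actually sees.
visible_only = "".join(c["text"] for c in chars if c["render_mode"] != 3)

print(everything)    # MeMe
print(visible_only)  # Me
```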
Math is a whole can of worms. Remember the font map problem? Well, math is almost always random characters - here we get some strange Tamil/Amharic combo.
Math bounding boxes are always fun - see how each formula is broken up into lots of tiny sections? Putting them together is a great time!
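Stitching those fragments back together usually starts with a bounding-box union. A minimal sketch over (x0, y0, x1, y1) boxes - the genuinely hard part, deciding which fragments belong to the same formula, is left out:

```python
def union_bbox(boxes):
    """Smallest box containing every fragment box (x0, y0, x1, y1)."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )

# Three tiny fragments of one formula -> one box for the whole equation.
fragments = [(10, 100, 30, 112), (32, 96, 48, 115), (50, 100, 90, 112)]
print(union_bbox(fragments))  # (10, 96, 90, 115)
```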
Once upon a time, someone decided that their favorite letters should be connected together into one character - like ffi or fl. Unfortunately, PDFs are inconsistent with this, and sometimes will totally skip ligatures - very ecient of them. Image
Not all text in a PDF is correct. Some PDFs are digital, and the text was added on creation. But others have had invisible OCR text added, sometimes based on pretty bad text detection. That's when you get this mess:
Overlapping text elements can get crazy - see how the watermark overlaps all the other text? Forget about finding good reading order here.
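Detecting the collisions is just axis-aligned box intersection; deciding reading order once boxes overlap is the hard part. A sketch of the easy half, with made-up page coordinates:

```python
def boxes_overlap(a, b):
    """True if two (x0, y0, x1, y1) boxes intersect (y grows downward)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

watermark = (0, 0, 600, 800)     # spans the whole page
body_line = (50, 100, 400, 115)  # sits inside the watermark's box
footer = (50, 820, 400, 835)     # below the watermark box

print(boxes_overlap(watermark, body_line))  # True
print(boxes_overlap(watermark, footer))     # False
```

Once a page-spanning box overlaps everything, simple top-to-bottom, left-to-right sorting of boxes falls apart - which is why watermarks wreck reading order.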
I've been showing you somewhat nice line bounding boxes. But PDFs just have character positions inside - you have to postprocess to join them into lines. In tables, this can get tricky, since it's hard to know when a new cell starts:
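A common heuristic groups characters whose baselines sit within a small y-tolerance, then sorts each group by x. A sketch with hypothetical char dicts (real PDFs need more care: rotated text, superscripts, and - as above - table cells):

```python
def group_chars_into_lines(chars, y_tol=2.0):
    """chars: dicts with 'text', 'x', 'y' (baseline). Returns line strings."""
    lines = []  # each entry: (baseline_y of first char, [chars])
    for ch in sorted(chars, key=lambda c: (c["y"], c["x"])):
        if lines and abs(lines[-1][0] - ch["y"]) <= y_tol:
            lines[-1][1].append(ch)  # close enough vertically: same line
        else:
            lines.append((ch["y"], [ch]))
    # Re-sort each line by x, since baseline jitter can reorder chars.
    return [
        "".join(c["text"] for c in sorted(line, key=lambda c: c["x"]))
        for _, line in lines
    ]

chars = [
    {"text": "H", "x": 10, "y": 100.0},
    {"text": "i", "x": 18, "y": 100.5},  # slight baseline jitter
    {"text": "t", "x": 10, "y": 120.0},  # next line down
    {"text": "o", "x": 18, "y": 120.0},
]
print(group_chars_into_lines(chars))  # ['Hi', 'to']
```

Note how everything here is distance thresholds - there's no marker in the PDF saying "new line" or "new cell", which is exactly why tables are hard.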
You might be wondering why you should even bother with the text inside PDFs. The answer is that a lot of PDFs have good text, and it's faster and more accurate to just pull it out.

This is what we do with marker - we only OCR if the text is bad. github.com/datalab-to/mar…
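marker's actual check lives in the repo; a hypothetical, much-simplified version of such a heuristic might just score the fraction of suspicious characters (illustrative only, not marker's real logic):

```python
import unicodedata

def text_looks_garbled(text, threshold=0.2):
    """Crude check: too many replacement/control/unassigned chars -> fall
    back to OCR. Illustrative sketch, not marker's actual heuristic."""
    stripped = [ch for ch in text if ch not in "\n\t "]
    if not stripped:
        return True  # no extractable text at all: must OCR
    bad = sum(
        1 for ch in stripped
        if ch == "\ufffd" or unicodedata.category(ch) in ("Cc", "Cn", "Co")
    )
    return bad / len(stripped) > threshold

print(text_looks_garbled("Measurement in your home"))          # False
print(text_looks_garbled("\ufffd\ufffd\ufffdme\ufffd\ufffd"))  # True
```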
Anyways, back to fixing more crazy edge cases. Let me know if you've come across any other PDF weirdness.

More from @VikParuchuri

Apr 24
We shipped an alpha version of the new Surya OCR model. No hype, just facts:

- 90+ languages (focus on en, romance langs, zh, ar, ja, ko)
- LaTeX and formatting
- Char/word/line bboxes
- ~500M non-embed params
- 10-20 pages/s
Get with `pip install --pre -U surya-ocr`. Use with marker: `pip install --pre -U marker-pdf`, then pass the `format_lines` marker option.

More info at github.com/VikParuchuri/s… and github.com/VikParuchuri/m…
Throughput measured using 1x H100 with Nvidia MPS enabled, 10 workers, and chunking. Finalizing a vLLM config for improved performance (the arch is mostly llama/qwen, but with some non-standard stuff).
Feb 19
We've improved marker (PDF -> markdown) a lot in 3 months - accuracy and speed now beat llamaparse, mathpix, and docling.

We shipped:
- llm mode that augments marker with models like gemini flash
- improved math, w/inline math
- links and references
- better tables and forms

Benchmarking markdown conversion isn't easy - different services have different formats. We use both a heuristic text-matching method and LLM-as-a-judge.

The code for the benchmarks is in the marker repo. github.com/VikParuchuri/m…
LLM mode iterates on marker output for certain blocks. You can use gemini, or local models via ollama. More models coming soon.

Marker + LLMs is faster and hallucination-free vs using LLMs alone. Here marker + Gemini Flash beats Gemini Flash alone on a FinTabNet benchmark.
Nov 27, 2024
Marker v1 is out! This is a complete rewrite - 2x faster, much more accurate, easy to extend, with markdown + JSON chunk output.

Just run `pip install -U marker-pdf`.

Marker v1 does layout and order in one step, which turns three model calls into one. The layout model handles more block types, like code and lists, that were tricky before. github.com/VikParuchuri/m…
The code is modular, with a consistent internal schema. It's easy to extend with your logic. Data comes in via providers, processors operate on individual blocks, and output is generated through renderers. You can override any part of the system.
Oct 15, 2024
I made a library to detect tables and extract them to markdown or CSV. It uses a new table recognition model I trained.
Table extraction is a task frontier LLMs have trouble with; this is Gemini Flash extracting the first table. Columns are added, mixed up, and values hallucinated.
Aug 16, 2024
Announcing Surya OCR 2! It uses a new architecture and improves on v1 in every way:

- OCR with automatic language detection for 93 languages (no more specifying languages!)
- More accurate on old/noisy documents
- 20% faster
- Basic English handwriting support



Surya OCR 2 is more accurate across all document types. It also compares favorably to Tesseract and Google Cloud OCR. The benchmarking script is in the repo.

Language is not hinted to Surya 2 for these benchmarks. github.com/VikParuchuri/s…


My earlier benchmark compared mainly clean documents, so I made a new noisy document benchmark to compare v2 and v1. This was created from tapuscorpus by @Alix_Tz. Again, language is not hinted.
Jul 12, 2024
I just released new surya layout and text detection models:

- 30% faster on GPU, 4x faster on CPU, 12x faster on MPS
- Accuracy very slightly better
- When I merge this into marker, it will be 15% faster on GPU, 3x on CPU, 7x on MPS
I used a modified version of EfficientViT from MIT (github.com/mit-han-lab/ef…), which was then adapted by @wightmanr. I made some small modifications, including adding a segmentation head. Thanks so much for the architecture/code!
I didn't change the training data much, but the new models do allow for higher resolution (since there's no global softmax attention), so benchmark scores are slightly better.


