Latest Twitter Threads by @VikParuchuri on Thread Reader App

Aug 12 • 11 tweets • 4 min read

Parsing PDFs has slowly driven me insane over the last year. Here are 8 weird edge cases to show you why PDF parsing isn't an easy problem. 🧵

PDFs have a font map that tells you what actual character is connected to each rendered character, so you can copy/paste. Unfortunately, these maps can lie, so the character you copy is not what you see. If you're unlucky, it's total gibberish.
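A toy sketch of the lying font map problem, in plain Python. The glyph IDs and both maps are made up for illustration; a real PDF stores this mapping inside the font data, and nothing here reads an actual file.

```python
# Hypothetical glyph IDs: what the font actually draws on the page.
rendered_glyphs = {1: "H", 2: "e", 3: "l", 4: "o"}

# Two hypothetical character maps for the same font.
honest_map = {1: "H", 2: "e", 3: "l", 4: "o"}  # agrees with the drawn glyphs
lying_map = {1: "K", 2: "h", 3: "o", 4: "r"}   # disagrees with the drawn glyphs


def visible_text(glyph_ids):
    """Text a human sees when the page is rendered."""
    return "".join(rendered_glyphs[g] for g in glyph_ids)


def extracted_text(glyph_ids, char_map):
    """Text you get from copy/paste, which trusts the character map."""
    return "".join(char_map[g] for g in glyph_ids)


def mismatch_rate(seen, copied):
    """Fraction of positions where the copied text differs from what is seen."""
    return sum(a != b for a, b in zip(seen, copied)) / len(seen)


glyphs = [1, 2, 3, 3, 4]
seen = visible_text(glyphs)
copied_ok = extracted_text(glyphs, honest_map)
copied_bad = extracted_text(glyphs, lying_map)

print(seen, copied_ok, copied_bad)
print(mismatch_rate(seen, copied_bad))
```

In practice you don't have the "seen" text for free, so one way to detect a bad map is to OCR the rendered page and compare it against the embedded text.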

Apr 24 • 4 tweets • 2 min read

We shipped an alpha version of the new Surya OCR model. No hype, just facts:

- 90+ languages (focus on en, romance langs, zh, ar, ja, ko)
- LaTeX and formatting
- Char/word/line bboxes
- ~500M non-embed params
- 10-20 pages/s

Get with `pip install --pre -U surya-ocr`. Use with marker: `pip install --pre -U marker-pdf`, then pass the `format_lines` marker option.

More info at github.com/VikParuchuri/s… and github.com/VikParuchuri/m… .

Feb 19 • 8 tweets • 3 min read

We've improved marker (PDF -> markdown) a lot in 3 months - accuracy and speed now beat llamaparse, mathpix, and docling.

We shipped:
- llm mode that augments marker with models like gemini flash
- improved math, w/inline math
- links and references
- better tables and forms

Find marker at github.com/VikParuchuri/m…

Benchmarking markdown conversion isn't easy - different services have different formats. We use both a heuristic text matching method, and llm as a judge.

The code for the benchmarks is in the marker repo. github.com/VikParuchuri/m…
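A minimal sketch of what a heuristic text matching score could look like, assuming you have a reference text and a converted markdown string. This is not the scoring code from the marker repo; `normalize` and `text_match_score` are hypothetical names, and the similarity measure is Python's standard `difflib`.

```python
import re
from difflib import SequenceMatcher


def normalize(md: str) -> str:
    """Strip common markdown markers and collapse whitespace, so that
    formatting differences between services don't dominate the score."""
    text = re.sub(r"[#*_`>|-]", " ", md)
    return re.sub(r"\s+", " ", text).strip().lower()


def text_match_score(reference: str, candidate: str) -> float:
    """Similarity in [0, 1] between the normalized reference and candidate."""
    matcher = SequenceMatcher(None, normalize(reference), normalize(candidate), autojunk=False)
    return matcher.ratio()


reference = "# Title\n\nSome *bold* text"
candidate = "Title\nSome bold text"
print(text_match_score(reference, candidate))
```

Normalizing first is the point of the heuristic: two services can emit different markdown for the same content and still score the same.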

Nov 27, 2024 • 8 tweets • 3 min read

Marker v1 is out! This is a complete rewrite - 2x faster, much more accurate, easy to extend, with markdown + JSON chunk output.

Just run `pip install -U marker-pdf`.

Find it at github.com/VikParuchuri/m…

Marker v1 does layout and order in one step, which turns three model calls into one. The layout model handles more block types, like code and lists, that were tricky before. github.com/VikParuchuri/m…

Oct 15, 2024 • 6 tweets • 3 min read

I made a library to detect tables and extract to markdown or csv. It uses a new table recognition model I trained.

Find it here - github.com/VikParuchuri/t…

Aug 16, 2024 • 11 tweets • 5 min read

Announcing Surya OCR 2! It uses a new architecture and improves on v1 in every way:

- OCR with automatic language detection for 93 languages (no more specifying languages!)
- More accurate on old/noisy documents
- 20% faster
- Basic English handwriting support

Find Surya here - github.com/VikParuchuri/s…

Surya OCR 2 is more accurate across all document types. It also compares favorably to Tesseract and Google Cloud OCR. The benchmarking script is in the repo.

Language is not hinted to Surya 2 for these benchmarks. github.com/VikParuchuri/s…

Jul 12, 2024 • 4 tweets • 2 min read

I just released new surya layout and text detection models:

- 30% faster on GPU, 4x faster on CPU, 12x faster on MPS
- Accuracy very slightly better
- When I merge this into marker, it will be 15% faster on GPU, 3x on CPU, 7x on MPS

I used a modified version of efficientvit from MIT (github.com/mit-han-lab/ef…), which was then adapted by @wightmanr. I made some small modifications, including adding a segmentation head. Thanks so much for the architecture/code!

Jan 12, 2024 • 13 tweets • 4 min read

Announcing surya - a multilingual text line detection model for documents. It gives you accurate line-level bboxes and column breaks.

Find it here - github.com/VikParuchuri/s…

Surya was trained on a diverse set of documents, including scientific papers. It works with every language that I've tried.

It should work with good quality scanned documents as well due to image augmentation.

Nov 30, 2023 • 22 tweets • 6 min read

I'm excited to ship marker - a pdf to markdown converter that is 10x faster than nougat, more accurate outside arXiv, and has low hallucination risk. Marker is optimized for throughput, like converting LLM pretrain data.

Find it here - github.com/VikParuchuri/m…

Nougat is an amazing model, but is slow and hallucination-prone (1.5% of pages in arXiv, 5%+ outside) due to autoregressive decoding.

Marker converts and cleans text incrementally. It uses 4 models - column detector, layout detector, nougat, postprocessor. It OCRs if needed.

Oct 15, 2019 • 16 tweets • 4 min read

1/ In this thread, I'll discuss @LambdaSchool, a bootcamp that charges 17% of your pre-tax income for up to 2 years (ISA).

tl;dr Lambda is much more expensive than the average bootcamp, and has similar outcomes. 75% of Lambda students could pay an avg of $9k less elsewhere.

2/ First, outcomes.

85.9% of Lambda graduates get a job within 180 days, with a median 60k salary.

A survey across multiple bootcamps found that 79% of all bootcamp grads were employed within 120 days, with a median 65k salary.
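A quick back-of-the-envelope calculation of what the ISA terms quoted above imply. It assumes a graduate earns the median 60k salary flat for the full 2 years, and it ignores any payment cap or income threshold, since none is mentioned here. The function name is hypothetical.

```python
def isa_total(annual_income: int, share_percent: int = 17, years: int = 2) -> int:
    """Total paid under an income share agreement, in whole dollars.

    Integer math avoids floating-point rounding on the percentage.
    """
    return annual_income * share_percent * years // 100


median_salary = 60_000
total = isa_total(median_salary)
print(total)
```

Under these assumptions the total comes to $20,400, which is an upper bound for someone at the median salary, not a figure for the typical student.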
