(1/n) Social science research often relies on scans of documents such as statistical tables, newspapers, firm-level reports, etc. #EconTwitter
(2/n) Unfortunately, OCR often fails to detect layouts in such documents. These figures show off-the-shelf OCR bounding boxes. Much of the text is not detected, and some is detected twice or scrambled. The OCR cannot distinguish different text types, i.e., headlines vs. captions vs. articles.
(3/n) We are releasing an open-source deep-learning powered library, Layout Parser, that provides a variety of tools for automatically processing document image data at scale.

Webpage: layout-parser.github.io
arXiv: arxiv.org/abs/2103.15348
GitHub: github.com/Layout-Parser/…
(4/n) Contrast the off-the-shelf OCR with the layout detection results we achieve through Layout Parser’s deep learning powered pipelines.
(5/n) We are currently using Layout Parser to process millions of such documents.
(6/n) Layout Parser is not just for English. Here’s another example, a complex historical table from Japan.
(7/n) These are the Layout Parser functionalities
(8/n) Layout Parser currently has some pre-trained models, and the pipelines for the above examples will be integrated when finalized. We are working to expand the types of documents it can process off-the-shelf.
(9/n) With Layout Parser, you can train your own customized DL-based layout models. Because our pre-trained model zoo is currently small, Layout Parser is for now mostly useful for designing your own customized models.
(10/n) Don’t have labeled data? Layout Parser incorporates a data annotation toolkit that makes it more efficient to create labeled data.
(11/n) Among its varied functionalities is a perturbation-based scoring method that selects the most informative samples to label: arxiv.org/abs/2010.01762
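To give a feel for what perturbation-based scoring can mean, here is a loose sketch of the general idea, not the method from the linked paper: nudge a predicted box, ask the model to re-predict from each nudged box, and treat low agreement with the original box as a sign the sample is worth labeling. The names `iou`, `perturb`, `instability_score`, and the `refine` callback are all hypothetical and stand in for whatever the real model does.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def perturb(box: Box, shift: float) -> List[Box]:
    """Shift the box left, right, up, and down by a fraction of its size."""
    x1, y1, x2, y2 = box
    dx, dy = shift * (x2 - x1), shift * (y2 - y1)
    return [
        (x1 + dx, y1, x2 + dx, y2),
        (x1 - dx, y1, x2 - dx, y2),
        (x1, y1 + dy, x2, y2 + dy),
        (x1, y1 - dy, x2, y2 - dy),
    ]


def instability_score(box: Box, refine: Callable[[Box], Box], shift: float = 0.1) -> float:
    """1 minus the mean IoU between the box and the re-predicted perturbed boxes.

    `refine` is a hypothetical stand-in for the model: it maps a perturbed
    box to the model's new prediction. Higher scores mean less stable
    predictions, i.e. more informative samples under this assumption.
    """
    refined = [refine(p) for p in perturb(box, shift)]
    return 1.0 - sum(iou(box, r) for r in refined) / len(refined)
```

Under this sketch, a model that always snaps back to the same box scores 0, while one that simply follows the perturbation scores higher, so the second prediction would be sent to annotators first.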
(12/n) Layout Parser builds wrappers to call OCR engines and comes with a DL-based CNN-RNN OCR model.
(13/n) Layout Parser provides a flexible output structure to facilitate diverse downstream analyses.
(14/n) Layout Parser is implemented with simple APIs and can perform off-the-shelf layout analysis with four lines of Python code.
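A minimal sketch of what those few lines can look like, assuming the `layoutparser` package with its Detectron2 backend and OpenCV are installed. The wrapper name `detect_layout` is hypothetical, and the PubLayNet config string is just one entry from the model zoo that may not suit your documents.

```python
def detect_layout(image_path, config_path="lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config"):
    """Run a pre-trained layout detection model on one scanned page."""
    # Imported lazily because both packages are heavy, optional dependencies.
    import cv2
    import layoutparser as lp

    image = cv2.imread(image_path)  # load the page scan as an array
    if image is None:
        raise FileNotFoundError(f"could not read image: {image_path}")

    # Assumption: a PubLayNet-trained model; weights are fetched on first use.
    model = lp.Detectron2LayoutModel(config_path)

    # Returns the detected regions, each with a type, score, and coordinates.
    return model.detect(image)
```

Calling `detect_layout("page.png")` (a hypothetical file name) would return the detected layout, which can then be filtered by region type or passed on to an OCR step.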
(15/n) No background in deep learning? I’m teaching a new course this semester on deep learning for data curation at scale. I’ll be putting the course material into a public knowledgebase. I’ll post here when this is released (sometime in the next 1-2 months).
(16/n) We hope to make substantial innovations. With more resources we can expand the pre-trained model zoo significantly. Ultimately, we hope to convert the library into a user-friendly online platform that can be used by anyone, regardless of Python literacy or hardware.
(17/n) Building this takes a ton of work and financial resources. We’ve been invited to the final round of a large grant competition that would significantly expand Layout Parser, but we need to show there is demand for this from the social science community.
(18/n) If Layout Parser seems relevant to your work, please consider taking less than a minute to visit our website: layout-parser.github.io. If you are on GitHub, take two seconds to star our repo: github.com/Layout-Parser/…. This will help us demonstrate crucial community support.
(19/n) Layout Parser contributors: @_shannon_shen, @ruochenxD, @MelissaLDell, @lee_bcg, @J_S_Carlson, Weining Li. Currently working with @qlquanle, @pquerubo, @LeanderHeldring, @krishna_econ, Sahar Parsa, and awesome RAs on additional models that will be added when complete.
