Economics Professor @Harvard. Development economics, political economy, economic history, deep learning methods for data curation.
Sep 1, 2023 • 13 tweets • 5 min read
I’m excited to share American Stories, a new billion-scale dataset of structured texts/layouts from public domain newspapers (1780-1960) that we’ve built using our deep learning packages. #EconTwitter (1/13)
Paper: arxiv.org/abs/2308.12477
Dataset: huggingface.co/datasets/dell-…
We detect 1.14 billion individual content regions in around 20M newspaper scans from the Library of Congress’s Chronicling America collection. Headlines, articles, bylines, and captions are custom-OCRed. The dataset contains 438 million structured article texts. (2/13)
Apr 8, 2021 • 19 tweets • 7 min read
(1/n) Social science research often relies on scans of documents such as statistical tables, newspapers, firm-level reports, etc. #EconTwitter
(2/n) Unfortunately, OCR often fails to detect layouts in such documents. These figures show off-the-shelf OCRed bounding boxes. Much of the text is not detected, and some is detected twice or scrambled. The OCR cannot distinguish different text types, i.e., headlines vs. captions vs. articles.