Melissa Dell
Economics Professor @Harvard. Development economics, political economy, economic history, deep learning methods for data curation.
Sep 1, 2023 · 13 tweets · 5 min read
I’m excited to share American Stories, a new billion-scale dataset of structured texts/layouts from public domain newspapers (1780-1960) that we’ve built using our deep learning packages. #EconTwitter (1/13)
Paper: arxiv.org/abs/2308.12477
Dataset: huggingface.co/datasets/dell-…
We detect 1.14 billion individual content regions in around 20M newspaper scans from the Library of Congress’s Chronicling America collection. Headlines, articles, bylines, and captions are custom-OCRed. The dataset contains 438 million structured article texts. (2/13)
Apr 8, 2021 · 19 tweets · 7 min read
(1/n) Social science research often relies on scans of documents such as statistical tables, newspapers, firm-level reports, etc. #EconTwitter
(2/n) Unfortunately, OCR often fails to detect layouts in such documents. These figures show off-the-shelf OCRed bounding boxes. Much of the text is not detected, some is detected twice, and some is scrambled. The OCR cannot distinguish different text types, i.e. headlines vs. captions vs. articles.