Melissa Dell
Economics Professor @Harvard. Development economics, political economy, economic history, deep learning methods for data curation.
Dec 22, 2023 4 tweets 2 min read
I’m excited to share News Déjà Vu (newsdejavu.github.io), which uses a custom large language model to retrieve the historical news articles most similar to modern news articles. (1/4)
We first mask out all named entities (e.g. people, locations, organizations). The language model, trained to capture semantic similarity, then maps each news article to a vector. For a given modern news article, we choose the closest historical article in this vector space. (2/4)
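The three steps in (2/4) can be sketched in a few lines of Python. This is only a toy illustration, not the News Déjà Vu implementation: the `ENTITIES` list stands in for a trained named-entity recognizer, the bag-of-words `embed` function stands in for the custom language model, and all names and example sentences here are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical entity list (assumption); a real pipeline would use an NER model.
ENTITIES = ["Acme Corp", "Smith", "Jones", "London", "Paris"]
MASK = "[MASK]"


def mask_entities(text, entities=ENTITIES):
    """Replace every listed entity with a mask token, longest names first."""
    for ent in sorted(entities, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(ent) + r"\b", MASK, text)
    return text


def embed(text):
    """Toy stand-in for the language model: a sparse word-count vector."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop the mask token itself so it does not contribute to similarity.
    return Counter(t for t in tokens if t != "mask")


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_historical(modern, historical):
    """Return (index, similarity) of the historical article nearest to `modern`."""
    query = embed(mask_entities(modern))
    scores = [cosine(query, embed(mask_entities(h))) for h in historical]
    best = max(range(len(historical)), key=scores.__getitem__)
    return best, scores[best]


historical = [
    "Smith warns that bank failures in London could spread panic among depositors",
    "Jones wins the annual county fair prize for the largest pumpkin",
]
modern = "Acme Corp warns that bank failures in Paris could spread panic among depositors"

index, score = closest_historical(modern, historical)
print(index, round(score, 3))
```

Because the entities are masked before embedding, the modern article matches the first historical article on its content (a warning about bank failures) even though the people and places differ.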
Sep 1, 2023 13 tweets 5 min read
I’m excited to share American Stories, a new billion-scale dataset of structured texts/layouts from public domain newspapers (1780-1960) that we’ve built using our deep learning packages. #EconTwitter (1/13)
Paper: arxiv.org/abs/2308.12477
Dataset: huggingface.co/datasets/dell-…
We detect 1.14 billion individual content regions in around 20M newspaper scans from the Library of Congress’s Chronicling America collection. Headlines, articles, bylines, and captions are custom-OCRed. The dataset contains 438 million structured article texts. (2/13)
Apr 8, 2021 19 tweets 7 min read
(1/n) Social science research often relies on scans of documents such as statistical tables, newspapers, and firm-level reports. #EconTwitter
(2/n) Unfortunately, OCR often fails to detect layouts in such documents. These figures show bounding boxes from off-the-shelf OCR: much of the text is not detected, and some is detected twice or scrambled. The OCR also cannot distinguish different text types, i.e. headlines vs. captions vs. articles.