Latest Twitter Threads by @MelissaLDell on Thread Reader App

Dec 22, 2023 • 4 tweets • 2 min read

I’m excited to share News Déjà Vu (newsdejavu.github.io), which uses a custom large language model to retrieve historical news articles that are the most similar to modern news articles. (1/4)

We first mask out all named entities (e.g. people, locations, organizations). The language model, trained to capture semantic similarity, then maps each news article to a vector. For a given modern news article, we choose the closest historical article in this vector space. (2/4)
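The mask, embed, and nearest-neighbor steps described above can be illustrated with a toy sketch. This is not the News Déjà Vu implementation: the entity list is supplied by hand instead of coming from a named entity recognizer, and a bag-of-words count vector stands in for the custom language model. The function names (`mask_entities`, `embed`, `closest_historical`) and the sample articles are hypothetical.

```python
import math
import re
from collections import Counter


def mask_entities(text, entities, mask="[MASK]"):
    """Replace each listed entity string with a mask token.

    Assumption: entities are given by hand; a real pipeline would
    detect them with a named entity recognition model.
    """
    # Replace longer entities first so "Ohio River" is masked before "Ohio".
    for ent in sorted(entities, key=len, reverse=True):
        text = re.sub(re.escape(ent), mask, text, flags=re.IGNORECASE)
    return text


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for the language model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_historical(modern, historical, entities):
    """Return the index of the historical article closest to the modern one."""
    query = embed(mask_entities(modern, entities))
    vectors = [embed(mask_entities(h, entities)) for h in historical]
    return max(range(len(historical)), key=lambda i: cosine(query, vectors[i]))


# Hypothetical sample data, for illustration only.
entities = ["Smith", "Jones", "Chicago", "Houston", "Ohio River"]
historical = [
    "Senator Smith proposes new tariff on imported steel in Chicago",
    "Flood waters destroy homes along the Ohio River",
]
modern = "Senator Jones proposes new tariff on imported steel in Houston"

best = closest_historical(modern, historical, entities)
print(historical[best])
```

Because the names and places are masked before comparison, the match is driven by what the article is about (a steel tariff proposal) and not by who or where it involves.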

Sep 1, 2023 • 13 tweets • 5 min read

I’m excited to share American Stories, a new billion-scale dataset of structured texts/layouts from public domain newspapers (1780-1960) that we’ve built using our deep learning packages. #EconTwitter (1/13)
Paper: arxiv.org/abs/2308.12477
Dataset: huggingface.co/datasets/dell-…

We detect 1.14 billion individual content regions in around 20M newspaper scans from the Library of Congress’s Chronicling America collection. Headlines, articles, bylines, and captions are custom-OCRed. The dataset contains 438 million structured article texts. (2/13)

Apr 8, 2021 • 19 tweets • 7 min read

(1/n) Social science research often relies on scans of documents such as statistical tables, newspapers, firm-level reports, etc. #EconTwitter

(2/n) Unfortunately, OCR often fails to detect layouts in such documents. These figures show off-the-shelf OCRed bounding boxes. Much of the text is not detected, some is detected twice, and some is scrambled. The OCR cannot distinguish different text types, i.e. headlines vs. captions vs. articles.
