1/ With @LC_Labs, #NDNP, and @dsweld, I’m excited to share the Newspaper Navigator search app: train your own AI navigators to search over 1.5 million historic newspaper photos by visual similarity! (desktop viewing recommended) #ChronAm
2/ For the first phase of my @librarycongress Innovator in Residence project, I created the #NewspaperNavigator dataset: extracted visual content from 16+ million newspaper pages in #ChronAm. This search app provides new ways of searching the dataset.
3/ In addition to supporting keyword search, the Newspaper Navigator app leverages image embeddings to support on-the-fly training and re-training of AI navigators in just a couple of seconds. Here is an example of training an AI navigator to retrieve photos of sailboats:
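To make the idea of re-training on precomputed embeddings concrete, here is a minimal sketch, not the app's actual code. It assumes each photo already has a fixed embedding vector and fits a tiny logistic regression on a few user-labeled examples, then ranks every photo by the resulting score. The function names (`train_navigator`, `rank_photos`) and the 2-dimensional toy embeddings are hypothetical, chosen only for illustration.

```python
import math

def train_navigator(embeddings, positives, negatives, epochs=200, lr=0.5):
    """Fit a small logistic regression on precomputed embeddings.

    positives / negatives are lists of photo indices the user marked
    as relevant / not relevant (hypothetical interface).
    """
    dim = len(embeddings[0])
    w = [0.0] * dim
    b = 0.0
    examples = [(i, 1.0) for i in positives] + [(i, 0.0) for i in negatives]
    for _ in range(epochs):
        for i, label in examples:
            x = embeddings[i]
            z = sum(wj * xj for wj, xj in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - label  # gradient of the log loss with respect to z
            w = [wj - lr * g * xj for wj, xj in zip(w, x)]
            b -= lr * g
    return w, b

def rank_photos(embeddings, w, b):
    """Return photo indices sorted from most to least relevant."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in embeddings]
    return sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)

# Toy 2-d embeddings: photos 0, 1, 4 sit near one direction ("sailboats"),
# photos 2, 3, 5 near another. Real embeddings would have many more dimensions.
embeddings = [
    [0.90, 0.10], [0.80, 0.20], [0.10, 0.90],
    [0.20, 0.80], [0.85, 0.15], [0.15, 0.85],
]
w, b = train_navigator(embeddings, positives=[0], negatives=[2])
ranking = rank_photos(embeddings, w, b)
print(ranking)
```

Because the embeddings are computed once ahead of time, only the small linear model is re-fit when labels change, which is one plausible reason re-training can take seconds rather than hours.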
4/ As with the Newspaper Navigator dataset and #ChronAm, the full search app is in the public domain, from the photos to the code. You can find all code for Newspaper Navigator at this GitHub repo: github.com/LibraryOfCongr…
5/ Because machine learning has a fraught history of perpetuating marginalization, I wrote a data archaeology investigating the ways in which machine learning mediates our interactions with the visual content in the Newspaper Navigator dataset and app.
6/ In the data archaeology, I study the digitization journeys of 4 reproductions of the same photo of W.E.B. Du Bois, as found in the pages of 3 Black newspapers in #ChronAm.
1/ Announcing GovScape – multimodal search for 10 million government PDFs (70 million pages) from the End of Term Web Archive! GovScape offers visual search, semantic textual search, and keyword search.
2/ GovScape is built on top of the End of Term Web Archive and contains all renderable PDFs of 50 pages or fewer from the 2020 crawl, documenting the first Trump administration. An overview of GovScape’s search functionality can be found here: eotarchive.org
3/ The GovScape pre-processing pipeline ingests PDFs, renders them, generates CLIP and BGE embeddings of individual pages, and indexes the full text. The total compute cost for GovScape's pre-processing pipeline for 10 million PDFs was approximately $1,500.
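For a sense of scale, the figures above work out to a very small per-document cost. The short calculation below uses only the numbers from this thread (10 million PDFs, 70 million pages, roughly $1,500); the variable names are illustrative, and the results are approximate because the quoted cost is approximate.

```python
# Figures as quoted in the thread (the cost is approximate).
total_cost_usd = 1_500
num_pdfs = 10_000_000
num_pages = 70_000_000

cost_per_pdf = total_cost_usd / num_pdfs    # dollars per PDF
cost_per_page = total_cost_usd / num_pages  # dollars per page
pages_per_pdf = num_pages / num_pdfs        # average pages per PDF

print(f"Average pages per PDF: {pages_per_pdf:.1f}")
print(f"Cost per PDF:  ${cost_per_pdf:.5f}")
print(f"Cost per page: ${cost_per_page:.7f}")
print(f"Cost per 1,000 pages: ${cost_per_page * 1000:.4f}")
```

That is about 7 pages per PDF on average, roughly $0.00015 per PDF, and about two cents per thousand pages.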