It’s no wonder then that social scientists are increasingly interested in narratives -- the stories we tell in fiction, politics, and life -- and how they shape beliefs, behavior, and government policies.
Narratives are hard for social scientists to measure because they consist of information: their only physical manifestations are spoken or written language.
More specifically, a narrative is an “account of a series of events, facts, etc., given in order and with the establishing of connections between them” (@OED).
Yet existing text-as-data approaches do not account for "who" does "what" to "whom".
We provide an approach for extracting narratives from text.
First, we use semantic role labeling (@ai2_allennlp) to extract the semantic roles of agent, verb, and patient. The agent is the entity that performs an action, while the patient is the entity acted upon.
The set of agents and patients is high-dimensional (typically millions of plain-text phrases).
We use named entity recognition (@spacy_io) to identify specific individuals and organizations. The remaining phrases are embedded (@gensim_py) and then clustered (@scikit_learn).
The resulting unsupervised pipeline takes in a plain-text corpus and outputs interpretable narratives representing the core claims.
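To make the last step concrete, here is a minimal Python sketch. It is not the package's actual API. The triples and the phrase-to-label mapping are made up for illustration; in the real pipeline they would come from the semantic role labeling, named entity recognition, and clustering steps described above.

```python
from collections import Counter

# Hypothetical SRL output: one (agent, verb, patient) triple per statement.
triples = [
    ("illegal immigrants", "steal", "american jobs"),
    ("immigrants", "steal", "our jobs"),
    ("high taxes", "kill", "jobs"),
    ("big oil companies", "make", "record profits"),
]

# Hypothetical phrase -> label mapping, standing in for the NER and
# embedding-cluster steps that reduce millions of phrases to a few entities.
entity_label = {
    "illegal immigrants": "immigrants",
    "immigrants": "immigrants",
    "american jobs": "jobs",
    "our jobs": "jobs",
    "jobs": "jobs",
    "high taxes": "taxes",
    "big oil companies": "oil companies",
    "record profits": "profits",
}


def to_narrative(triple):
    """Replace raw agent/patient phrases with their low-dimensional labels."""
    agent, verb, patient = triple
    return (
        entity_label.get(agent, agent),
        verb,
        entity_label.get(patient, patient),
    )


# Count how often each reduced (agent, verb, patient) narrative occurs.
narrative_counts = Counter(to_narrative(t) for t in triples)

for (agent, verb, patient), n in narrative_counts.most_common():
    print(f"{agent} {verb} {patient}: {n}")
```

The point of the mapping is that two differently worded statements ("illegal immigrants steal american jobs", "immigrants steal our jobs") collapse into the same countable narrative.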
In the paper, we construct narratives from floor speeches in the U.S. Congress.
Some narratives are simple (e.g. “immigrants steal jobs”), but others are complex and interconnected.
We use a graph-based approach to build networks of connected entities, representing the larger narrative structures — or worldviews — expressed in a corpus.
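A rough sketch of the graph idea in plain Python, under the assumption that entities are nodes and each narrative is a directed edge from agent to patient, labeled with the verb. The example narratives and the helper names (`build_graph`, `connected_entities`) are hypothetical, and the paper's actual graph construction may differ.

```python
from collections import deque

# Illustrative (agent, verb, patient) narratives.
narratives = [
    ("immigrants", "steal", "jobs"),
    ("taxes", "kill", "jobs"),
    ("companies", "ship", "jobs"),
    ("oil", "make", "profit"),
]


def build_graph(narratives):
    """Map each entity to its outgoing (verb, patient) edges."""
    graph = {}
    for agent, verb, patient in narratives:
        graph.setdefault(agent, []).append((verb, patient))
        graph.setdefault(patient, [])  # make sure patients are nodes too
    return graph


def connected_entities(graph, start):
    """Return all entities linked to `start`, ignoring edge direction."""
    neighbors = {node: set() for node in graph}
    for agent, edges in graph.items():
        for _verb, patient in edges:
            neighbors[agent].add(patient)
            neighbors[patient].add(agent)
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over the undirected view
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


graph = build_graph(narratives)
print(sorted(connected_entities(graph, "jobs")))
```

Here "jobs" ties together immigrants, taxes, and companies, while the oil narrative sits in a separate component.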
Special thanks to @AndreiPlamada and @ETH_SIS for indispensable contributions to the package!
In the paper, we apply the method to over a million speeches given in the U.S. Congress between 1994 and 2015. We show dynamics, sentiment, and partisanship in the narratives.
In particular, we show the most divisive policy narratives.
For example, “Oil”: Democrats say “oil makes profit” while Republicans say “oil creates jobs”.
Or “Jobs”: Democrats say “companies ship jobs” while Republicans say “taxes kill jobs”.
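One way to rank narratives by divisiveness is a smoothed log-odds ratio of how often each party uses them. That choice of statistic is an assumption here, not necessarily the exact measure in the paper, and the counts below are invented purely for illustration.

```python
import math

# Made-up usage counts per party ("D" = Democrats, "R" = Republicans).
counts = {
    ("oil", "make", "profit"): {"D": 40, "R": 5},
    ("oil", "create", "jobs"): {"D": 4, "R": 30},
    ("companies", "ship", "jobs"): {"D": 50, "R": 8},
    ("taxes", "kill", "jobs"): {"D": 3, "R": 45},
}


def log_odds(counts, narrative, smoothing=1.0):
    """Log odds ratio of a narrative: > 0 leans Republican, < 0 leans Democrat."""
    total_d = sum(c["D"] for c in counts.values())
    total_r = sum(c["R"] for c in counts.values())
    n_d = counts[narrative]["D"]
    n_r = counts[narrative]["R"]
    # Odds of using this narrative versus any other, with additive smoothing.
    odds_d = (n_d + smoothing) / (total_d - n_d + smoothing)
    odds_r = (n_r + smoothing) / (total_r - n_r + smoothing)
    return math.log(odds_r / odds_d)


# Most divisive narratives first, by absolute log odds.
ranked = sorted(counts, key=lambda n: abs(log_odds(counts, n)), reverse=True)
for narrative in ranked:
    print(" ".join(narrative), round(log_odds(counts, narrative), 2))
```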
Section 4 discusses the potential and limitations of the approach. One thing we are excited about is how ʀᴇʟᴀᴛɪᴏ could be used to support qualitative analysis of narratives, not just in social science but also in history and the humanities.
Feedback welcome!
A special shout-out to teammates @phinifa and @PinchOfData, talented upcoming economists, grand co-authors, and a delight to work with.
"Extractive summarization" means excerpting the most important passages from a long document.
This is preferred to "abstractive summarization" (which paraphrases the document in shorter form) in technical fields like law, where the exact wording really matters.
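For intuition, here is a toy extractive summarizer in Python. It scores sentences by average word frequency, which is a simple classic baseline chosen for this sketch and not the method from the work discussed here. The stopword list and the sample text are made up.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "to", "of", "that", "was", "shall", "and", "in"}


def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences, verbatim and in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average document-level frequency of the sentence's content words.
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return [s for s in sentences if s in top]


text = (
    "The tenant shall pay rent monthly. "
    "The weather was pleasant that day. "
    "Failure to pay rent permits the landlord to terminate the lease."
)
summary = extractive_summary(text, n_sentences=2)
print(summary)
```

Because the output sentences are copied rather than rewritten, the original legal wording survives intact, which is the property that matters in the legal setting.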
Methods covered in the "Text Algorithms" section:
1) preliminaries / pre-processing
2) bag-of-words representation of docs
3) dimensionality reduction, e.g. topic models
4) word embeddings with local context
5) embedding sequences with attention
6) supervised learning
In taking these methods to economics, we organize things around "Four Measurement Problems":
We use computational linguistics tools ("word embeddings") to map out a dimension with emotion at one pole and cognition at the other.
The resulting geometric emotion scale is continuous and doesn't rely on the presence of particular words. In a human validation where annotators ranked pairs of sentences as more or less emotive, our metric agreed with human judgment much more often than a word-based measure.
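A small Python sketch of the geometric idea, with loud caveats: the three-dimensional vectors are hand-made toys rather than trained embeddings, the seed words are hypothetical, and projecting onto the difference of the two pole centroids is one plausible formulation that may not match the paper's exact formula.

```python
import math

# Toy "embeddings" (made up). Real ones would come from a trained model.
toy_vectors = {
    "love": [0.9, 0.1, 0.2],
    "fear": [0.8, 0.2, 0.1],
    "analyze": [0.1, 0.9, 0.2],
    "reason": [0.2, 0.8, 0.1],
    "heart": [0.7, 0.2, 0.5],
    "budget": [0.3, 0.6, 0.7],
    "we": [0.5, 0.5, 0.5],
}

EMOTION_SEEDS = ["love", "fear"]  # hypothetical seed words for the emotion pole
COGNITION_SEEDS = ["analyze", "reason"]  # hypothetical seeds for the cognition pole


def mean_vector(words):
    """Average the vectors of the given words, dimension by dimension."""
    vectors = [toy_vectors[w] for w in words]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def emotion_score(tokens):
    """Cosine between the text's mean vector and the emotion-minus-cognition axis."""
    known = [t for t in tokens if t in toy_vectors]
    if not known:
        raise ValueError("no tokens with a known vector")
    emotion = mean_vector(EMOTION_SEEDS)
    cognition = mean_vector(COGNITION_SEEDS)
    axis = [e - c for e, c in zip(emotion, cognition)]
    return cosine(mean_vector(known), axis)


emotive = emotion_score(["we", "love", "heart"])
analytic = emotion_score(["we", "analyze", "budget"])
print(emotive > analytic)
```

Note that "heart" and "budget" are not seed words, yet they still move the score. That is the sense in which the scale does not depend on particular words being present.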