The curation week is almost over and I would like to thank everyone for joining the discussions this week! It's been a blast 🥳
If you enjoyed this week, feel free to reach out on Twitter (@cosima_meyer) or GitHub (github.com/cosimameyer/) ✨
I feel very honored that I had the chance to talk with you about the things I enjoy doing, and I cannot wait to learn more from the upcoming curators - the lineup looks amazing! 🎉
If you missed a Twitter thread this week, head over to @pilizalde's amazing thread where she collected all of them (I love the GitHub emoji!)
But before I leave and head over to do some serious vinyl shopping 🎶, I want to talk about #NLP with you 💬
💡 What is NLP?
NLP is short for Natural Language Processing, and it helps us make sense of a difficult data type: written text.
But let's first start with the basic concepts and add a Gilmore Girls flavor to them - because who would be a better use case than characters with a seemingly endless vocabulary?
✨ Corpus: When you have your text data ready, you have your corpus. It's a collection of documents.
✨ Tokens: The single units of a text - usually each word, but a token could also be a sentence, paragraph, or character.
✨ Tokenization: When you hear the word tokenization, it means that you are splitting up the sentences into single words (tokens) and turning them into a bag of words. You can take this quite literally - a bag of words does not really take the order of the words into account.
There are ways to account for word order using n-grams (a bigram, for instance, would turn the sentence "Rory lives in a world of books" into "Rory lives", "lives in", "in a", "a world", "world of", "of books"), but they are limited.
✨ Document-feature matrix (DFM): To generate the DFM, you first split the text into its single terms (tokens), then count how frequently each token occurs in each document.
✨ Stemming: Cuts each word down to its stem, so "studies", "studying", and "studied" all become "stud". ✨ Lemmatization: With lemmatization, it's slightly different. Instead of "stud" (which would probably be the stem of the study terms), you end up with a meaningful base form - "study" 🥳
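To make these concepts concrete, here is a minimal sketch in #rstats with 📦 {quanteda} (the toy sentences are my own, not from the thread):

```r
library(quanteda)

# Corpus: a collection of documents
corp <- corpus(c(
  doc1 = "Rory lives in a world of books.",
  doc2 = "Lorelai and Rory talk fast."
))

# Tokenization: split each document into single words (tokens)
toks <- tokens(corp, remove_punct = TRUE)

# n-grams: keep a bit of word order, e.g. bigrams
tokens_ngrams(toks, n = 2)
```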
The overview also nicely describes a typical workflow with the bag-of-words approach:
👉 tokenize it (and turn it into a bag full of words),
👉 pre-process it by stemming it (and removing stop words and a bit more),
👉 count the single words and turn the counts into a DFM (document-feature matrix) - and now you're ready to go! 🚀
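In 📦 {quanteda}, that workflow might look like this minimal sketch (again with my own toy corpus):

```r
library(quanteda)

corp <- corpus(c(
  doc1 = "Rory studies in a world of books",
  doc2 = "Lorelai studied the coffee menu"
))

# 1) Tokenize: the bag full of words
toks <- tokens(corp, remove_punct = TRUE)

# 2) Pre-process: drop stop words, then stem
toks <- tokens_remove(toks, stopwords("en"))
toks <- tokens_wordstem(toks)

# 3) Count the words per document: the DFM
dfm(toks)  # "studies" and "studied" both end up as "studi"
```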
From here on, you can tackle multiple tasks - for instance, #supervised tasks with dictionary approaches, or classifying sentiment or topics. But you can also perform #unsupervised tasks like structural topic models. The possibilities are almost endless.
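Here is a taste of the dictionary approach, as a minimal sketch (the two-category toy dictionary is mine; for real work you would swap in a validated one such as {quanteda}'s data_dictionary_LSD2015):

```r
library(quanteda)

toks <- tokens(c(
  doc1 = "Stars Hollow is a wonderful little town",
  doc2 = "This pop quiz is terrible"
))

# Toy sentiment dictionary - replace with a validated dictionary
dict <- dictionary(list(
  positive = c("wonderful", "great"),
  negative = c("terrible", "awful")
))

# Count dictionary hits per document
dfm(tokens_lookup(toks, dict))
```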
If you're up for more on how to use 📦 {quanteda} in #rstats for these tasks, here is more from a hands-on workshop I had the honor of giving at @RLadiesBergen
I published the code in a readable and downloadable #Rmd file for you to use 👇
It contains everything from detailed descriptions of terms and concepts, through data preparation, to supervised and unsupervised approaches.
In my last Twitter thread, I want to talk with you about some powerful approaches in #NLP and how we can use both #rstats and #python to unleash them 💪
One possible downside of the bag-of-words approach described before is that you often cannot fully take the structure of the language into account (n-grams are one way to do so, but they are often limited).
You also often need a lot of data to successfully train your model - which can be time-consuming and labor-intensive. An alternative is to use a pre-trained model. And here comes @Google's famous deep learning model: BERT.
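If you want to try BERT from #rstats, one option is the 📦 {text} package, which wraps Hugging Face transformers - a minimal sketch, assuming {text} and its Python backend are set up (check the package docs for details):

```r
library(text)

# Turn sentences into BERT embeddings with a pre-trained model
# (the model is downloaded from Hugging Face on first use)
emb <- textEmbed(
  c("Rory lives in a world of books"),
  model = "bert-base-uncased"
)
```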
💡 What is reactivity and what does it have to do with a carrier pigeon? 🐦
To better understand how a #ShinyApp works, it helps to know what's behind reactivity.
To describe it, I love the image of a carrier pigeon 🐦 (I picked up this idea when reading a post by @StatGarrett - so all credits go to him and all errors are mine ✨)
Reactivity is "a magic trick [that] creates the illusion that one thing is happening, when in fact something else is going on" (shiny.rstudio.com/articles/under…).
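To see the carrier pigeon in action, here is a minimal sketch (my own toy app, not from the original thread): whenever input$n changes, Shiny re-runs the reactive expression and delivers the new value to the output.

```r
library(shiny)

ui <- fluidPage(
  numericInput("n", "Pick a number:", value = 10),
  textOutput("msg")
)

server <- function(input, output) {
  # The carrier pigeon: when input$n changes, this reactive
  # expression re-runs and carries the new value downstream
  doubled <- reactive(input$n * 2)

  output$msg <- renderText(paste("Twice your number:", doubled()))
}

shinyApp(ui, server)
```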
👩🏼‍💻 How do you set up your own #ShinyApp?
It's easy in #rstats! Start a new #Rproject and select "Shiny Application". It will create a project with an "app.R" file for you ✨
Once it's open, you can replace the code that is already in the "app.R" file with the code snippet below 👇 It does all the magic and shows how you can build a simple #ShinyApp 🔮
You have checkboxes on the left side that let you choose countries (by their ISO3 abbreviation, so "RWA" stands for Rwanda), and, depending on what you selected, your #ShinyApp will show a (non-realistic) population size for each country in a new plot.
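The original snippet was shared as an image; here is a minimal sketch that matches the description (the country list and the made-up population numbers are my reconstruction):

```r
library(shiny)
library(ggplot2)

# Made-up data - the population values are intentionally non-realistic
pop <- data.frame(
  country    = c("RWA", "KEN", "NGA", "GHA"),
  population = c(5, 10, 25, 8)
)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      checkboxGroupInput(
        "countries", "Choose countries:",
        choices = pop$country, selected = "RWA"
      )
    ),
    mainPanel(plotOutput("pop_plot"))
  )
)

server <- function(input, output) {
  output$pop_plot <- renderPlot({
    # Plot only the countries ticked in the checkboxes
    ggplot(
      subset(pop, country %in% input$countries),
      aes(x = country, y = population)
    ) +
      geom_col()
  })
}

shinyApp(ui, server)
```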
Today, we'll discover how you can use the power of #rstats to create an interactive #shinyapp ✨
💡 What is a ShinyApp?
Shiny is a framework that allows you to create web applications - ShinyApps ☺️ You can use them for multiple purposes - to visualize data 🎨 (for instance the Scottish Household Survey by @ViktErik, bit.ly/3TqZevY, ...