This post is a static and abbreviated version of this interactive tutorial on using R for social data analytics.
Wordclouds are perhaps the most basic way of representing text data. You can use them to reveal prominent topics in a large body of tweets, or to get a sense of user demographics from the keywords used in Twitter bios.
Do I need new libraries?
Yes, we will use quanteda for creating wordclouds. quanteda (https://quanteda.io/) is a library built for advanced text mining. It is fairly new and includes a great deal of cutting-edge text mining techniques such as topic modeling and semantic network analysis. We will cover the more advanced uses later in the semester. For now, we stick with the simple act of generating wordclouds.
Below we will use pre-collected Twitter user data, which are saved as a CSV file on my server. We use read.csv() to download the data into a data frame called users. The data contain 966 users whose Twitter bios include the #maga (Make America Great Again) hashtag.
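To get a feel for what read.csv() hands back, here is a minimal sketch that reads a tiny inline CSV instead of the real file. The two rows and the screen_name column are made up for illustration; only the description column name comes from the tutorial data. Depending on your R version and settings, text columns may arrive as factors, which is why the tutorial converts description to character.

```r
# A tiny, made-up CSV standing in for users.csv (hypothetical rows)
csv_text <- "screen_name,description
user_a,Proud patriot #maga
user_b,Coffee and data"

# stringsAsFactors = TRUE mimics the default of older R versions (before 4.0)
toy_users <- read.csv(text = csv_text, stringsAsFactors = TRUE)

nrow(toy_users)                      # number of users (rows)
is.factor(toy_users$description)     # text was read in as a factor

# Convert the bios to plain character strings before text analysis
toy_users$description <- as.character(toy_users$description)
is.character(toy_users$description)
```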
We convert the information in the description column (which stores the user bios) to characters (note: this is optional if your raw data are already standardized as characters). We then convert the text into something called dfm.
What is dfm?
It stands for document-feature matrix. It is, simply put, a structured and quantified text format that many analyses in quanteda are based on.
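To make the idea concrete, here is a small sketch that builds a dfm from two made-up sentences (the texts and the names doc1 and doc2 are assumptions for illustration, not part of the tutorial data). Each row is a document, each column is a feature (here, a word), and each cell counts how often that word appears in that document.

```r
library(quanteda)

# Two toy documents (hypothetical, for illustration only)
texts <- c(doc1 = "cats chase mice",
           doc2 = "dogs chase cats and cats")

# Split the texts into tokens, then count tokens per document
toy_dfm <- dfm(tokens(texts))

ndoc(toy_dfm)       # number of documents (rows)
nfeat(toy_dfm)      # number of distinct features (columns)
as.matrix(toy_dfm)  # the counts as an ordinary matrix
```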
In creating the dfm, we remove stopwords, numbers, symbols, and punctuation. Lastly, we turn the dfm into a wordcloud. You can tweak parameters in the textplot_wordcloud() function to change the colors and layout of the wordcloud. For example, if you want to show fewer words, set max_words to a lower number.
library(quanteda)
library(readr)
users <- read.csv("https://curiositybits.cc/files/users.csv")
users$description <- as.character(users$description)
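# Build the dfm, dropping English stopwords, numbers, symbols, and punctuation.
# Note: this call follows the older quanteda interface; newer releases (3.0 onward) expect tokens() before dfm() and provide textplot_wordcloud() in the quanteda.textplots package.
dfm <- dfm(users$description, remove = stopwords("english"), remove_numbers = TRUE, remove_symbols = TRUE, remove_punct = TRUE)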
#uncomment the next line to get a wordcloud based on hashtags.
#dfm <- dfm_select(dfm, pattern = ("#*"))
set.seed(100)
textplot_wordcloud(dfm, min_size = 1.5, min_count = 10, max_words = 100)
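If you are on a newer quanteda release, the same cleaning steps can be written as a tokens() pipeline. The sketch below is one way to do it, using three made-up bios in place of users$description (the bios and the object names toks and bio_dfm are assumptions). It ends with topfeatures(), which lists the most frequent words and is a handy preview of what the wordcloud will emphasize.

```r
library(quanteda)

# Hypothetical bios standing in for users$description
bios <- c("Proud patriot and veteran. #MAGA 2020!",
          "Patriot, mom of 3, coffee lover #maga",
          "Data nerd. Views are my own.")

# Tokenize, dropping punctuation, numbers, and symbols
toks <- tokens(bios, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE)

# Drop English stopwords, then count what is left (dfm() lowercases by default)
toks <- tokens_remove(toks, pattern = stopwords("english"))
bio_dfm <- dfm(toks)

# Most frequent features: a quick preview of the wordcloud
topfeatures(bio_dfm, n = 5)
```

With quanteda 3.0 or later you would then load quanteda.textplots and pass bio_dfm to textplot_wordcloud() as in the code above.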
Have you spotted anything interesting in the wordcloud?
There are probably many words in the wordcloud that, if taken out of context, would be subject to many interpretations. To reduce the noise in the data, we can try showing only hashtags as many users express identities through hashtags.
This is simple to do: find the following line in the code block, uncomment it, and re-run the code. The line keeps only the features in dfm that begin with #.
dfm <- dfm_select(dfm, pattern = ("#*"))
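To see what this line does without re-running the whole analysis, here is a small sketch on three made-up bios (the bios and the names bio_dfm and tag_dfm are assumptions for illustration). The pattern "#*" is a glob, so it matches any feature that starts with # followed by anything.

```r
library(quanteda)

# Hypothetical bios (for illustration only)
bios <- c("Proud patriot #MAGA #KAG",
          "Mom, wife, Christian #maga",
          "Data nerd, no hashtags here")

bio_dfm <- dfm(tokens(bios, remove_punct = TRUE))

# Keep only the features that begin with "#"
tag_dfm <- dfm_select(bio_dfm, pattern = "#*")

featnames(tag_dfm)    # the remaining features are all hashtags
topfeatures(tag_dfm)  # hashtag counts across all bios
```

Because dfm() lowercases text by default, #MAGA and #maga are counted as the same hashtag.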