This post also appears on Medium.
There are eleven computing devices in my home office: two work laptops, three spare Linux laptops, a Raspberry Pi desktop, a smartphone, two iPads (one old, one new), a Chromebook, and a Kindle. They are all vying for the home WiFi and my attention.
As a tech aficionado, I work and live with many screens, browser tabs, and cloud servers. Digital distraction is real and feels like enslavement.
I am currently working with my colleague at UMass and the Australian National University to build a social media tracker for the upcoming 2019 Philippine General Election.
The Shiny app is now in beta testing. You can access it from the links below.
Dashboard showing how candidates use Twitter
Dashboard showing Twitter conversation networks
This post is a static and abbreviated version of this interactive tutorial on using R for social data analytics.
What is a topic model?
Have you dreamed of a day when algorithms can quickly scan through your textbooks and give you a bullet point summary? How convenient! No more tedious reading! In fact, there are already algorithms that automatically summarize large corpora. They are called topic models. In building a topic model, we ask the computer to discover abstract topics in the text.
To understand what a semantic network looks like, go ahead and run the code below.
library(quanteda)
library(ggplot2)

reviews_tok <- tokens(review_corpus, remove_punct = TRUE, remove_numbers = TRUE,
                      remove_symbols = TRUE, remove_twitter = TRUE, remove_url = TRUE)
reviews_tok <- tokens_select(reviews_tok, pattern = stopwords('en'), selection = 'remove')
reviews_tok <- tokens_select(reviews_tok, min_nchar = 3, selection = 'keep')
reviews_dfm <- dfm(reviews_tok)

# create a feature co-occurrence matrix (FCM)
review_fcm <- fcm(reviews_dfm)

# extract the top 50 frequent terms from the FCM object
feat <- names(topfeatures(review_fcm, 50))

# trim the old FCM object into one that contains only the 50 frequent terms
fcm_select <- fcm_select(review_fcm, pattern = feat)
set.
Now you are on course to try basic text mining techniques to extract insights from textual data. In this tutorial, we will try four techniques: simple word frequency, word cloud, n-grams, and keyness.
Simple word frequency
Suppose we want to see how often the word “noisy” appears in Airbnb reviews from each of the three cities.
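Here is a minimal sketch of that count using quanteda. The reviews and the city labels (`city_a`, `city_b`, `city_c`) are made up for illustration and are not the tutorial's Airbnb data. The idea is to build a DFM, add up the rows that belong to the same city, and then read off the column for “noisy”.

```r
library(quanteda)

# Hypothetical toy reviews; city labels are placeholders, not the real cities
reviews <- data.frame(
  city = c("city_a", "city_a", "city_b", "city_c"),
  text = c("Great location but noisy at night.",
           "Quiet street, lovely host.",
           "Noisy neighbours and a noisy bar downstairs.",
           "Clean, quiet, and close to downtown."),
  stringsAsFactors = FALSE
)

toy_corpus <- corpus(reviews, text_field = "text")
toy_tok <- tokens(toy_corpus, remove_punct = TRUE)
toy_dfm <- dfm(toy_tok)  # dfm() lowercases tokens by default

# Add up the counts of all reviews that share the same city
dfm_by_city <- dfm_group(toy_dfm, groups = docvars(toy_dfm, "city"))

# Named vector: how often "noisy" appears per city
noisy_by_city <- as.matrix(dfm_by_city[, "noisy"])[, 1]
print(noisy_by_city)
```

Raw counts favor cities with more reviews, so with real data you would probably want to divide by the number of reviews or tokens per city before comparing.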
Why text cleaning?
Textual data are always messy. The data may contain words that, if taken out of context, would be meaningless. You may also encounter a group of different words that convey the same meaning. Or you might have to convert slang and acronyms into standard English, or emojis into something a computer can recognize.
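To make these cleaning steps concrete, here is a small sketch with quanteda. The two sample reviews and the slang-to-standard mapping are invented for illustration, so treat them as placeholders for whatever lookup table fits your own data.

```r
library(quanteda)

# Hypothetical messy reviews
raw <- c(r1 = "OMG the apt was gr8!!! See https://example.com",
         r2 = "The host was great and the apartment was great")

# Tokenize, dropping punctuation and URLs, then lowercase everything
tok <- tokens(raw, remove_punct = TRUE, remove_url = TRUE)
tok <- tokens_tolower(tok)

# Hypothetical slang map: each entry in `slang` is replaced by the
# entry at the same position in `standard`
slang    <- c("gr8", "apt", "omg")
standard <- c("great", "apartment", "wow")
tok <- tokens_replace(tok, pattern = slang, replacement = standard,
                      valuetype = "fixed")

# Drop common English stopwords such as "the" and "was"
tok <- tokens_remove(tok, pattern = stopwords("en"))

clean <- as.list(tok)
print(clean)
```

After cleaning, “gr8” and “great” are counted as the same word, which is the point of normalizing before you quantify anything.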
Text mining: From corpus to DFM
There is a lot of interest in quantifying and visualizing textual data. Texts reveal our thoughts, our personality, and the pulse of a society. We broadly refer to the quantification of text as text mining. Thanks to the developments in Natural Language Processing and Information Retrieval, we now have a wide selection of easy-to-use R libraries for cleaning, transforming, quantifying, and visualizing text.
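As a rough sketch of the corpus-to-DFM path in quanteda, the snippet below goes from raw text to a corpus, then to tokens, then to a document-feature matrix (DFM). The two sentences are toy inputs chosen for illustration.

```r
library(quanteda)

# Two toy documents
docs <- c(d1 = "Texts reveal our thoughts.",
          d2 = "Texts reveal the pulse of a society.")

toy_corpus <- corpus(docs)                           # 1. corpus: texts plus metadata
toy_tok    <- tokens(toy_corpus, remove_punct = TRUE) # 2. tokens: one word list per document
toy_dfm    <- dfm(toy_tok)                           # 3. DFM: documents in rows, words in columns

print(toy_dfm)
n_documents <- ndoc(toy_dfm)
n_features  <- nfeat(toy_dfm)
```

Each cell of the DFM holds how many times a word (feature) occurs in a document, which is the form most of the later techniques start from.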
During the 2012 US presidential election, Twitter, in partnership with several polling agencies, launched the Twitter Political Index. The idea was to track candidates’ popularity among voters based on the sentiment expressed in tweets. Back then, such an idea was a novelty. Nowadays, sentiment analysis of social media text is widely applied in marketing/PR, electoral forecasting, and sports analytics.
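One common way to score sentiment is to count words from a positive and a negative word list. The sketch below does that with quanteda. It is a generic lexicon approach, not the method behind the Twitter Political Index, and both the tweets and the tiny dictionary are hypothetical.

```r
library(quanteda)

# Hypothetical tweets
tweets <- c(t1 = "I love this candidate, great speech",
            t2 = "Terrible debate, awful answers",
            t3 = "Great rally but terrible traffic")

# Hypothetical mini lexicon; real analyses use a much larger published one
sentiment_dict <- dictionary(list(
  positive = c("love", "great", "good"),
  negative = c("terrible", "awful", "bad")
))

tweet_dfm <- dfm(tokens(tweets, remove_punct = TRUE))

# Collapse the word columns into two columns: positive and negative
sent <- as.matrix(dfm_lookup(tweet_dfm, dictionary = sentiment_dict))

# Net sentiment per tweet: positive hits minus negative hits
net <- sent[, "positive"] - sent[, "negative"]
print(net)
```

Word counting like this ignores negation and sarcasm, so it is best read as a coarse signal across many tweets and not as a verdict on any single one.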
We often wonder which users and what kinds of tweets go viral. In the divided United States of America, a question that may interest many of you is: which political party’s messages attract more attention and positive responses from the public?
In the following example, we will analyze 3,197 tweets from @GOP and 2,337 tweets from @TheDemocrats, all posted since July 2017.
Wordclouds are perhaps the most basic way of representing text data. You can simply use wordclouds to reveal important topics in a large body of tweets or to get a sense of user demographics based on keywords used in Twitter bio pages.
Do I need new libraries?
Yes, we will use quanteda for creating wordclouds.