automation

Text mining: Topic models

This post is a static and abbreviated version of this interactive tutorial on using R for social data analytics. What is a topic model? Have you dreamed of a day when algorithms can quickly scan through your textbooks and give you a bullet point summary? How convenient! No more tedious reading! Actually, there are algorithms out there that automatically summarize large-scale corpora. They are called topic models. In building topic models, we basically ask computers to discover some abstract topics from the text.

Text mining: Semantic network

This post is a static and abbreviated version of this interactive tutorial on using R for social data analytics. To understand what a semantic network looks like, go ahead and run the code below.

    library(quanteda)
    library(ggplot2)
    reviews_tok <- tokens(review_corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, remove_twitter = TRUE, remove_url = TRUE)
    reviews_tok <- tokens_select(reviews_tok, pattern = stopwords('en'), selection = 'remove')
    reviews_tok <- tokens_select(reviews_tok, min_nchar = 3, selection = 'keep')
    reviews_dfm <- dfm(reviews_tok)
    # create a feature co-occurrence matrix (FCM)
    review_fcm <- fcm(reviews_dfm)
    # extract the top 50 frequent terms from the FCM object
    feat <- names(topfeatures(review_fcm, 50))
    # trim the old FCM object into one that contains only the 50 frequent terms
    fcm_select <- fcm_select(review_fcm, pattern = feat)
    set.

Text mining: discover insights

This post is a static and abbreviated version of this interactive tutorial on using R for social data analytics. Now you are on course to try basic text mining techniques to extract insights from textual data. In this tutorial, we will try four techniques: simple word frequency, word cloud, n-grams, and keyness. Simple word frequency Suppose we want to see how often the word “noisy” appears in Airbnb reviews from each of the three cities.
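As a rough illustration of the simple word frequency idea, here is a minimal sketch in base R. The review texts and city names are made up for this example (the excerpt does not name the three cities), and `count_term` is a hypothetical helper, not a function from the tutorial or from quanteda.

```r
# Hypothetical toy data: these reviews and city labels are invented for illustration.
reviews <- data.frame(
  city = c("Boston", "Boston", "Chicago", "Chicago", "Seattle"),
  text = c("Great place but noisy at night.",
           "Quiet and clean.",
           "Noisy street, noisy neighbors.",
           "Lovely host.",
           "A bit noisy, otherwise fine."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count whole-word, case-insensitive matches of a term
# in each text. gregexpr() returns -1 when there is no match, so we count
# only positive match positions.
count_term <- function(texts, term) {
  pattern <- paste0("\\b", term, "\\b")
  matches <- gregexpr(pattern, texts, ignore.case = TRUE, perl = TRUE)
  vapply(matches, function(m) sum(m > 0), integer(1))
}

reviews$noisy_n <- count_term(reviews$text, "noisy")

# Total occurrences of "noisy" per city
noisy_by_city <- tapply(reviews$noisy_n, reviews$city, sum)
print(noisy_by_city)
```

Raw counts like these depend on how many reviews each city has, so in practice one would probably normalize by the number of reviews or tokens before comparing cities.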

Clean messy text

This post is a static and abbreviated version of this interactive tutorial on using R for social data analytics. Why text cleaning? Textual data are always messy. The data may contain words that, if taken out of context, would be meaningless. You may also encounter a group of different words that convey the same meaning. Or you might have to convert slang and acronyms into standard English, or emojis into something a computer can recognize.

From corpus to document-feature matrix

This post is a static and abbreviated version of this interactive tutorial on using R for social data analytics. Text mining: From corpus to DFM There is a lot of interest in quantifying and visualizing textual data. Texts reveal our thoughts, our personality, and the pulse of a society. We broadly refer to the quantification of text as text mining. Thanks to developments in Natural Language Processing and Information Retrieval, we now have a wide selection of easy-to-use R libraries for cleaning, transforming, quantifying, and visualizing text.

Sentiment analysis

This post is a static and abbreviated version of this interactive tutorial on using R for social data analytics. During the 2012 US presidential election, Twitter, in partnership with several polling agencies, launched something called the Twitter Political Index. The idea was to track candidates’ popularity among voters based on the sentiment expressed in tweets. Back then, such an idea was a novelty. Nowadays, sentiment analysis of social media text is widely applied in marketing/PR, electoral forecasting, and sports analytics.

Visualizing virality

This post is a static and abbreviated version of this interactive tutorial on using R for social data analytics. We often wonder which users and what kinds of tweets are more viral. In the divided United States of America, a question that may interest many of you is: which political party’s messages attract more attention and positive responses from the public? In the following example, we will analyze 3,197 tweets from @GOP and 2,337 tweets from @TheDemocrats since July 2017.

Make Wordclouds

This post is a static and abbreviated version of this interactive tutorial on using R for social data analytics. Wordclouds are perhaps the most basic way of representing text data. You can simply use wordclouds to reveal important topics in a large body of tweets or to get a sense of user demographics based on keywords used in Twitter bio pages. Do I need new libraries? Yes, we will use quanteda for creating wordclouds.

Collect YouTube Data

Tufekci (2014) wrote that “Twitter has become to social media scholars what the fruit fly is to biologists—a model organism.” But let’s not forget that there are many other web platforms out there. Arguably, Facebook provides far richer insights than Twitter given its comparatively larger user base and higher penetration rate around the world. Unfortunately, Facebook has shut down much of its API, making our previous tutorials on Facebook-based data mining obsolete.

Collect Twitter user info

This post is a static and abbreviated version of this interactive tutorial on using R for social data analytics. Collecting user information? That sounds creepy! Not at all. We will conduct the data collection in strict compliance with Twitter’s developer terms. In fact, just as it imposes rate limits on collecting tweets, Twitter tightly restricts what kinds of user profile data are available through its API.