During this session we will explore some more advanced topics in R, including alternative syntaxes and languages you can use from within R, alternative ways of organizing and analyzing your data, and applications to more advanced analyses such as text analysis, Bayesian analysis, and survey techniques.
dplyr, a package maintained by Hadley Wickham, provides useful alternative functions for data exploration analogous to what we explored in the last session. It also provides an alternative syntax for performing a series of operations that can increase efficiency. dplyr improves upon the earlier plyr package.
Suppose again that you load the QoG data we have used in previous classes.
dplyr provides a variety of functions that allow us to see our data differently or to change it to suit our needs. Check out this vignette for more details.
For example, you can filter data to only investigate a subset.
# Filter to only include countries with imputed polity scores
# greater than 5 (naming the result something other than "filter"
# avoids masking dplyr::filter)
filtered <- filter(qog, fh_ipolity2 > 5)
# Compare filtered to unfiltered (just checking some
# pertinent columns)
qog[1:10, c(2, 8:10)]
## cname aid_cpsc aid_crnc aid_crnio
## 1 Afghanistan NA 30 3
## 2 Albania NA 20 1
## 3 Algeria NA 20 5
## 4 Andorra NA NA NA
## 5 Angola NA 22 2
## 6 Antigua and Barbuda NA 9 2
## 7 Azerbaijan NA 20 2
## 8 Argentina NA 25 2
## 9 Australia 2606557696 NA NA
## 10 Austria 461456512 NA 1
filtered[1:10, c(2, 8:10)]
## cname aid_cpsc aid_crnc aid_crnio
## 1 Albania NA 20 1
## 2 Andorra NA NA NA
## 3 Antigua and Barbuda NA 9 2
## 4 Argentina NA 25 2
## 5 Australia 2606557696 NA NA
## 6 Austria 461456512 NA 1
## 7 Bahamas NA 1 1
## 8 Bangladesh NA 1 1
## 9 Armenia NA 21 2
## 10 Barbados NA 10 5
You can also use the group_by function to group and summarize.
summarise(group_by(qog, ht_region), mean(fh_ipolity2, na.rm = T))
## # A tibble: 10 × 2
## ht_region `mean(fh_ipolity2, na.rm = T)`
## <fctr> <dbl>
## 1 1. Eastern Europe and post Soviet Union 6.875867
## 2 2. Latin America 7.487500
## 3 3. North Africa & the Middle East 3.716280
## 4 4. Sub-Saharan Africa 5.147255
## 5 5. Western Europe and North America 9.904270
## 6 6. East Asia 6.500000
## 7 7. South-East Asia 4.892458
## 8 8. South Asia 5.609441
## 9 9. The Pacific 8.516396
## 10 10. The Caribbean 9.288803
sample_n allows you to sample a given number of random rows (sample_frac does the same by proportion).
sample_n(qog, size = 20)
count allows you to summarize observations according to a grouping, much like we did with table previously.
count(qog, ht_region)
## # A tibble: 10 × 2
## ht_region n
## <fctr> <int>
## 1 1. Eastern Europe and post Soviet Union 28
## 2 2. Latin America 20
## 3 3. North Africa & the Middle East 20
## 4 4. Sub-Saharan Africa 49
## 5 5. Western Europe and North America 27
## 6 6. East Asia 6
## 7 7. South-East Asia 11
## 8 8. South Asia 8
## 9 9. The Pacific 12
## 10 10. The Caribbean 13
In addition to providing these functions, dplyr inherits piping, a syntax originally found in the magrittr package. Pipes, written %>%, allow you to condense code by chaining multiple operations together rather than writing them in separate lines.
For example, suppose you wanted to keep only the QoG observations with high imputed Polity scores (fh_ipolity2 > 7), then group by region, and then find the mean value of fh_ipolity2, much as we did previously. You could construct these steps in separate lines:
filtered <- filter(qog, fh_ipolity2 > 7)
grouped <- group_by(filtered, ht_region)
result <- summarise(grouped, mean(fh_ipolity2, na.rm = T))
result
## # A tibble: 10 × 2
## ht_region `mean(fh_ipolity2, na.rm = T)`
## <fctr> <dbl>
## 1 1. Eastern Europe and post Soviet Union 9.039216
## 2 2. Latin America 8.466667
## 3 3. North Africa & the Middle East 9.083334
## 4 4. Sub-Saharan Africa 8.576987
## 5 5. Western Europe and North America 9.904270
## 6 6. East Asia 9.458334
## 7 7. South-East Asia 7.750000
## 8 8. South Asia 8.500000
## 9 9. The Pacific 9.236342
## 10 10. The Caribbean 9.288803
Or, with dplyr and the %>% notation, you could combine them into a single chain:
qog %>% filter(fh_ipolity2 > 7) %>% group_by(ht_region) %>% summarise(mean(fh_ipolity2,
na.rm = T))
## # A tibble: 10 × 2
## ht_region `mean(fh_ipolity2, na.rm = T)`
## <fctr> <dbl>
## 1 1. Eastern Europe and post Soviet Union 9.039216
## 2 2. Latin America 8.466667
## 3 3. North Africa & the Middle East 9.083334
## 4 4. Sub-Saharan Africa 8.576987
## 5 5. Western Europe and North America 9.904270
## 6 6. East Asia 9.458334
## 7 7. South-East Asia 7.750000
## 8 8. South Asia 8.500000
## 9 9. The Pacific 9.236342
## 10 10. The Caribbean 9.288803
You may be thinking: this all sounds great, but what’s a “tibble”…
In addition to many other great R tools, Hadley Wickham has created and maintains a suite of packages called the tidyverse that provide alternatives for many of the simple procedures we have seen so far in R. All of these are integrated into a single package (tidyverse) or can be called on their own (e.g., tidyr), but they are designed with a uniform syntax and set of options that makes it easier for users of one set of tools within the tidyverse to learn and transition to another. The piping implemented in dplyr, for example, might remind you of the + syntax used with ggplot2, which is part of the tidyverse.
Loading tidyverse attaches several packages, in particular ggplot2, tibble, tidyr, readr, purrr, and dplyr. tidyverse also encapsulates stringr and forcats, which we will cover in this segment. (As an aside, the cat jokes are strong here. Check out #Rcatladies on twitter for more.) You can read full descriptions of all the components here.
So to answer that question from above: "tibbles" are the tidyverse alternative to dataframes in base R. Note that while some tidyverse functions will take dataframes as well, if you need a tibble object you can always use as_tibble(). One of the big differences between a tibble and a dataframe is that, as previous lectures have cautioned, data types are not automatically transformed; for example, strings are never automatically converted to factors (more on this in a moment). Another difference is in subsetting behavior. You can still use $ as we have been to select a single variable to view or summarize, but you can also use [[:
data(iris)
iris[["Sepal.Length"]]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
If you want to go full-on in the native syntax, you can also use a pipe like so:
iris %>% .$Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
# or
iris %>% .[["Sepal.Length"]]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
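If you want to see the difference in printing behavior for yourself, here is a minimal sketch converting the iris dataframe to a tibble:

library(tibble)
iris_tbl <- as_tibble(iris)
# A tibble prints only the first ten rows along with each column's type
iris_tbl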
For more on tibbles, check out the relevant chapter from the assigned reading in R for Data Science.
forcats provides some clean methods for establishing and reordering factor levels and for recoding factor variables.
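As a small, hedged illustration using the ht_region factor from the QoG data:

library(forcats)
# Reorder levels by how frequently each appears
region_by_freq <- fct_infreq(qog$ht_region)
levels(region_by_freq)
# Or move a chosen level to the front manually
region_releveled <- fct_relevel(qog$ht_region,
    "5. Western Europe and North America")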
haven provides functions for reading in data from other programs' formats, such as SAS, SPSS, and STATA. Note that haven's functionality allows you to read in .dta files from STATA 14, whereas the base function in foreign only reads files through STATA 12.
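A minimal sketch (the filenames are hypothetical):

library(haven)
mystata <- read_dta("mysurvey.dta")        # STATA, including version 14
myspss <- read_sav("mysurvey.sav")         # SPSS
mysas <- read_sas("mysurvey.sas7bdat")     # SAS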
readr is particularly useful for reading data stored in spreadsheets/tables.
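For example (hypothetical filename), read_csv is the readr analogue of base R's read.csv; it is faster, returns a tibble, and leaves strings as strings:

library(readr)
mydata <- read_csv("mydata.csv")
# A tab-separated equivalent also exists
mytsv <- read_tsv("mydata.tsv")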
tidyr focuses on the structure of your data. Previous materials have covered the reshape package, and reshape2 provides additional, more intuitive functionality. Both are still available for reshaping your data, as we did in previous weeks, from long to wide format and vice versa. tidyr takes over where these packages leave off. The official vignette provides an excellent overview of how its functionality will be helpful in cleaning your data.
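A brief, hedged sketch of the reshaping idea with a made-up wide dataset: gather() moves from wide to long and spread() moves back, much like melt() and dcast() in reshape2.

library(tidyr)
# Hypothetical wide data: one row per country, one column per year
wide <- data.frame(country = c("A", "B"), y2000 = c(1, 2), y2001 = c(3, 4))
# Wide to long: collect the year columns into key/value pairs
long <- gather(wide, key = "year", value = "value", y2000:y2001)
# Long back to wide
wide.again <- spread(long, key = year, value = value)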
As you may by now be aware, strings are a very important component of data analysis, but they can be very cumbersome if handled improperly. This is where stringr comes in. In general, stringr allows you to manipulate character strings and to extract meaningful information stored in strings. For example, you might want to subset a vector of strings to only include certain values, or you may want to count how many times a certain value appears in string variables. We have previously covered how to do this kind of manipulation when you are interested in extracting full character strings (for example, every instance of "Afghanistan"). stringr operates alongside R's built-in capacity to handle regular expressions to perform a more advanced version of this task.
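A minimal stringr sketch with a made-up character vector:

library(stringr)
countries <- c("Afghanistan", "Albania", "Algeria", "South Africa")
# Which elements contain "Al"?
str_detect(countries, "Al")
# Keep only the matching elements
str_subset(countries, "Al")
# Count how many times "a" appears in each element
str_count(countries, "a")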
Regular expressions are a series of characters that you can use to format a search. Some of the underlying concepts will be familiar if you have ever used special search terms on, for example, internet searches or library websites, where quotation marks return items with an exact match and * serves as a wildcard (so that searching for doc* returns results for doc, document, documents, documentary, and so on). Regular expressions invoke this type of logic to allow you to search more precisely for the terms or character fragments you need.
The set of regular expressions is fairly long and complex, but some key characters you should know include:
( ) groups characters together
[ ] matches any one of the characters inside it (an "or" condition)
. matches any single character
\ is an escape character, so that you can reference punctuation-type characters literally even when they would otherwise be operators
^ anchors a match to the beginning of a string
$ anchors a match to the end of a string
The strings overview in R for Data Science provides a very comprehensive explanation of all of these and more.
You can access regular expressions outside of the stringr package simply by using base R's grep, grepl, or regexpr. Use this guide to see more examples of how the arguments for these functions allow you to incorporate regular expressions into your work. Note that these are great for textual analysis, but they are also great for systematically renaming variables, variable values, or even files. For example, suppose you read in a series of .csv files at once that you need to merge into a single dataset. Each file represents data from a different country and is named something like datasource-countryname-year.csv. You can use regular expressions to split this string by -, select the second element to extract the country name from the filename, and use it to populate a column named country in the final, merged dataset.
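A minimal sketch of that filename example, with made-up filenames:

# Hypothetical filenames following the datasource-countryname-year.csv pattern
files <- c("qog-albania-2010.csv", "qog-brazil-2010.csv")
# Split each filename on "-" and keep the second piece
countries <- sapply(strsplit(files, "-"), `[`, 2)
countries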
This cheatsheet provides a comprehensive list of different parts of regular expressions you can use, and this website will actually allow you to test your regular expressions as you create them, to see highlighted parts of text that the expression will correctly identify.
While technically not part of the main tidyverse set of packages, you should be aware of the lubridate package, which uses tidy-esque syntax to make your life easier in evaluating and converting date objects (which, like strings, can cause huge problems if handled incorrectly).
As mentioned in previous weeks, R often interprets dates as strings. This makes the creation of time-related variables (for example, duration variables) more challenging. lubridate lets you specify the format of the dates stored as strings or numbers in your data, whether mdy(), ymd(), or dmy(), and it automatically converts them into date objects. These functions also extend to hour and minute timestamps. You can also use the mutate function in the tidyverse to convert date/time information stored across multiple variables into a single date/timestamp variable using make_datetime. Even more explanation is available in the R for Data Science section.
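A brief sketch of these lubridate ideas with made-up dates:

library(lubridate)
# Parse strings by naming the order of the components
mdy("06/15/2016")
ymd("2016-06-15")
dmy("15 June 2016")
# Extensions that include timestamps
ymd_hm("2016-06-15 09:30")
# Combine separate year/month/day columns into one date-time
make_datetime(year = 2016, month = 6, day = 15, hour = 9, min = 30)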
All of the tools and techniques we have discussed so far in the course can be applied in combination to tackle even the most technical of projects. In this segment we will examine just a few advanced applications that weave together knowledge from earlier parts of this class, and we will end with some tips on how to continue learning on your own.
There are several approaches one can take to text analysis in R, ranging from basic text mining to more advanced models using LDA, for example. In this section, we will cover some basic text analysis techniques across several packages to illustrate the difference in approach between using base R and dplyr syntax.
As a preliminary step, before even worrying about packages and functions, you should organize your textual data so it can be read into R. Typically this means that each document in your "corpus" should be a separate .txt file within a single directory that you can easily reference from within R. This may require data cleaning done outside of R, for example if you need to OCR .pdf documents or download files before converting them to .txt. Alternatively, your data may already be stored in a .csv file, with columns indicating things like "author" and "date" and one containing the full "text" of each document in your corpus.
For now assume that you have multiple files, each with text in them. Once you have your text in the right place, you will want to load this as a directory:
# For mac
mydirectory <- file.path("~", "filelocation", "mytexts")
# For PC
mydirectory <- file.path("C:", "filelocation", "mytexts")
# Look at the separate files you loaded
dir(mydirectory)
The next steps will illustrate text analysis techniques with standard R syntax using the tm package. The first step here is to convert your text into a "Corpus"-type object.
library(tm)
# If all documents are separate .txt files
mydocs <- Corpus(DirSource(mydirectory))
# If text is a column in a dataframe, read in from a .csv for
# example
mydocs <- Corpus(VectorSource(dataframename$text))
Note, if you're reading the tm documentation, that Corpus is an alias for VCorpus (a volatile corpus); either may be used. Also note that when reading in your text data, the default is for documents to be read as English. If your data are in another language, you will have to specify it. For example, if your documents are in French:
mydocs <- Corpus(DirSource(mydirectory), readerControl = list(language = "fre"))
The language options are provided as IETF language codes and adopted through the NLP package. You can read more about IETF codes here.
Once you have a corpus object, you are ready to clean your text for analysis. The steps to do this with tm are more fully explained here, but I will highlight the general procedure using a built-in corpus accessible through the tm package, acq, which contains 50 news articles about corporate acquisitions from Reuters.
In general, you will want to:

Remove punctuation and numbers:

data(acq)
mydocs <- tm_map(acq, removePunctuation)
mydocs <- tm_map(mydocs, removeNumbers)

Remove any special characters that remain. For example, to exclude the # sign by replacing it with nothing:

for(i in seq(mydocs))
{
mydocs[[i]] <- gsub("#", "", mydocs[[i]])
}

Convert everything to lowercase:

mydocs <- tm_map(mydocs, tolower)

Remove stopwords. The second argument specifies the dictionary tm should use to find stopwords. If your data are not in English, use the appropriate stopword dictionary instead, but note that the options for identifying stopwords are available in fewer languages.

mydocs <- tm_map(mydocs, removeWords, stopwords("english"))

Stem the documents (reduce words to their roots) using the SnowballC package.

library(SnowballC)
mydocs <- tm_map(mydocs, stemDocument)

Strip extra whitespace:

mydocs <- tm_map(mydocs, stripWhitespace)

Finally, tell tm to treat the cleaned documents as plain text:

mydocs <- tm_map(mydocs, PlainTextDocument)
Once these preliminary steps are completed, your next general task is to create a DocumentTermMatrix object, which allows you to "quantify" the text in terms of word counts per document. Depending on what information you want, you may also need the transposed version (a TermDocumentMatrix, with terms as rows and documents as columns). This matrix will also serve as a basis for evaluating differences across your documents, if you like.
mydocsDTM <- DocumentTermMatrix(mydocs)
# If you want to evaluate differences among documents in
# terms of distance:
mydocsDTM.mat <- as.matrix(mydocsDTM)
# You can add back in rownames if you create an object
# listing the names in your text directory
names <- list.files("filelocation")
rownames(mydocsDTM.mat) <- names
In our case, the original articles in acq just came with numbers, which we can add back in accordingly:
rownames(mydocsDTM.mat) <- names(acq)
library(proxy)
##
## Attaching package: 'proxy'
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
# Note that to evaluate dissimilarity you will need to
# specify the method for calculating distance--check the
# documentation
dissim <- dist(mydocsDTM.mat, method = "cosine")
# You can also use this information to create a cluster plot
# with a built in plot function, again choosing a method. You
# could also add names to this manually with labels=names if
# you are not using built-in data
hcluster <- hclust(dissim, method = "ward.D")
plot(hcluster)
# You may also care about word frequency
freq <- colSums(mydocsDTM.mat)
ordered <- order(freq)
freq[head(ordered)]
## ability acceptance accompanied acquiring acres acting
## 1 1 1 1 1 1
freq2 <- sort(colSums(mydocsDTM.mat), decreasing = T)
head(freq2, 10)
## said dlrs pct mln company inc shares reut stock
## 186 100 70 65 63 53 52 50 46
## will
## 35
See the vignette linked above for how to graph word frequency using ggplot2. You can also restrict attention to only those words with frequency above a certain threshold.
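For example, tm's findFreqTerms lists the terms that appear at least as often as a lower bound you choose (25 here is arbitrary):

# Terms appearing at least 25 times across the corpus
findFreqTerms(mydocsDTM, lowfreq = 25)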
You may also be interested in the association between particular words, which is a beginning step toward topic analysis.
# Fill in words in the second argument, set a lower bound for
# correlation
findAssocs(mydocsDTM, "good", corlimit = 0.75)
## $good
## ability accompanied activity along analytical
## 1.00 1.00 1.00 1.00 1.00
## arena believes better big bilion
## 1.00 1.00 1.00 1.00 1.00
## book boost boosting calculated capitalization
## 1.00 1.00 1.00 1.00 1.00
## center command comments components constitute
## 1.00 1.00 1.00 1.00 1.00
## contrary contributing disagreed divestitures dividend
## 1.00 1.00 1.00 1.00 1.00
## elevating endeavors enormous exposure fact
## 1.00 1.00 1.00 1.00 1.00
## fetched fuel globally goes going
## 1.00 1.00 1.00 1.00 1.00
## heavily heavy helped investing jeffery
## 1.00 1.00 1.00 1.00 1.00
## kinds lesser lewis lift lipper
## 1.00 1.00 1.00 1.00 1.00
## mark might others outlined partially
## 1.00 1.00 1.00 1.00 1.00
## percentage performance perrin preclude premium
## 1.00 1.00 1.00 1.00 1.00
## probably questioned realistic reduced road
## 1.00 1.00 1.00 1.00 1.00
## rumor split suggests takes talk
## 1.00 1.00 1.00 1.00 1.00
## thereby theyre theyve travel true
## 1.00 1.00 1.00 1.00 1.00
## tuesday undervalued unhappy valuations variation
## 1.00 1.00 1.00 1.00 1.00
## volume water yesterday reflect market
## 1.00 1.00 1.00 0.95 0.91
## eckenfelder partial place strong operating
## 0.89 0.89 0.89 0.89 0.87
## spinoff higher make since profitable
## 0.85 0.81 0.81 0.81 0.75
If you're interested in more complex topic model designs, check out the topicmodels package. There are other packages like tm that you might want to explore as well, including qdap. See this vignette for more details.
Much of the syntax in the above functions will be familiar, even if the operations are not. The sort, head, and plot commands, as well as the optional arguments and the regular expressions component, weave in techniques that you have already learned. Yet as with many tasks in R, there is more than one way to perform text analysis. Many additional packages and functions build on tm, but another option is to approach text analysis through the language of dplyr using the tidytext package.
Here again we will use a built-in dataset, this time with 2,246 Associated Press articles from the topicmodels package, to walk through the major functions in the package. The same steps as above apply, however, if your data are not already in a corpus format.
library(topicmodels)
library(tidytext)  # provides tidy() methods for DocumentTermMatrix objects
data("AssociatedPress", package = "topicmodels")
AssociatedPress
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity : 99%
## Maximal term length: 18
## Weighting : term frequency (tf)
You can "clean up" these data in a single line of code. Because the AssociatedPress data are already in DocumentTermMatrix format, you don't need to separate the text out by word, but you could do that with unnest_tokens if your data are not structured this way (a brief sketch of unnest_tokens follows the output below). Note that unnest_tokens also allows you to change the n-gram for your analysis, that is, how many words or word segments are evaluated as a unit. For example, if your n-gram is 2, you are evaluating two-word phrases.
tidied <- tidy(AssociatedPress)
tidied
## # A tibble: 302,031 × 3
## document term count
## <int> <chr> <dbl>
## 1 1 adding 1
## 2 1 adult 2
## 3 1 ago 1
## 4 1 alcohol 1
## 5 1 allegedly 1
## 6 1 allen 1
## 7 1 apparently 2
## 8 1 appeared 1
## 9 1 arrested 1
## 10 1 assault 1
## # ... with 302,021 more rows
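To see what unnest_tokens does when your text is not already in a document-term matrix, here is a minimal sketch with a made-up two-document dataframe (the column names are hypothetical):

library(dplyr)
library(tidytext)
raw_text <- data.frame(document = 1:2,
    text = c("The quick brown fox", "jumped over the lazy dog"),
    stringsAsFactors = FALSE)
# One row per word, lowercased automatically
raw_text %>% unnest_tokens(word, text)
# One row per two-word phrase instead (n-gram of 2)
raw_text %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)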
Note that using grep and the other subsetting techniques we have explored, you can also narrow down which documents you want to look at within your broader set of texts. For example, if you only wanted to include articles that mention government actions or actors, you might select based on whether the document includes words like "government" or "president" before you tidy your text.
# This assumes your documents are stored in a dataframe with a
# column named "text"
selection <- "government|president"
final.docs <- mydocs[grepl(selection, mydocs$text), ]
You can perform similar tasks to the ones we did above with tm using dplyr syntax, such as removing stopwords, stemming, and obtaining word counts:
data("stop_words")
cleaned <- tidied %>% anti_join(stop_words, by = c(term = "word"))
cleaner <- cleaned %>% mutate(word = wordStem(term))
cleaner %>% count(word, sort = TRUE)
## # A tibble: 6,722 × 2
## word n
## <chr> <int>
## 1 report 1144
## 2 nation 1061
## 3 offici 935
## 4 time 878
## 5 peopl 872
## 6 presid 870
## 7 dai 833
## 8 govern 773
## 9 call 760
## 10 week 748
## # ... with 6,712 more rows
With these cleaned data, and with similar join syntax, we can also do other operations, such as sentiment analysis. You have several sentiment lexicons to choose from, but for this example we will simply use bing.
ap_sentiments <- cleaner %>% inner_join(get_sentiments("bing"),
by = c(term = "word"))
ap_sentiments
## # A tibble: 27,068 × 5
## document term count word sentiment
## <int> <chr> <dbl> <chr> <chr>
## 1 1689 bug 5 bug negative
## 2 2141 bug 1 bug negative
## 3 1489 disarray 1 disarrai negative
## 4 1507 disarray 1 disarrai negative
## 5 1631 disarray 1 disarrai negative
## 6 2013 disarray 1 disarrai negative
## 7 2101 disarray 2 disarrai negative
## 8 2219 disarray 1 disarrai negative
## 9 1287 uphold 1 uphold positive
## 10 1394 uphold 1 uphold positive
## # ... with 27,058 more rows
Check out this vignette for more on tidytext techniques, as well as this guide, featuring Jane Austen novels, for a more in-depth look at how to handle different corpus structures and for discussions of plotting sentiment.
Besides the divide in syntax between base R and the tidyverse, R can also be used in more advanced applications as a front end to other software and languages.
The rPython package allows you to run Python code directly through R. As is common with many adaptations of other languages or programs to R, its functions use R syntax to run code from the other language or program. This is useful if you need a particular piece of functionality from another program or language but are most comfortable or familiar with R's interface and syntax.
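A minimal sketch, assuming rPython is installed and configured with a working Python; the object names are made up:

library(rPython)
# Push an R object into Python, run some Python, pull the result back
python.assign("xs", 1:5)
python.exec("squares = [x ** 2 for x in xs]")
python.get("squares")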
RSelenium is another good example of this approach. If you are not familiar with Python syntax for scraping webpages, for example, or have a more complex scraping project that requires you to manipulate a website at the same time, you can invoke Selenium (a Java-based tool) through R to do the necessary tasks. Note that to do this, you may want to run a Docker container; instructions on how to do that are in the vignette linked above.
The Rcpp package allows you to leverage both R and C++ syntaxes and libraries. This can be particularly useful for optimizing computationally intensive model estimation or for creating your own packages. Read through the explanation in Advanced R for more.
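A toy sketch of the workflow: cppFunction compiles a snippet of C++ and exposes it to R (the function itself is just for illustration):

library(Rcpp)
cppFunction("
double sumSquares(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); i++) {
        total += x[i] * x[i];
    }
    return total;
}")
sumSquares(c(1, 2, 3))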
Several different tools exist for Bayesian analysis in R, many requiring external programs. BUGS/WinBUGS/OpenBUGS are a set of classic programs for several platforms, each with its own GUI. If you visit the BUGS site, you'll notice this warning:
“The programs are reasonably easy to use and come with a wide range of examples. There is, however, a need for caution. A knowledge of Bayesian statistics is assumed, including recognition of the potential importance of prior distributions, and MCMC is inherently less robust than analytic statistical methods. There is no in-built protection against misuse.”
While knowing your data and knowing your method are key in all types of statistical computing practice, it bears mentioning here that relying on packages or programs that automate elements of your analysis also requires you to recognize their functions, limitations, and underlying assumptions.
Note about NIMBLE: NIMBLE is another option for Bayesian analysis that builds on BUGS and allows you to write your own algorithms if you choose. Handle with care.
JAGS works much the same as BUGS and follows similar coding conventions, and at this stage it is probably the most commonly used of these options. If you are just getting started with JAGS, you will notice that it requires data in a list format, a full model specification, and instructions about the sampling process (burn-in, number of iterations, etc.).
A basic model will take the general form:
model{
# where n is number of observations in data
for(i in 1:n){
# each value of the dependent variable will be determined by some density function (e.g., dbern if it is bernoulli) with a given parameter (e.g., for shape or scale of that distribution)
y[i] ~ densityfn(param)
}
# parameter value of the distribution for y has a prior density of its own, with a parameter or set of parameters as well (e.g., dunif(0,1))
param ~ paramdensity(param2)
}
You can specify this model within R itself, or save it out as a .bug file that you later call with JAGS:
modelString = "model{
for(i in 1:n){
y[i] ~ densityfn(param)
}
param ~ paramdensity(param2)
}"
writeLines(modelString, con = "mymodel.bug")
You can then call this model directly from the .bug file (or a .jag file, if you saved it that way) in your working directory when running JAGS. Here I'm using the run.jags function from the runjags package. The arguments are effectively the same if you use rjags, as I do below.
post <- run.jags("mymodel.bug",
# what values am I interested in for results
monitor="param",
# data must be in the form of a list object: if you have binary or factor explanatory variables, you will want to use model.matrix and specify which variables are "contr.treatment"
data=list(y=y, n=length(y)),
# how large of a sample do you want? the bigger it is, the longer it will take, but complex models may take longer to converge
sample=10000,
# throw away the first several samples with poor mixing
burnin=1000,
# take only a subset of total samples: saves on memory and smooths results
thin=10,
# you can technically run more than one "chain" of this analysis or spread the sampling over cores (parallelization), but you don't need to
n.chains=1)
In terms of preparing your data, as stated above, if you have factor variables you will need to use model.matrix to expand your data matrix (that is, to create an indicator column for each level of your factors); factor variables are treated as "contrasts." For example, if you had a model predicting voting in terms of age, sex, and party, your data preparation might look like this:
y <- vote.variable
x.expand <- model.matrix(~age + sex + party, data = dataname,
contrasts = list(sex = "contr.treatment", party = "contr.treatment"))
# for modeling, you may need to remove the intercept column
x.expand <- x.expand[, -1]
Note that age is not a contrast because it is presumed to be continuous.
Using this setup for your data, suppose you want to estimate a logit (voted y/n); your JAGS model might look like:
model{
for (i in 1:n){
y[i] ~ dbern(p[i])
y.hat[i] <- b0 + b1*x1[i] + b2*x2[i] + b3*x3[i]
p[i] <- 1/(1+ exp(-y.hat[i]))
}
b0 ~ dnorm(0,1.0e-4)
b1 ~ dnorm(0,1.0e-4)
b2 ~ dnorm(0,1.0e-4)
b3 ~ dnorm(0,1.0e-4)
}
In this setup you have one b for each of your covariates, you loop over each observation in your independent variables, and p[i] allows you to estimate the logit. Choices of priors are up to you; these are all the same here, but note that you specify one for each coefficient.
Alternatively, you could collapse some of this with inprod. Here you will note I have switched to using jags.model and related functions from the rjags package. Either syntax or setup will work.
model{
for (i in 1:n){
y[i] ~ dbern(p[i])
y.hat[i] <- b0 + inprod(b[], x[i,])
p[i] <- 1/(1+ exp(-y.hat[i]))
}
b0 ~ dnorm(0,1.0e-2)
for(j in 1:k){
b[j] ~ dnorm(0,1.0e-4)
}
}
expanded.model <- jags.model("mymodel.bug", data = list(y = y, x = x.expand,
    k = ncol(x.expand), n = length(y)))
update(expanded.model, 10000)
expanded.model.j <- coda.samples(expanded.model, c("b0", "b", "p"), n.iter = 100000, thin = 10)
summary(expanded.model.j)
In this case, you have expanded a grid similar to what you did with model.matrix, so that the set of coefficients you have tracks the number of explanatory variables you include. Note that if you do this, you will want to loop over your b's to provide a prior for each, assuming each gets the same prior. Also note that in creating the second loop we have said "loop over each j element for values 1:k," meaning that you will now need to supply k (the number of explanatory variable columns in your model) when you provide your data list.
Finally, note that you could also skip the messy business with y.hat and use the logit() function:
model{
for (i in 1:n){
y[i] ~ dbern(p[i])
logit(p[i]) <- b0 + inprod(b[], x[i,])
}
b0 ~ dnorm(0,1.0e-2)
for(j in 1:k){
b[j] ~ dnorm(0,1.0e-4)
}
}
For more example code and some video lectures and slides, check out these resources:
This Github repo with rjags scripts, simulated data, and .bug files for various models
Besides using BUGS and JAGS, another option available to you is Stan, a project headed by Andrew Gelman. Although you can access it through R with rstan, it features syntax much closer to C++. It can be more efficient for particularly complex hierarchical models and includes the No-U-Turn Sampler (NUTS), an optimization of Hamiltonian Monte Carlo. Stan also has a great set of documentation, including a language manual, examples, and tutorials to help you get started and to compare syntax with BUGS, as well as a user community Google group to which you can post issues and questions.
As you begin to explore R more on your own, you will need to adapt to different coding styles and packages in order to do different types of analyses. Learning how to learn is also a key part of the educational process (and certainly a staple of graduate school). Below are some quick guides on how to teach yourself new techniques in R, with examples from survey and Bayesian analyses.
Suppose that as a new user of R (useR), you are confronted with a survey analysis project. Taking the tools you have already learned in this class, you feel comfortable reading in survey data and doing basic summary analysis and visualization. When it comes to more complex analyses, however, you need guidance on survey weights and how to handle them, because your results will depend on the weights you choose and the type of sample you have.
If you google "survey weights in R" or "survey analysis R", you will see that a package called survey has documentation in the first few hits. You will also notice a few posts on R-bloggers and similar sites that provide explainers related to the package, situated in a broader context about survey analysis.
Having done this initial search, you have several components of the information you need. First, you discover that the survey package can help you create a design object with svydesign() that you then use in later analysis. With this function, you can directly specify the name of the weights variable if it already exists in the data. You will also specify the stratification of your sample (or samples). A sketch appears below.
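A hedged sketch of that design step; every variable name here (wtvar, stratum, mysurveydata) is a placeholder for your own data:

library(survey)
# ids identifies sampling units (~1 if there is no clustering),
# strata gives the stratification variable, and weights points to an
# existing weights variable in the data
mydesign <- svydesign(ids = ~1, strata = ~stratum, weights = ~wtvar,
    data = mysurveydata)
# Basic analysis then uses survey's functions with the design object
svymean(~age, design = mydesign, na.rm = TRUE)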
Once you have created a design object, you will use the built-in functions in survey to do your basic analysis. These functions will take you far if you are manipulating data collected by other researchers. But what if the survey data are your own?
Here you may need to calculate weights yourself. To do this, you will need to use the rake() function in survey, which requires both sample and population margins.
Suppose you had voting data for individuals and want to compute weights by party (R, D, I) and income bracket (High, Medium, Low). Take note of the information that blog posts and the package documentation expect you to supply; you can learn this primarily by investigating the arguments of the functions and the type of information they take.
In this case, both of these variables of interest are factor variables with 3 levels each, and you will need relative frequencies for each level. Following the example in this R-bloggers post, you would create dataframes describing the population characteristics accordingly:
distr.party <- data.frame(party = c("R", "D", "I"), Freq = nrow(mydata) *
c(0.45, 0.5, 0.05))
distr.income <- data.frame(income = c("High", "Medium", "Low"),
Freq = nrow(mydata) * c(0.01, 0.6, 0.39))
Now you will use rake() to relate these values to your sample.
# mydata.nowt is the svydesign object created without weights
raked.leaves <- rake(design = mydata.nowt, sample.margins = list(~party,
    ~income), population.margins = list(distr.party, distr.income))
Unlike the implementation of packages in other programs (cough STATA), you may need to examine and adjust these weights to be within reasonable bounds. This again requires knowing your data.
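For example, survey's trimWeights function lets you cap extreme weights after raking; the bounds below are arbitrary and just for illustration:

# Constrain the raked weights to lie between 0.3 and 3
trimmed <- trimWeights(raked.leaves, lower = 0.3, upper = 3)
summary(weights(trimmed))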
Note that survey has its own syntax for estimating models using survey design objects, for example svyglm instead of the regular glm. Look at an example from the package documentation:
library(survey)
data(api)
dclus2 <- svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)
model0 <- svyglm(I(sch.wide == "Yes") ~ ell + meals + mobility,
design = dclus2, family = quasibinomial())
summary(model0)
##
## Call:
## svyglm(formula = I(sch.wide == "Yes") ~ ell + meals + mobility,
## design = dclus2, family = quasibinomial())
##
## Survey design:
## svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.82118 0.76726 1.070 0.2916
## ell -0.05462 0.02220 -2.461 0.0188 *
## meals 0.02873 0.01820 1.579 0.1232
## mobility 0.01460 0.03017 0.484 0.6313
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.9856512)
##
## Number of Fisher Scoring iterations: 4
You will notice that the syntax looks very similar to what you are familiar with from our discussion of glm: you specify a dependent variable ~ independent variables, the design object where previously you specified data, and the family of model you would like to estimate (in this case a quasibinomial).
You may not be familiar with the I() notation used to specify the dependent variable. Do not panic. Recall our previous discussion that you can modify variables in place within model commands in R, just as we did with log(y). If you search for "I() function in R" you will easily find documentation informing you that I() is an inhibiting function: it tells R to treat its contents "as is," so that inside a formula the expression (here, the logical comparison sch.wide == "Yes") is evaluated literally rather than interpreted as formula syntax. (The same function is what "protects" dataframes so that strings are not converted to factors, for example.)
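As a quick illustration with built-in data (not from the survey example), I() forces the expression inside a formula to be evaluated literally:

# Without I(), ^ would be read as formula syntax; with I(), we
# literally square the weight variable
summary(lm(mpg ~ wt + I(wt^2), data = mtcars))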
Now that you have some basic information about the relevant package and how it functions, you may have more specific questions that pertain to your data. Perhaps you need more information about cluster bootstrapping for resampling. These are questions you can take to Google again, but you might also benefit from taking them directly to specialty sites like Stack Overflow. In doing this, you might find the page of Thomas Lumley (the survey package author), and you will likely find his slides covering more complex applications of the package than the R-bloggers posts do. You will also discover that Lumley wrote a book, "Complex Surveys: A Guide to Analysis Using R," that you could reference. [Note: Because R is open-source, be wary of paying large sums of money for books on its functions or packages; many resources exist for free!]
Now that you have completed basic analysis and found a few preliminary answers to your initial tasks and questions, iterate the search process. Try sample code from one site, use examples from a vignette related to a package, and google any errors you receive and read about other users who have had those issues on Stack Overflow. At this point, if not sooner, you may go beyond the lonely internet life to reach out to friends, colleagues, and professors who may have done similar work.
If no one is available locally to help, and answers on the internet are sparse, but you see a technique referenced in a paper, you may also be able to directly find code for your problem on an author's personal webpage or Github repository. You can also perform a general search of Github repos to look for code directly (try searching cluster robust standard errors).
Step 4 won’t be for everyone, but in a pinch you can often write your own functions to solve problems that are not well-managed by existing packages. Some examples might include things like cluster robust standard error calculations for complex, rare model types; uncommon types of cross-validation; particular estimators or samplers. For many of these, the internet will still be your best friend as you are (often) converting math to code. Many times, investing in something of your own design is not necessary because current packages have adequate functionality, but if you find yourself doing the same type of analysis routinely and needing something beyond the prefab, coding your own solution can often be efficient in the long-run. This is how packages are introduced in the first place.
The survey example seems relatively straightforward: identify the relevant package for your needs and use internet resources to help you implement its functions. But what if there are multiple such packages: how do you choose?
In addition to complementary programs like JAGS and Stan that can be used with an R interface, there are several packages designed to do Bayesian estimation without needing to fully specify a model mathematically. If you are familiar with Bayes, you can imagine why this creates more need than ever to truly understand what kinds of assumptions a package operates under and what goes on under the hood.
Here again you may begin by simply googling "Bayesian packages R" or a specific task in Bayesian analysis you want to complete. If you are already aware that multiple packages exist, however (or become aware, and overwhelmed, quickly by googling), another option is to check out the CRAN Task View. The Task Views do not cover many categories, but for each included category they provide an overview of common packages and their relative advantages. In the Task View for Bayes, you will see a host of packages specifically designed for Bayesian analysis, as well as a series of packages that include functions for particular modeling challenges. If you have a specific task in mind, you can easily search this page for mentions; otherwise, everything is in alphabetical order. Often, if you are looking to do Bayesian analysis in a canned routine through one of these packages, your problem is already well-defined by the type of model you are trying to estimate or the type of data you have at your disposal; these will also help you narrow the set of possible candidates.
While you should always prioritize the package that has functions most suited to your needs, having a mental catalog of general functions that cover your area of interest will help you to identify possible places to start in the future as new analysis tasks arrive. This section will highlight just a few useful packages to keep in mind.
The arm package is a general package useful for both MLE and Bayes, with code that accompanies Gelman and Hill (2007). Specifically relevant to Bayes, arm includes the bayesglm() function. When investigating this function, take note of its default behaviors. For example, prior means default to 0 if not set manually. If you specify only one prior mean but have several predictors (independent variables excluding the intercept), that single prior mean is assumed to apply to all predictors; otherwise, you need a vector of prior means of length equal to the number of predictors.
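A hedged sketch of those defaults in action, reusing the hypothetical voting variables from earlier (vote, age, sex, party):

library(arm)
# With several predictors, a single prior.mean (and prior.scale) is
# applied to all of them; supply vectors to vary them by predictor
m1 <- bayesglm(vote ~ age + sex + party, data = dataname,
    family = binomial(link = "logit"),
    prior.mean = 0, prior.scale = 2.5)
display(m1)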
MCMCpack contains a number of model types and postestimation analysis functions for Bayes. Andrew Martin and Kevin Quinn provide a set of slides explaining its functions and its relative advantages over something like BUGS. MCMCpack uses coda as a basis for posterior analysis, so you will need to consult extra documentation to understand its full range of functionality. It does include built-in functions for calculating Bayes factors and posterior model probabilities. Also note that while functions like MCMCregress do not require you to specify priors in quite the same way as, say, JAGS, which may feel more comfortable, they do require you to think about how you want to calculate the marginal likelihood (using the argument marginal.likelihood=) if you want to deviate from the default.
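For instance, a minimal MCMCregress call (the formula and data are hypothetical), showing where the marginal.likelihood argument fits:

library(MCMCpack)
post1 <- MCMCregress(y ~ x1 + x2, data = dataname,
    burnin = 1000, mcmc = 10000,
    marginal.likelihood = "Laplace")
# The result is a coda mcmc object, so coda's summary and plot
# methods apply
summary(post1)
plot(post1)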
MCMCglmm includes both standard glm’s estimated with MCMC as well as some rarer types that you might be interested in (e.g., zero-inflated and multi-trait models). While similar to lme4 in several ways, MCMCglmm also handles situations where packages like lme4 fail to converge.
lme4 is another general package for glm's, in particular emphasizing multilevel models, and several Bayes-specific packages require it as a dependency or extend its functions. This vignette gives a good overview of the differences in lme4's syntax and approach relative to base glm specifications.
In addition to useful functions for hurdle and zero-inflated models in MLE, pscl also includes Bayesian IRT.
MNP is specifically designed for the implementation of multinomial probit models.
Bolstad provides the bayes.lm function for fitting Bayesian linear models.
blme contains the functions blmer and bglmer for fitting linear or generalized linear mixed-effects models. The syntax here is much the same as you would find in lm, lmer, or glmer, but the functions in blme allow you to conduct Bayesian or penalized MLE analyses.
brms is a newer package building on Stan that uses syntax similar to lme4. It includes less common models, such as zero-inflated or survival models. It also allows you to do posterior prediction and cross-validation.
In the coming weeks, you will learn how to carry out many of the same techniques we have discussed in R using STATA. Learning both will give you the option to select whichever is most comfortable and convenient going forward, but note that their functionality may differ on some dimensions. Some methods or routines have more developed packages for STATA than for R and vice versa, often as a function of departmental conventions (where, e.g., sociologists may be more apt to use STATA). If you are comfortable with the syntax of both and discover that there are not many options for the analysis you want to conduct in R, you may also want to explore STATA code for similar functions. For example, code for survey analysis and for time series analysis is very well developed in STATA, whereas the associated R packages may require you to do more of your own calculations or much more data transformation than you would prefer.