During this session we will explore some more advanced topics in R, including alternative syntaxes and languages you can use from within R, alternative ways of organizing and analyzing your data, and applications to more advanced analyses such as text analysis, Bayesian analysis, and survey techniques.
dplyr, a package maintained by Hadley Wickham, provides useful alternative functions for data exploration analogous to what we explored in the last session. It also provides an alternative syntax for performing a series of operations that can increase efficiency. dplyr improves upon the earlier plyr package.
Suppose again that you load the QoG data we have used in previous classes.
dplyr provides a variety of functions that allow us to see our data differently or to change it to suit our needs. Check out this vignette for more details.
For example, you can filter data to only investigate a subset.
# Filter to only include countries with imputed polity scores
# greater than 5 (naming the result something other than "filter"
# avoids masking dplyr::filter)
filtered <- filter(qog, fh_ipolity2 > 5)
# Compare filtered to unfiltered (just checking some
# pertinent columns)
qog[1:10, c(2, 8:10)]
## cname aid_cpsc aid_crnc aid_crnio
## 1 Afghanistan NA 30 3
## 2 Albania NA 20 1
## 3 Algeria NA 20 5
## 4 Andorra NA NA NA
## 5 Angola NA 22 2
## 6 Antigua and Barbuda NA 9 2
## 7 Azerbaijan NA 20 2
## 8 Argentina NA 25 2
## 9 Australia 2606557696 NA NA
## 10 Austria 461456512 NA 1
filtered[1:10, c(2, 8:10)]
## cname aid_cpsc aid_crnc aid_crnio
## 1 Albania NA 20 1
## 2 Andorra NA NA NA
## 3 Antigua and Barbuda NA 9 2
## 4 Argentina NA 25 2
## 5 Australia 2606557696 NA NA
## 6 Austria 461456512 NA 1
## 7 Bahamas NA 1 1
## 8 Bangladesh NA 1 1
## 9 Armenia NA 21 2
## 10 Barbados NA 10 5
You can also use the group_by function to group and summarize.
summarise(group_by(qog, ht_region), mean(fh_ipolity2, na.rm = T))
## # A tibble: 10 × 2
## ht_region `mean(fh_ipolity2, na.rm = T)`
## <fctr> <dbl>
## 1 1. Eastern Europe and post Soviet Union 6.875867
## 2 2. Latin America 7.487500
## 3 3. North Africa & the Middle East 3.716280
## 4 4. Sub-Saharan Africa 5.147255
## 5 5. Western Europe and North America 9.904270
## 6 6. East Asia 6.500000
## 7 7. South-East Asia 4.892458
## 8 8. South Asia 5.609441
## 9 9. The Pacific 8.516396
## 10 10. The Caribbean 9.288803
sample_n allows you to sample a given number of random rows (sample_frac does the same by proportion).
sample_n(qog, size = 20)
count allows you to summarize observations according to a grouping, much like we did with table previously.
count(qog, ht_region)
## # A tibble: 10 × 2
## ht_region n
## <fctr> <int>
## 1 1. Eastern Europe and post Soviet Union 28
## 2 2. Latin America 20
## 3 3. North Africa & the Middle East 20
## 4 4. Sub-Saharan Africa 49
## 5 5. Western Europe and North America 27
## 6 6. East Asia 6
## 7 7. South-East Asia 11
## 8 8. South Asia 8
## 9 9. The Pacific 12
## 10 10. The Caribbean 13
In addition to providing these functions, dplyr inherits piping, a syntax originally found in the magrittr package. Pipes, written %>%, allow you to condense code by chaining multiple operations together rather than writing them in separate lines.
For example, suppose you wanted to keep only the QoG observations with high imputed Polity scores (fh_ipolity2 > 7), then group by region, and then find the mean value of fh_ipolity2, much as we did previously. You could construct these steps in separate lines:
filtered <- filter(qog, fh_ipolity2 > 7)
grouped <- group_by(filtered, ht_region)
result <- summarise(grouped, mean(fh_ipolity2, na.rm = T))
result
## # A tibble: 10 × 2
## ht_region `mean(fh_ipolity2, na.rm = T)`
## <fctr> <dbl>
## 1 1. Eastern Europe and post Soviet Union 9.039216
## 2 2. Latin America 8.466667
## 3 3. North Africa & the Middle East 9.083334
## 4 4. Sub-Saharan Africa 8.576987
## 5 5. Western Europe and North America 9.904270
## 6 6. East Asia 9.458334
## 7 7. South-East Asia 7.750000
## 8 8. South Asia 8.500000
## 9 9. The Pacific 9.236342
## 10 10. The Caribbean 9.288803
Or, with dplyr and the %>% notation, you could combine them into a single chain:
qog %>% filter(fh_ipolity2 > 7) %>% group_by(ht_region) %>% summarise(mean(fh_ipolity2,
na.rm = T))
## # A tibble: 10 × 2
## ht_region `mean(fh_ipolity2, na.rm = T)`
## <fctr> <dbl>
## 1 1. Eastern Europe and post Soviet Union 9.039216
## 2 2. Latin America 8.466667
## 3 3. North Africa & the Middle East 9.083334
## 4 4. Sub-Saharan Africa 8.576987
## 5 5. Western Europe and North America 9.904270
## 6 6. East Asia 9.458334
## 7 7. South-East Asia 7.750000
## 8 8. South Asia 8.500000
## 9 9. The Pacific 9.236342
## 10 10. The Caribbean 9.288803
You may be thinking: this all sounds great, but what’s a “tibble”…
In addition to many other great R tools, Hadley Wickham has created and maintains a suite of packages called the tidyverse that provide alternatives for many of the simple procedures we have seen so far in R. All of these are integrated into a single package (tidyverse) or can be called on their own (e.g., tidyr), but they are designed with a uniform syntax and set of options that makes it easier for users of one set of tools within the tidyverse to learn and transition to another. The piping implemented in dplyr, for example, might remind you of the + syntax used with ggplot2, which is part of the tidyverse.
Loading tidyverse attaches several packages, in particular ggplot2, tibble, tidyr, readr, purrr, and dplyr. tidyverse also encapsulates stringr and forcats, which we will cover in this segment. (As an aside, the cat jokes are strong here. Check out #Rcatladies on twitter for more.) You can read full descriptions of all the components here.
So to answer that question from above: "tibbles" are the tidyverse alternative to dataframes in base R. Note that while some tidyverse functions will take dataframes as well, if you need a tibble object you can always use as_tibble(). One of the big differences between a tibble and a dataframe is that, as previous lectures have cautioned, data types are not automatically transformed; for example, strings are never automatically converted to factors (more on this in a moment). Another difference is in subsetting behavior. You can still use $ as we have been to select a single variable to view or summarize, but you can also use [[:
data(iris)
iris[["Sepal.Length"]]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
If you want to go full-on in the native syntax, you can also use a pipe like so:
iris %>% .$Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
# or
iris %>% .[["Sepal.Length"]]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
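If you want to see the difference in printing behavior for yourself, here is a minimal sketch converting the iris dataframe to a tibble:

library(tibble)
iris_tbl <- as_tibble(iris)
# A tibble prints only the first ten rows along with each column's type
iris_tbl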
For more on tibbles, check out the relevant chapter from the assigned reading in R for Data Science.
forcats provides some clean methods for establishing and reordering factor levels and for recoding factor variables.
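As a small, hedged illustration using the ht_region factor from the QoG data:

library(forcats)
# Reorder levels by how frequently each appears
region_by_freq <- fct_infreq(qog$ht_region)
levels(region_by_freq)
# Or move a chosen level to the front manually
region_releveled <- fct_relevel(qog$ht_region,
    "5. Western Europe and North America")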
haven provides functions for reading in data from other programs' formats, such as SAS, SPSS, and STATA. Note that haven's functionality allows you to read in .dta files from STATA 14, whereas the base function in foreign only reads files through STATA 12.
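A minimal sketch (the filenames are hypothetical):

library(haven)
mystata <- read_dta("mysurvey.dta")        # STATA, including version 14
myspss <- read_sav("mysurvey.sav")         # SPSS
mysas <- read_sas("mysurvey.sas7bdat")     # SAS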
readr is particularly useful for reading data stored in spreadsheets/tables.
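For example (hypothetical filename), read_csv is the readr analogue of base R's read.csv; it is faster, returns a tibble, and leaves strings as strings:

library(readr)
mydata <- read_csv("mydata.csv")
# A tab-separated equivalent also exists
mytsv <- read_tsv("mydata.tsv")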
tidyr focuses on the structure of your data. Previous materials have covered the reshape package, and reshape2 provides additional, more intuitive functionality. Both are still available for reshaping your data, as we did in previous weeks, from long to wide format and vice versa. tidyr takes over where these packages leave off. The official vignette provides an excellent overview of how its functionality will be helpful in cleaning your data.
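A brief, hedged sketch of the reshaping idea with a made-up wide dataset: gather() moves from wide to long and spread() moves back, much like melt() and dcast() in reshape2.

library(tidyr)
# Hypothetical wide data: one row per country, one column per year
wide <- data.frame(country = c("A", "B"), y2000 = c(1, 2), y2001 = c(3, 4))
# Wide to long: collect the year columns into key/value pairs
long <- gather(wide, key = "year", value = "value", y2000:y2001)
# Long back to wide
wide.again <- spread(long, key = year, value = value)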
As you may by now be aware, strings are a very important component of data analysis, but they can be very cumbersome if handled improperly. This is where stringr comes in. In general, stringr allows you to manipulate character strings and to extract meaningful information stored in strings. For example, you might want to subset a vector of strings to only include certain values, or you may want to count how many times a certain value appears in string variables. We have previously covered how to do this kind of manipulation when you are interested in extracting full character strings (for example, every instance of "Afghanistan"). stringr operates alongside R's built-in capacity to handle regular expressions to perform a more advanced version of this task.
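A minimal stringr sketch with a made-up character vector:

library(stringr)
countries <- c("Afghanistan", "Albania", "Algeria", "South Africa")
# Which elements contain "Al"?
str_detect(countries, "Al")
# Keep only the matching elements
str_subset(countries, "Al")
# Count how many times "a" appears in each element
str_count(countries, "a")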
Regular expressions are a series of characters that you can use to format a search. Some of the underlying concepts will be familiar if you have ever used special search terms on, for example, internet searches or library websites, where quotation marks return items with an exact match and * serves as a wildcard (so that searching for doc* returns results for doc, document, documents, documentary, and so on). Regular expressions invoke this type of logic to allow you to search more precisely for the terms or character fragments you need.
The set of regular expressions is fairly long and complex, but some key characters you should know include:
( ) groups characters together
[ ] matches any one of the characters inside it (an "or" condition)
. matches any single character
\ is an escape character, so that you can reference punctuation-type characters literally even when they would otherwise be operators
^ anchors a match to the beginning of a string
$ anchors a match to the end of a string
The strings overview in R for Data Science provides a very comprehensive explanation of all of these and more.
You can access regular expressions outside of the stringr package simply by using base R's grep, grepl, or regexpr. Use this guide to see more examples of how the arguments for these functions allow you to incorporate regular expressions into your work. Note that these are great for textual analysis, but they are also great for systematically renaming variables, variable values, or even files. For example, suppose you read in a series of .csv files at once that you need to merge into a single dataset. Each file represents data from a different country and is named something like datasource-countryname-year.csv. You can use regular expressions to split this string by -, select the second element to extract the country name from the filename, and use it to populate a column named country in the final, merged dataset.
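A minimal sketch of that filename example, with made-up filenames:

# Hypothetical filenames following the datasource-countryname-year.csv pattern
files <- c("qog-albania-2010.csv", "qog-brazil-2010.csv")
# Split each filename on "-" and keep the second piece
countries <- sapply(strsplit(files, "-"), `[`, 2)
countries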
This cheatsheet provides a comprehensive list of different parts of regular expressions you can use, and this website will actually allow you to test your regular expressions as you create them, to see highlighted parts of text that the expression will correctly identify.
While technically not part of the main tidyverse set of packages, you should be aware of the lubridate package, which uses tidy-esque syntax to make your life easier in evaluating and converting date objects (which, like strings, can cause huge problems if handled incorrectly).
As mentioned in previous weeks, R often interprets dates as strings. This makes the creation of time-related variables (for example, duration variables) more challenging. lubridate lets you specify the format of the dates stored as strings or numbers in your data, whether mdy(), ymd(), or dmy(), and it automatically converts them into date objects. These functions also extend to hour and minute timestamps. You can also use the mutate function in the tidyverse to convert date/time information stored across multiple variables into a single date/timestamp variable using make_datetime. Even more explanation is available in the R for Data Science section.
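A brief sketch of these lubridate ideas with made-up dates:

library(lubridate)
# Parse strings by naming the order of the components
mdy("06/15/2016")
ymd("2016-06-15")
dmy("15 June 2016")
# Extensions that include timestamps
ymd_hm("2016-06-15 09:30")
# Combine separate year/month/day columns into one date-time
make_datetime(year = 2016, month = 6, day = 15, hour = 9, min = 30)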
All of the tools and techniques we have discussed so far in the course can be applied in combination to tackle even the most technical of projects. In this segment we will examine just a few advanced applications that weave together knowledge from earlier parts of this class, and we will end with some tips on how to continue learning on your own.
There are several approaches one can take to text analysis in R, ranging from basic text mining to more advanced models using LDA, for example. In this section, we will cover some basic text analysis techniques across several packages to illustrate the difference in approach between using base R and dplyr syntax.
As a preliminary step, before even worrying about packages and functions, you should organize your textual data so it can be read into R. Typically this means that each document in your "corpus" should be a separate .txt file within a single directory that you can easily reference from within R. This may require data cleaning done outside of R, for example if you need to OCR .pdf documents or download files before converting them to .txt. Alternatively, your data may already be stored in a .csv file, with columns indicating things like "author" and "date" and one containing the full "text" of each document in your corpus.
For now assume that you have multiple files, each with text in them. Once you have your text in the right place, you will want to load this as a directory:
# For mac
mydirectory <- file.path("~", "filelocation", "mytexts")
# For PC
mydirectory <- file.path("C:", "filelocation", "mytexts")
# Look at the separate files you loaded
dir(mydirectory)
The next steps will illustrate text analysis techniques with standard R syntax using the tm package. The first step here is to convert your text into a "Corpus"-type object.
library(tm)
# If all documents are separate .txt files
mydocs <- Corpus(DirSource(mydirectory))
# If text is a column in a dataframe, read in from a .csv for
# example
mydocs <- Corpus(VectorSource(dataframename$text))
Note, if you're reading the tm documentation, that Corpus is an alias for VCorpus (a volatile corpus); either may be used. Also note that when reading in your text data, the default is for documents to be read as English. If your data are in another language, you will have to specify it. For example, if your documents are in French:
mydocs <- Corpus(DirSource(mydirectory), readerControl = list(language = "fre"))
The language options are provided as IETF language codes and adopted through the NLP package. You can read more about IETF codes here.
Once you have a corpus object, you are ready to clean your text for analysis. The steps to do this with tm are more fully explained here, but I will highlight the general procedure using a built-in corpus accessible through the tm package, acq, which contains 50 news articles about corporate acquisitions from Reuters.
In general, you will want to:

Remove punctuation and numbers:

data(acq)
mydocs <- tm_map(acq, removePunctuation)
mydocs <- tm_map(mydocs, removeNumbers)

Remove any special characters that remain. For example, to exclude the # sign by replacing it with nothing:

for(i in seq(mydocs))
{
mydocs[[i]] <- gsub("#", "", mydocs[[i]])
}

Convert everything to lowercase:

mydocs <- tm_map(mydocs, tolower)

Remove stopwords. The second argument specifies the dictionary tm should use to find stopwords. If your data are not in English, use the appropriate stopword dictionary instead, but note that the options for identifying stopwords are available in fewer languages.

mydocs <- tm_map(mydocs, removeWords, stopwords("english"))

Stem the documents (reduce words to their roots) using the SnowballC package.

library(SnowballC)
mydocs <- tm_map(mydocs, stemDocument)

Strip extra whitespace:

mydocs <- tm_map(mydocs, stripWhitespace)

Finally, tell tm to treat the cleaned documents as plain text:

mydocs <- tm_map(mydocs, PlainTextDocument)
Once these preliminary steps are completed, your next general task is to create a DocumentTermMatrix object, which allows you to "quantify" the text in terms of word counts per document. Depending on what information you want, you may also need the transposed version (a TermDocumentMatrix, with terms as rows and documents as columns). This matrix will also serve as a basis for evaluating differences across your documents, if you like.
mydocsDTM <- DocumentTermMatrix(mydocs)
# If you want to evaluate differences among documents in
# terms of distance:
mydocsDTM.mat <- as.matrix(mydocsDTM)
# You can add back in rownames if you create an object
# listing the names in your text directory
names <- list.files("filelocation")
rownames(mydocsDTM.mat) <- names
In our case, the original articles in acq just came with numbers, which we can add back in accordingly:
rownames(mydocsDTM.mat) <- names(acq)
library(proxy)
##
## Attaching package: 'proxy'
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
# Note that to evaluate dissimilarity you will need to
# specify the method for calculating distance--check the
# documentation
dissim <- dist(mydocsDTM.mat, method = "cosine")
# You can also use this information to create a cluster plot
# with a built in plot function, again choosing a method. You
# could also add names to this manually with labels=names if
# you are not using built-in data
hcluster <- hclust(dissim, method = "ward.D")
plot(hcluster)
# You may also care about word frequency
freq <- colSums(mydocsDTM.mat)
ordered <- order(freq)
freq[head(ordered)]
## ability acceptance accompanied acquiring acres acting
## 1 1 1 1 1 1
freq2 <- sort(colSums(mydocsDTM.mat), decreasing = T)
head(freq2, 10)
## said dlrs pct mln company inc shares reut stock
## 186 100 70 65 63 53 52 50 46
## will
## 35
See the vignette linked above for how to graph word frequency using ggplot2. You can also restrict attention to only those words with frequency above a certain threshold.
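For example, tm's findFreqTerms lists the terms that appear at least as often as a lower bound you choose (25 here is arbitrary):

# Terms appearing at least 25 times across the corpus
findFreqTerms(mydocsDTM, lowfreq = 25)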
You may also be interested in the association between particular words, which is a beginning step toward topic analysis.
# Fill in words in the second argument, set a lower bound for
# correlation
findAssocs(mydocsDTM, "good", corlimit = 0.75)
## $good
## ability accompanied activity along analytical
## 1.00 1.00 1.00 1.00 1.00
## arena believes better big bilion
## 1.00 1.00 1.00 1.00 1.00
## book boost boosting calculated capitalization
## 1.00 1.00 1.00 1.00 1.00
## center command comments components constitute
## 1.00 1.00 1.00 1.00 1.00
## contrary contributing disagreed divestitures dividend
## 1.00 1.00 1.00 1.00 1.00
## elevating endeavors enormous exposure fact
## 1.00 1.00 1.00 1.00 1.00
## fetched fuel globally goes going
## 1.00 1.00 1.00 1.00 1.00
## heavily heavy helped investing jeffery
## 1.00 1.00 1.00 1.00 1.00
## kinds lesser lewis lift lipper
## 1.00 1.00 1.00 1.00 1.00
## mark might others outlined partially
## 1.00 1.00 1.00 1.00 1.00
## percentage performance perrin preclude premium
## 1.00 1.00 1.00 1.00 1.00
## probably questioned realistic reduced road
## 1.00 1.00 1.00 1.00 1.00
## rumor split suggests takes talk
## 1.00 1.00 1.00 1.00 1.00
## thereby theyre theyve travel true
## 1.00 1.00 1.00 1.00 1.00
## tuesday undervalued unhappy valuations variation
## 1.00 1.00 1.00 1.00 1.00
## volume water yesterday reflect market
## 1.00 1.00 1.00 0.95 0.91
## eckenfelder partial place strong operating
## 0.89 0.89 0.89 0.89 0.87
## spinoff higher make since profitable
## 0.85 0.81 0.81 0.81 0.75
If you're interested in more complex topic model designs, check out the topicmodels package. There are other packages like tm that you might want to explore as well, including qdap. See this vignette for more details.
Much of the syntax in the above functions will be familiar, even if the operations are not. The sort, head, and plot commands, as well as the optional arguments and the regular expressions component, weave in techniques that you have already learned. Yet as with many tasks in R, there is more than one way to perform text analysis. Many additional packages and functions build on tm, but another option is to approach text analysis through the language of dplyr using the tidytext package.
Here again we will use a built-in dataset, this time with 2,246 Associated Press articles from the topicmodels package, to walk through the major functions in the package. The same steps as above apply, however, if your data are not already in a corpus format.
library(topicmodels)
library(tidytext)  # provides tidy() methods for DocumentTermMatrix objects
data("AssociatedPress", package = "topicmodels")
AssociatedPress
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity : 99%
## Maximal term length: 18
## Weighting : term frequency (tf)
You can "clean up" these data in a single line of code. Because the AssociatedPress data are already in DocumentTermMatrix format, you don't need to separate the text out by word, but you could do that with unnest_tokens if your data are not structured this way (a brief sketch of unnest_tokens follows the output below). Note that unnest_tokens also allows you to change the n-gram for your analysis, that is, how many words or word segments are evaluated as a unit. For example, if your n-gram is 2, you are evaluating two-word phrases.
tidied <- tidy(AssociatedPress)
tidied
## # A tibble: 302,031 × 3
## document term count
## <int> <chr> <dbl>
## 1 1 adding 1
## 2 1 adult 2
## 3 1 ago 1
## 4 1 alcohol 1
## 5 1 allegedly 1
## 6 1 allen 1
## 7 1 apparently 2
## 8 1 appeared 1
## 9 1 arrested 1
## 10 1 assault 1
## # ... with 302,021 more rows
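To see what unnest_tokens does when your text is not already in a document-term matrix, here is a minimal sketch with a made-up two-document dataframe (the column names are hypothetical):

library(dplyr)
library(tidytext)
raw_text <- data.frame(document = 1:2,
    text = c("The quick brown fox", "jumped over the lazy dog"),
    stringsAsFactors = FALSE)
# One row per word, lowercased automatically
raw_text %>% unnest_tokens(word, text)
# One row per two-word phrase instead (n-gram of 2)
raw_text %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)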
Note that using grep and the other subsetting techniques we have explored, you can also narrow down which documents you want to look at within your broader set of texts. For example, if you only wanted to include articles that mention government actions or actors, you might select based on whether the document includes words like "government" or "president" before you tidy your text.
# This assumes your documents are stored in a dataframe with a
# column named "text"
selection <- "government|president"
final.docs <- mydocs[grepl(selection, mydocs$text), ]
You can perform similar tasks to the ones we did above with tm using dplyr syntax, such as removing stopwords, stemming, and obtaining word counts:
data("stop_words")
cleaned <- tidied %>% anti_join(stop_words, by = c(term = "word"))
cleaner <- cleaned %>% mutate(word = wordStem(term))
cleaner %>% count(word, sort = TRUE)
## # A tibble: 6,722 × 2
## word n
## <chr> <int>
## 1 report 1144
## 2 nation 1061
## 3 offici 935
## 4 time 878
## 5 peopl 872
## 6 presid 870
## 7 dai 833
## 8 govern 773
## 9 call 760
## 10 week 748
## # ... with 6,712 more rows
With these cleaned data, and with similar join syntax, we can also do other operations, such as sentiment analysis. You have several sentiment lexicons to choose from, but for this example we will simply use bing.
ap_sentiments <- cleaner %>% inner_join(get_sentiments("bing"),
by = c(term = "word"))
ap_sentiments
## # A tibble: 27,068 × 5
## document term count word sentiment
## <int> <chr> <dbl> <chr> <chr>
## 1 1689 bug 5 bug negative
## 2 2141 bug 1 bug negative
## 3 1489 disarray 1 disarrai negative
## 4 1507 disarray 1 disarrai negative
## 5 1631 disarray 1 disarrai negative
## 6 2013 disarray 1 disarrai negative
## 7 2101 disarray 2 disarrai negative
## 8 2219 disarray 1 disarrai negative
## 9 1287 uphold 1 uphold positive
## 10 1394 uphold 1 uphold positive
## # ... with 27,058 more rows
Check out this vignette for more on tidytext techniques, as well as this guide, featuring Jane Austen novels, for a more in-depth look at how to handle different corpus structures and for discussions of plotting sentiment.
Besides the divide in syntax between base R and the tidyverse, R can also be used in more advanced applications as a front end to other software and languages.
The rPython package allows you to run Python code directly through R. As is common with many adaptations of other languages or programs to R, its functions use R syntax to run code from the other language or program. This is useful if you need a particular piece of functionality from another program or language but are most comfortable or familiar with R's interface and syntax.
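A minimal sketch, assuming rPython is installed and configured with a working Python; the object names are made up:

library(rPython)
# Push an R object into Python, run some Python, pull the result back
python.assign("xs", 1:5)
python.exec("squares = [x ** 2 for x in xs]")
python.get("squares")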
RSelenium is another good example of this approach. If you are not familiar with Python syntax for scraping webpages, for example, or have a more complex scraping project that requires you to manipulate a website at the same time, you can invoke Selenium (a Java-based tool) through R to do the necessary tasks. Note that to do this, you may want to run a Docker container; instructions on how to do that are in the vignette linked above.
The Rcpp package allows you to leverage both R and C++ syntaxes and libraries. This can be particularly useful for optimizing computationally intensive model estimation or for creating your own packages. Read through the explanation in Advanced R for more.
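A toy sketch of the workflow: cppFunction compiles a snippet of C++ and exposes it to R (the function itself is just for illustration):

library(Rcpp)
cppFunction("
double sumSquares(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); i++) {
        total += x[i] * x[i];
    }
    return total;
}")
sumSquares(c(1, 2, 3))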
Several different tools exist for Bayesian analysis in R, many requiring external programs. BUGS/WinBUGS/OpenBUGS are a set of classic programs for several platforms, each with its own GUI. If you visit the BUGS site, you'll notice this warning:
“The programs are reasonably easy to use and come with a wide range of examples. There is, however, a need for caution. A knowledge of Bayesian statistics is assumed, including recognition of the potential importance of prior distributions, and MCMC is inherently less robust than analytic statistical methods. There is no in-built protection against misuse.”
While knowing your data and knowing your method are key in all types of statistical computing practice, it bears mentioning here that relying on packages or programs that automate elements of your analysis also requires you to recognize their functions, limitations, and underlying assumptions.
Note about NIMBLE: NIMBLE is another option for Bayesian analysis that builds on BUGS and allows you to write your own algorithms if you choose. Handle with care.
JAGS works much the same as BUGS and follows similar coding conventions, and at this stage it is probably the most commonly used of these options. If you are just getting started with JAGS, you will notice that it requires data in a list format, a full model specification, and instructions about the sampling process (burn-in, number of iterations, etc.).
A basic model will take the general form:
model{
# where n is number of observations in data
for(i in 1:n){
# each value of the dependent variable will be determined by some density function (e.g., dbern if it is bernoulli) with a given parameter (e.g., for shape or scale of that distribution)
y[i] ~ densityfn(param)
}
# parameter value of the distribution for y has a prior density of its own, with a parameter or set of parameters as well (e.g., dunif(0,1))
param ~ paramdensity(param2)
}
You can specify this model within R itself, or save it out as a .bug file that you later call with JAGS:
modelString = "model{
for(i in 1:n){
y[i] ~ densityfn(param)
}
param ~ paramdensity(param2)
}"
writeLines(modelString, con = "mymodel.bug")
You can then call this model directly from the .bug file (or a .jag file, if you saved it that way) in your working directory when running JAGS. Here I'm using the run.jags function from the runjags package. The arguments are effectively the same if you use rjags, as I do below.
post <- run.jags("mymodel.bug",
# what values am I interested in for results
monitor="param",
# data must be in the form of a list object: if you have binary or factor explanatory variables, you will want to use model.matrix and specify which variables are "contr.treatment"
data=list(y=y, n=length(y)),
# how large of a sample do you want? the bigger it is, the longer it will take, but complex models may take longer to converge
sample=10000,
# throw away the first several samples with poor mixing
burnin=1000,
# take only a subset of total samples: saves on memory and smooths results
thin=10,
# you can technically run more than one "chain" of this analysis or spread the sampling over cores (parallelization), but you don't need to
n.chains=1)
In terms of preparing your data, as stated above, if you have factor variables you will need to use model.matrix to expand your data matrix (that is, to create an indicator column for each level of your factors); factor variables are treated as "contrasts." For example, if you had a model predicting voting in terms of age, sex, and party, your data preparation might look like this:
y <- vote.variable
x.expand <- model.matrix(~age + sex + party, data = dataname,
contrasts = list(sex = "contr.treatment", party = "contr.treatment"))
# for modeling, you may need to remove the intercept column
x.expand <- x.expand[, -1]
Note that age is not a contrast because it is presumed to be continuous.
Using this setup for your data, suppose you want to estimate a logit (voted y/n); your JAGS model might look like:
model{
for (i in 1:n){
y[i] ~ dbern(p[i])
y.hat[i] <- b0 + b1*x1[i] + b2*x2[i] + b3*x3[i]
p[i] <- 1/(1+ exp(-y.hat[i]))
}
b0 ~ dnorm(0,1.0e-4)
b1 ~ dnorm(0,1.0e-4)
b2 ~ dnorm(0,1.0e-4)
b3 ~ dnorm(0,1.0e-4)
}
In this setup you have one b for each of your covariates, you loop over each observation in your independent variables, and p[i] allows you to estimate the logit. Choices of priors are up to you; these are all the same here, but note that you specify one for each coefficient.
Alternatively, you could collapse some of this with inprod. Here you will note I have switched to using jags.model and related functions from the rjags package. Either syntax or setup will work.
model{
for (i in 1:n){
y[i] ~ dbern(p[i])
y.hat[i] <- b0 + inprod(b[], x[i,])
p[i] <- 1/(1+ exp(-y.hat[i]))
}
b0 ~ dnorm(0,1.0e-2)
for(j in 1:k){
b[j] ~ dnorm(0,1.0e-4)
}
}
expanded.model <- jags.model("mymodel.bug", data = list(y = y, x = x.expand,
    k = ncol(x.expand), n = length(y)))
update(expanded.model, 10000)
expanded.model.j <- coda.samples(expanded.model, c("b0", "b", "p"), n.iter = 100000, thin = 10)
summary(expanded.model.j)
In this case, you have expanded a grid similar to what you did with model.matrix, so that the set of coefficients you have tracks the number of explanatory variables you include. Note that if you do this, you will want to loop over your b's to provide a prior for each, assuming each gets the same prior. Also note that in creating the second loop we have said "loop over each j element for values 1:k," meaning that you will now need to supply k (the number of explanatory variable columns in your model) when you provide your data list.
Finally, note that you could also skip the messy business with y.hat and use the logit() function:
model{
for (i in 1:n){
y[i] ~ dbern(p[i])
logit(p[i]) <- b0 + inprod(b[], x[i,])
}
b0 ~ dnorm(0,1.0e-2)
for(j in 1:k){
b[j] ~ dnorm(0,1.0e-4)
}
}
For more example code and some video lectures and slides, check out these resources:
This Github repo with rjags scripts, simulated data, and .bug files for various models
Besides using BUGS and JAGS, another option available to you is Stan, a project headed by Andrew Gelman. Although you can access it through R with rstan, it features syntax much closer to C++. It can be more efficient for particularly complex hierarchical models and includes the No-U-Turn Sampler (NUTS), an optimization of Hamiltonian Monte Carlo. Stan also has a great set of documentation, including a language manual, examples, and tutorials to help you get started and to compare syntax with BUGS, as well as a user community Google group to which you can post issues and questions.
As you begin to explore R more on your own, you will need to adapt to different coding styles and packages in order to do different types of analyses. Learning how to learn is also a key part of the educational process (and certainly a staple of graduate school). Below are some quick guides on how to teach yourself new techniques in R, with examples from survey and Bayesian analyses.
Suppose that as a new user of R (useR), you are confronted with a survey analysis project. Taking the tools you have already learned in this class, you feel comfortable reading in survey data and doing basic summary analysis and visualization. When it comes to more complex analyses, however, you need guidance on survey weights and how to handle them, because your results will depend on the weights you choose and the type of sample you have.
If you google "survey weights in R" or "survey analysis R", you will see that a package called survey has documentation in the first few hits. You will also notice a few posts on R-bloggers and similar sites that provide explainers related to the package, situated in a broader context about survey analysis.
Having done this initial search, you have several components of the information you need. First, you discover that the survey package can help you create a design object with svydesign() that you then use in later analysis. With this function, you can directly specify the name of the weights variable if it already exists in the data. You will also specify the stratification of your sample (or samples). A sketch appears below.
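A hedged sketch of that design step; every variable name here (wtvar, stratum, mysurveydata) is a placeholder for your own data:

library(survey)
# ids identifies sampling units (~1 if there is no clustering),
# strata gives the stratification variable, and weights points to an
# existing weights variable in the data
mydesign <- svydesign(ids = ~1, strata = ~stratum, weights = ~wtvar,
    data = mysurveydata)
# Basic analysis then uses survey's functions with the design object
svymean(~age, design = mydesign, na.rm = TRUE)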
Once you have created a design object, you will use the built-in functions in survey to do your basic analysis. These functions will take you far if you are manipulating data collected by other researchers. But what if the survey data are your own?
Here you may need to calculate weights yourself. To do this, you will need to use the rake() function in survey, which requires both sample and population margins.
Suppose you had voting data for individuals and want to compute weights by party (R, D, I) and income bracket (High, Medium, Low). Take note of the information that blog posts and the package documentation expect you to supply; you can learn this primarily by investigating the arguments of the functions and the type of information they take.
In this case, both of these variables of interest are factor variables with 3 levels each, and you will need relative frequencies for each level. Following the example in this R-bloggers post, you would create dataframes describing the population characteristics accordingly:
distr.party <- data.frame(party = c("R", "D", "I"), Freq = nrow(mydata) *
c(0.45, 0.5, 0.05))
distr.income <- data.frame(income = c("High", "Medium", "Low"),
Freq = nrow(mydata) * c(0.01, 0.6, 0.39))
Now you will use rake() to relate these values to your sample.
# mydata.nowt is the svydesign object created without weights
raked.leaves <- rake(design = mydata.nowt, sample.margins = list(~party,
    ~income), population.margins = list(distr.party, distr.income))
Unlike the implementation of packages in other programs (cough STATA), you may need to examine and adjust these weights to be within reasonable bounds. This again requires knowing your data.
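For example, survey's trimWeights function lets you cap extreme weights after raking; the bounds below are arbitrary and just for illustration:

# Constrain the raked weights to lie between 0.3 and 3
trimmed <- trimWeights(raked.leaves, lower = 0.3, upper = 3)
summary(weights(trimmed))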
Note that survey has its own syntax for estimating models using survey design objects, for example svyglm instead of the regular glm. Look at an example from the package documentation:
library(survey)
data(api)
dclus2 <- svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)
model0 <- svyglm(I(sch.wide == "Yes") ~ ell + meals + mobility,
design = dclus2, family = quasibinomial())
summary(model0)
##
## Call:
## svyglm(formula = I(sch.wide == "Yes") ~ ell + meals + mobility,
## design = dclus2, family = quasibinomial())
##
## Survey design:
## svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.82118 0.76726 1.070 0.2916
## ell -0.05462 0.02220 -2.461 0.0188 *
## meals 0.02873 0.01820 1.579 0.1232
## mobility 0.01460 0.03017 0.484 0.6313
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.9856512)
##
## Number of Fisher Scoring iterations: 4
You will notice that the syntax looks very similar to what you are familiar with from our discussion of glm: you specify a dependent variable ~ independent variables, the design object where previously you specified data, and the family of model you would like to estimate (in this case a quasibinomial).
You may not be familiar with the I() notation used to specify the dependent variable. Do not panic. Recall our previous discussion that you can modify variables in place within model commands in R, just as we did with log(y). If you search for "I() function in R" you will easily find documentation informing you that I() is an inhibiting function: it tells R to treat its contents "as is," so that inside a formula the expression (here, the logical comparison sch.wide == "Yes") is evaluated literally rather than interpreted as formula syntax. (The same function is what "protects" dataframes so that strings are not converted to factors, for example.)
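As a quick illustration with built-in data (not from the survey example), I() forces the expression inside a formula to be evaluated literally:

# Without I(), ^ would be read as formula syntax; with I(), we
# literally square the weight variable
summary(lm(mpg ~ wt + I(wt^2), data = mtcars))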
Now that you have some basic information about the relevant package and how it functions, you may have more specific questions that pertain to your data. Perhaps you need more information about cluster bootstrapping for resampling. These are questions you can take to Google again, but you might also benefit from taking them directly to specialty sites like Stack Overflow. In doing this, you might find the page of Thomas Lumley (the survey package author), and you will likely find his slides covering more complex applications of the package than the R-bloggers posts do. You will also discover that Lumley wrote a book, "Complex Surveys: A Guide to Analysis Using R," that you could reference. [Note: Because R is open-source, be wary of paying large sums of money for books on its functions or packages; many resources exist for free!]
Now that you have completed basic analysis and found a few preliminary answers to your initial tasks and questions, iterate the search process. Try sample code from one site, use examples from a vignette related to a package, and google any errors you receive and read about other users who have had those issues on Stack Overflow. At this point, if not sooner, you may go beyond the lonely internet life to reach out to friends, colleagues, and professors who may have done similar work.
If no one is available locally to help, and answers on the internet are sparse, but you see a technique referenced in a paper, you may also be able to directly find code for your problem on an author's personal webpage or Github repository. You can also perform a general search of Github repos to look for code directly (try searching cluster robust standard errors).
Step 4 won’t be for everyone, but in a pinch you can often write your own functions to solve problems that are not well-managed by existing packages. Some examples might include things like cluster robust standard error calculations for complex, rare model types; uncommon types of cross-validation; particular estimators or samplers. For many of these, the internet will still be your best friend as you are (often) converting math to code. Many times, investing in something of your own design is not necessary because current packages have adequate functionality, but if you find yourself doing the same type of analysis routinely and needing something beyond the prefab, coding your own solution can often be efficient in the long-run. This is how packages are introduced in the first place.
The survey example seems relatively straightforward: identify the relevant package for your needs and use internet resources to help you implement its functions. But what if there are multiple such packages: how do you choose?
In addition to complementary programs like JAGS and Stan that can be used with an R interface, there are several packages designed to do Bayesian estimation without needing to fully specify a model mathematically. If you are familiar with Bayes, you can imagine why this creates more need than ever to truly understand what kinds of assumptions a package operates under and what goes on under the hood.
Here again you may begin by simply googling "Bayesian packages R" or a specific task in Bayesian analysis you want to complete. If you are already aware that multiple packages exist, however (or become aware, and overwhelmed, quickly by googling), another option is to check out the CRAN Task View. The Task Views do not cover many categories, but for each included category they provide an overview of common packages and their relative advantages. In the Task View for Bayes, you will see a host of packages specifically designed for Bayesian analysis, as well as a series of packages that include functions for particular modeling challenges. If you have a specific task in mind, you can easily search this page for mentions; otherwise, everything is in alphabetical order. Often, if you are looking to do Bayesian analysis in a canned routine through one of these packages, your problem is already well-defined by the type of model you are trying to estimate or the type of data you have at your disposal; these will also help you narrow the set of possible candidates.
While you should always prioritize the package that has functions most suited to your needs, having a mental catalog of general functions that cover your area of interest will help you to identify possible places to start in the future as new analysis tasks arrive. This section will highlight just a few useful packages to keep in mind.
The arm package is a general package useful for both MLE and Bayes, with code that accompanies Gelman and Hill (2007). Specifically relevant to Bayes, arm includes the bayesglm() function. When investigating this function, take note of its default behaviors. For example, prior means default to 0 if not set manually. If you specify only one prior mean but have several predictors (independent variables excluding the intercept), that single prior mean is assumed to apply to all predictors; otherwise, you need a vector of prior means of length equal to the number of predictors.
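A hedged sketch of those defaults in action, reusing the hypothetical voting variables from earlier (vote, age, sex, party):

library(arm)
# With several predictors, a single prior.mean (and prior.scale) is
# applied to all of them; supply vectors to vary them by predictor
m1 <- bayesglm(vote ~ age + sex + party, data = dataname,
    family = binomial(link = "logit"),
    prior.mean = 0, prior.scale = 2.5)
display(m1)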
MCMCpack contains a number of model types and postestimation analysis functions for Bayes. Andrew Martin and Kevin Quinn provide a set of slides explaining its functions and its relative advantages over something like BUGS. MCMCpack uses coda as a basis for posterior analysis, so you will need to consult extra documentation to understand its full range of functionality. It does include built-in functions for calculating Bayes factors and posterior model probabilities. Also note that while functions like MCMCregress do not require you to specify priors in quite the same way as, say, JAGS, which may feel more comfortable, they do require you to think about how you want to calculate the marginal likelihood (using the argument marginal.likelihood=) if you want to deviate from the default.
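For instance, a minimal MCMCregress call (the formula and data are hypothetical), showing where the marginal.likelihood argument fits:

library(MCMCpack)
post1 <- MCMCregress(y ~ x1 + x2, data = dataname,
    burnin = 1000, mcmc = 10000,
    marginal.likelihood = "Laplace")
# The result is a coda mcmc object, so coda's summary and plot
# methods apply
summary(post1)
plot(post1)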
MCMCglmm includes both standard glm’s estimated with MCMC as well as some rarer types that you might be interested in (e.g., zero-inflated and multi-trait models). While similar to lme4 in several ways, MCMCglmm also handles situations where packages like lme4 fail to converge.
lme4 is another general package for glm's, in particular emphasizing multilevel models, and several Bayes-specific packages require it as a dependency or extend its functions. This vignette gives a good overview of the differences in lme4's syntax and approach relative to base glm specifications.
In addition to useful functions for hurdle and zero-inflated models in MLE, pscl also includes Bayesian IRT.
MNP is specifically designed for the implementation of multinomial probit models.
Bolstad provides the bayes.lm function for fitting Bayesian linear models.
blme contains the functions blmer and bglmer for fitting linear or generalized linear mixed-effects models. The syntax here is much the same as you would find in lm, lmer, or glmer, but the functions in blme allow you to conduct Bayesian or penalized MLE analyses.
brms is a newer package building on Stan that uses syntax similar to lme4. It includes less common models, such as zero-inflated or survival models. It also allows you to do posterior prediction and cross-validation.
In the coming weeks, you will learn how to carry out many of the same techniques we have discussed in R using STATA. Learning both will give you the option to select whichever is most comfortable and convenient going forward, but note that their functionality may differ on some dimensions. Some methods or routines have more developed packages for STATA than for R and vice versa, often as a function of departmental conventions (where, e.g., sociologists may be more apt to use STATA). If you are comfortable with the syntax of both and discover that there are not many options for the analysis you want to conduct in R, you may also want to explore STATA code for similar functions. For example, code for survey analysis and for time series analysis is very well developed in STATA, whereas the associated R packages may require you to do more of your own calculations or much more data transformation than you would prefer.