Introduction to R

Welcome to the R segment of PS 553. We will spend the next several weeks learning basic coding techniques, data management, and analysis processes in R.

There are several programs to choose from when doing statistical analyses in the social sciences, including Stata, SAS, SPSS, and Matlab, but R has several distinct advantages that make it worth learning. First and foremost, R is free, and it is designed specifically for conducting statistical analyses and managing data, unlike more general-purpose languages like Python. The R project is open source, which allows for contributions from scholars and coders around the world and makes programming your own routines easier. R is also known for excellent graphics capabilities that you can use in your own data visualizations.

These characteristics set R apart from another strong contender in statistical computing: Stata. Stata is more expensive, but it has a much less steep learning curve. Its use is more common in fields like economics and sociology, but its routines are more difficult to modify and make your own. In this class, you will learn standard techniques in both R and Stata so that you are fluent in both and can choose for yourself which you find most useful.

Why Use R?

Besides being free and open source, R is a great resource for conducting social science research and manipulating data. Programming ability in general is a necessary skill for conducting quantitative research, but learning R in particular can be useful for completing coursework, collaborating with other researchers, and creating documented, reproducible research products. Automating your analyses will save you time and energy, and the documentation that scripting encourages supports the scientific goal of reproducible research that others can build on. Programming in a clean, efficient way is also the key to conducting precise analyses and correcting mistakes. While R requires some upfront investment to learn, the process of learning a programming language can itself aid later efforts to pick up additional programming skills and techniques. In particular, R is an object-oriented programming language, and it relies on this architecture much more than other statistical programs. This makes R similar to widely used programming languages like C++, Java, Ruby, and Python, and it will let you transfer your skills later if your methodological interests develop. Much like learning French is more of an investment than staying with English, its similarities with Spanish or Italian may mean that the investment pays additional dividends down the road. [For the nerds among you, object-oriented programming is a newer paradigm that can be contrasted with the procedural programming emblematic of C and Fortran.]

For your purposes in graduate school and in conducting research, R is going to help you manipulate, analyze, and visualize data. All of these are key components of quantitative research and of engaging with scholarship in the social sciences. This course emphasizes learning both R and Stata because personal preferences differ and because some projects may be easier to conduct in one program than the other. For example, time series methods are more routine and pre-packaged in Stata than in R, but more complicated, modular modeling structures are easier to construct in R than in Stata.

Downloading R and Getting Started

In order to get started, you first have to download R. To do this, you will need to select a “mirror” on CRAN—feel free to pick any one you like, but it typically makes sense to pick one in your same country or one close to you. This is just the online “location” from which you will download R. R is supported on Windows, Mac, and Linux. With this base download, you will be able to open an R “console” into which you can type commands.

Once you have R downloaded, you may also choose to download the popular R IDE (integrated development environment), RStudio. This program will allow you to have the console, your script file (more on this later), and help panes within the same interface. It also helpfully includes some menus and buttons if you prefer to have a gentler introduction to using statistical software for the first time, particularly if you are coming to R from a program like SPSS. Even for more advanced users, RStudio can be useful in giving you a platform for using knitr, rmarkdown, and shiny.

R’s base functionality is decent, but for your own research you will need to use “packages” that are developed by fellow users. These are also (usually) on CRAN. Later we will talk about how to identify which packages you need and how to load them.

Resources

The internet is truly your best friend as you start to learn R and troubleshoot issues with your code. While there are great resources built into R to give you help with particular functions or packages, the open-source nature of R means that a large online community exists to address your questions as well. As you are learning, R's built-in help files (covered below), the documentation on CRAN, and community sites like Stack Overflow are all resources you may want to consult.

R Script Files

Because your code is so critical to allowing other researchers to replicate your analysis, creating a document that records your code is key. While you can simply type commands into the R console and run your analyses that way, that method means you will not have a record of your work. A “script file” in R will allow you to document the steps you took to conduct your analyses. A script file is effectively just a plain text document containing your code. To create these files, you can use the script file interface built into R or RStudio, or you can use an external text editor, which often has nicer features such as syntax highlighting. You do not need to pay for a text editing program in order to get great features. Some free options I recommend (in no particular order) are TextMate, Notepad++, and Sublime Text. You can also use very powerful but more complex environments like Emacs (with Emacs Speaks Statistics, ESS) or Vim. All of these programs have functionality that makes them work well with your R console, but you do not need anything fancy. You can simply use a built-in notepad program if you prefer.

As you create your script files, here are a few guidelines to follow:

  1. Plan ahead
    • Outline your project
    • Write pseudocode
    • Search online for prior work
    • Use code you wrote already
  2. Identify tools/packages you need

  3. Make code modular

  4. Comment your code

  5. Save your work!

Procedurally, your R script file will typically contain the following elements:

  • Title, author, date
  • Working directory
  • Packages
  • Data inputs
  • Analysis
  • Outputs (graphics, figures, tables)

Commenting your code

Commenting your code is what will allow future you and future researchers to know how and why you did what you did.

In R, you can write comments directly into your script file using #. With syntax highlighting enabled, you will be able to easily see comments relating to particular blocks of code. Anything on a line that follows a # will not be executed by the program.

# I am using R as a calculator.
2+2
## [1] 4

You do not need to comment every line of code you write, but commenting can provide an explanation for why you are doing what you are doing. As a general rule, when you comment your code you should describe what you are doing, but in a way that extends beyond what is already obvious from the code itself. For example:

# Calculating 2+2
2+2
## [1] 4
# Calculating how many cookies I will eat today, morning and night
2+2
## [1] 4

Commenting can also be used to “toggle” versions of code that you are working on. For example, if you are testing different ways of graphing your data, you might write more than one version of the code, and while you settle on one version for one paper or assignment, you may want to keep the other versions you wrote for later use as well.
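
For example (a quick hypothetical sketch, using a made-up variable x), you might keep one plotting approach active and leave an alternative commented out for later:

# Hypothetical example data
x <- rnorm(100)

# Version currently in use: histogram
hist(x)

# Alternative version, kept for later use:
# plot(density(x))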

Even more fundamentally, you can use commenting to provide basic information as you begin a new script file.

# Script file to do my final project
# Author: Me
# Date: Today

Pseudocode

As you learn R, a useful way to approach writing your own code is to start with “pseudocode.” Pseudocode uses regular language plus the syntax or structure of a programming language to help you think about and outline the steps of your analysis. You might use commands or function names from R but not bother with symbols like {, [, or ,.

As you write pseudocode, discipline yourself to be consistent in the words or patterns you use. Much like learning a foreign language, at first you will write down what you want to say in “English,” then translate each word or line into the target language. Eventually you will feel comfortable enough to write directly in the syntax used by R. For example, you might write:

Start function
Input information
Logical test: if TRUE
  (what to do if TRUE)
else
  (what to do if FALSE)
End function
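
Translated into R syntax, that outline might become something like the following (a minimal sketch with a made-up function name and input):

# Start function: takes one input, x
my.test <- function(x){
  # Logical test
  if(x > 0){
    "positive"      # what to do if TRUE
  } else {
    "not positive"  # what to do if FALSE
  }
}
my.test(5)
## [1] "positive"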

Saving your Work

When conducting research, keeping all of your code, data, and files in the same place is useful. Many journals now require that you make your data and code publicly available. Now is the time to invest in good file structures and in syncing and version-control tools (e.g., Dropbox and GitHub).

Not only should you save your script file often to prevent loss of your work, you may also want to save your workspace in R in order to save time. If you do this, you will be able to load your console in the future as though you had already completed all of the operations that you ran from your script file. You can do this in the File menu in both R console and RStudio, or use save.image(). While in general this is not necessary, it can save you a lot of time if you did particularly tedious simulations or large-scale data analyses.

# Save an .RData file of your workspace (to your working directory)
save.image("mywork.RData")

# Load your old workspace
load("mywork.RData")

You may also want to “save out” modified datasets or graphics as you go through your work. Never save over your original dataset.

In thinking about your file structure and organizing your work, consider creating a new folder for each new “project.” This can serve as your “working directory” in R. Your working directory is the place on your computer (or the cloud) that R will look to input data and output files or graphics. You should be sure to set a working directory at the beginning of each session by providing R with a filepath that it will use to locate your project folder. Note that how you set the working directory will differ depending on your operating system.

# Check current working directory
getwd()

# Mac/Unix
setwd("/Users/me/Dropbox/myproject")

# Windows
setwd("C:/Me/Dropbox/myproject")

Getting Started

Loading Packages

Now that you have R and a script file system up and running, you are ready to start your analysis. First you will want to load the packages that you need in order to do the operations you want.

The first time you use a package, you will need to install it from CRAN. This may require you to select a mirror, as you did when you first downloaded R.

install.packages("MASS")
install.packages("foreign")

Once you have installed a package, you need to load it with library() in each session or script file where you want to use it.

library(MASS)
library(foreign)

Note that if you update your version of R, you will often have to re-install packages as well.

R comes with several packages pre-installed, and you can check at any time which packages are installed on your system.

library()

Also note that sometimes packages depend on each other. If you load one package, it may also load “dependencies” that allow it to function. By the same token, several packages might have functions using the same name. This means that whichever package you load last will “mask” that function from other packages. R will notify you that it has masked a function so that you know which package’s function is currently in use. If you want to use two functions of the same name in different packages that you’ve loaded, you can do that by using packagename::functionname.
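
For example (using MASS::ginv() purely as an illustration), the :: syntax lets you state explicitly which package’s function you want:

library(MASS)

# Even if another loaded package defined its own ginv(), this call
# unambiguously uses the version from MASS
MASS::ginv(matrix(1:4, nrow=2))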

Getting Help

R can seem complex, but it also provides many built-in help functions, particularly when you are using new packages.

To get help with a specific function, you can enter:

help(functionname)
# or
?functionname

For an overview of an entire package, use help(package = "packagename").

If you do not know the exact name of the function you need, or which package to look in, you can instead search the documentation:

??searchterm

Google and Stack Overflow are also excellent resources in this regard.

Creating objects in R

Data can be entered directly into R or loaded from external sources. To get a sense of how R handles different data inputs, we will begin with entering data on our own.

First note that, as above, R can be used as a very fancy calculator without creating any “objects” at all.

10-9
## [1] 1
8^2
## [1] 64
sqrt(96)
## [1] 9.797959
log(2)
## [1] 0.6931472

As our analyses increase in complexity, however, we may want to store values from particular operations into “objects,” which we can then call later for additional analysis.

A <- 1+2
B <- log(100, base=10)
C <- 15
D <- (A + B)/C

Note that in order to store values to objects, we use the <- syntax, which in English roughly means “A gets 1+2.” There are many operators such as <- that you will use for math, assignment, and selection in R. We will cover most of the important ones as you need them, but if you ever want a sort of “dictionary” to various options, CRAN hosts a great set of documentation on R Syntax.
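
For instance, here are a couple of comparison and arithmetic operators applied to the objects created above (a brief illustrative sketch):

# Logical comparisons
A > B
## [1] TRUE
A == 3   # note the double equals sign when testing equality
## [1] TRUE
# Arithmetic operators can be combined freely
(A * C) - B
## [1] 43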

The type of data we enter can differ, however. As Hadley Wickham describes in greater detail in Advanced R, we can think about the different data types that R handles in terms of their dimensionality. Single dimension data types will be (atomic) vectors or lists, whereas two-dimensional types will be matrices or dataframes. This generalizes to n dimensions with arrays. Within each of these dimensionalities, data may still be of different types. Data can be numeric, character (string), factor, etc. Knowing what type of data you are using and how it is organized is critical to using functions and conducting your analysis.

We can input different types and dimensions of data directly into R.

scalar <- 10
vector <- c(1,2,3)
friends <- c("Amy", "Becky", "Cassandra")

vector.1 <- seq(0,10, length=5)
vector.1
## [1]  0.0  2.5  5.0  7.5 10.0
vector.2 <- seq(0,10, by=1)
vector.2
##  [1]  0  1  2  3  4  5  6  7  8  9 10
vector.3 <- rep("me", 5)
vector.3
## [1] "me" "me" "me" "me" "me"
vector
## [1] 1 2 3
friends
## [1] "Amy"       "Becky"     "Cassandra"
# Be careful of case!
Vector
## Error in eval(expr, envir, enclos): object 'Vector' not found
matrix.1 <- matrix(vector, nrow=3, ncol=2, byrow=F)
matrix.1
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
# Argument order matters: first is what fills the matrix, second is the number of rows, third is the number of columns
matrix.2 <- matrix(1, 3, 2)
matrix.2
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1
## [3,]    1    1
matrix.3 <- diag(5)
matrix.3
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
## [3,]    0    0    1    0    0
## [4,]    0    0    0    1    0
## [5,]    0    0    0    0    1
# Creating a matrix from vectors
r.1 <- c(1, 2, 3)
r.2 <- c(4, 5, 6)

new.matrix <- rbind(r.1, r.2)
new.matrix
##     [,1] [,2] [,3]
## r.1    1    2    3
## r.2    4    5    6
c.1 <- c(1, 2)
c.2 <- c(3, 4)

newer.matrix <- cbind(c.1, c.2)
newer.matrix
##      c.1 c.2
## [1,]   1   3
## [2,]   2   4
# This generalizes to arrays
new.array <- array(c(1,2,3), dim=c(3,1))
new.array
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3

Having created all of these objects to store our data, we can also direct R to specific elements of larger objects.

friends[3]
## [1] "Cassandra"
# Report all elements excluding element 2
friends[-2]
## [1] "Amy"       "Cassandra"
# Row 2 of matrix.2
matrix.2[2,]
## [1] 1 1
# Element in the second row, first column of new.matrix
new.matrix[2,1]
## r.2 
##   4
# Determine the length of a vector
length(vector.1)
## [1] 5
# Determine dimensions of a matrix
dim(matrix.2)
## [1] 3 2

If you feel at any point that your workspace is too cluttered with objects, you can remove them selectively or in total:

# Remove three matrices
rm(matrix.1, matrix.2, matrix.3)
matrix.1
## Error in eval(expr, envir, enclos): object 'matrix.1' not found
# What still exists in our workspace?
ls()
##  [1] "A"            "B"            "C"            "c.1"         
##  [5] "c.2"          "D"            "friends"      "new.array"   
##  [9] "new.matrix"   "newer.matrix" "r.1"          "r.2"         
## [13] "scalar"       "vector"       "vector.1"     "vector.2"    
## [17] "vector.3"
# Remove everything in the workspace
rm(list=ls())

Sometimes the format in which we entered data is not compatible with a function or package we want to use for analysis. Dealing with this requires first knowing the class of the objects we have created (the type of data they contain and how those data are structured) and then knowing what form we need to transform them into. Transforming one type into another is known as coercion.

You can check the current type/class of objects in R before transforming them to meet your needs.

my.matrix <- matrix(c(1,2,3,4,5,6), nrow=3, ncol=2, byrow=F)
my.matrix
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
class(my.matrix)
## [1] "matrix"
new.df <- as.data.frame(my.matrix)
class(new.df)
## [1] "data.frame"
my.new.matrix <- as.matrix(new.df)
class(my.new.matrix)
## [1] "matrix"

Note here that we are transforming a data structure we have already explored—a matrix—into a dataframe. Dataframes will be the most standard way of storing your data in R. As we did above in constructing matrices, you can use cbind() and rbind() to construct and combine dataframes.
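
For example, here is a small hypothetical dataframe constructed by hand with data.frame():

# Unlike a matrix, a dataframe can mix variable types (here, character and numeric)
grades <- data.frame(student = c("Amy", "Becky", "Cassandra"),
                     score = c(90, 85, 98))
grades$score
## [1] 90 85 98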

Data may be of differing types—logical, numeric, integer, double, character—that you may need to transform as well. In general, these transformations take the form as.type(), for example as.numeric() or as.character().

logical <- c(TRUE, FALSE, TRUE)
num <- as.numeric(logical)
num
## [1] 1 0 1

Note that this will be particularly important for altering data structures like date formats and for creating factor variables for regression models in the coming weeks.
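
As a preview (with made-up values), those coercions follow the same as.type() pattern:

# Character to date (the ISO year-month-day format is recognized by default)
as.Date("2016-01-15")
## [1] "2016-01-15"
# Character to factor
ideology <- as.factor(c("left", "right", "left"))
levels(ideology)
## [1] "left"  "right"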

Basic functions and operations

Mastery of conditional statements, for loops, and functions will give you a foundation for manipulating your data and for tackling more complex data cleaning and modeling challenges.

Conditional statements

Conditional statements rely on a logical test to perform an operation. For simple conditional statements, you can use ifelse().

big <- 20
small <- 5

ifelse(big > 10, "Yes", "No")
## [1] "Yes"
ifelse(small > 10, "Yes", "No")
## [1] "No"
# You can also create more complex if statements

if(big == 15){
  tiny <- 2
} else {
  tiny <- 3
}

tiny
## [1] 3

Loops

Loops help us to apply an operation or operations to a series of values or variables.

new.vector <- NULL
for(i in 1:5){
  new.vector[i] <- i
}
new.vector
## [1] 1 2 3 4 5
for(i in 2:3){
  new.vector[i] <- 0
}
new.vector
## [1] 1 0 0 4 5

Note that operations conducted in loops can often be done efficiently with apply() functions.
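
For instance, sapply() applies a function to each element of a vector and returns the results, reproducing the first loop above (a minimal sketch):

# Equivalent to the first loop above
sapply(1:5, function(i) i)
## [1] 1 2 3 4 5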

Functions

You can also use R to write your own functions. These examples may seem simple now, but could save you a lot of time in transforming variables later.

# This function will take two inputs, x and y
my.function <- function(x,y){
  5*x+y
}
my.function(2,1)
## [1] 11

Data Management in R

Loading Data

Now that you have seen how R organizes data that you input yourself, you are ready to see how data can be read into R, either from pre-constructed datasets internal to R or from external sources.

If the dataset is internal to R, you can load it with data(datasetname), which creates an object of that name in your workspace. For example:

data(trees)

While not suitable for doing your own research projects, the datasets that exist within R can be useful tools for learning new coding techniques. You can see what other datasets exist in R using data().

For datasets not built into R, you will often use the foreign package (loaded with library(foreign)) to read them in. This package contains functions for several formats, including Stata .dta files and SPSS .sav files; plain-text formats such as .csv can be read with base R functions like read.csv() or read.table() (where you may need to specify the delimiter and whether the file has a header row).

For example, suppose that you have downloaded the latest Quality of Governance dataset as a Stata .dta file. Once you have loaded the foreign package and set your working directory, you can read in these data with read.dta.

qog <- read.dta("qog_std_cs_jan16.dta")

If instead you had downloaded the .csv version of these data, you might enter:

qog <- read.csv("qog_std_cs_jan16.csv", header=T)

There are options to support a variety of data formats, but some options provide more support than others. For example, the function read.spss() was created over a decade ago, and has not provided support for format updates as SPSS has evolved. This means you may receive warnings when reading in .sav files created in newer versions of SPSS. Furthermore, SPSS is proprietary software and the functions to read these files rely on what can be gleaned from the open-source sister program, PSPP. Take this as an object lesson in the value of open-source. Note, though, that you are receiving warnings but not errors. Warnings should not interfere with your analysis, but they certainly are annoying. One option to attempt to circumvent this is to re-save your data in .csv format. To do this, you can download PSPP (for free), and use it to open your .sav file. Then follow these instructions to save a new .csv version of your data.

Once you have your data saved to an object, you can move on to editing and cleaning the data. In general you can do this by referencing variables within the data, but some people prefer to attach data to the search path. For example, to work with the “trust” variable from the World Values Survey that is contained within QoG, you could either:

# Print values for WVS trust variable
qog$wvs_trust

# or

# Attach the data and reference variables directly
attach(qog)
wvs_trust

Be cautious if you choose to attach data. Attaching can make it challenging to keep track of objects in your workspace, and if you make edits to the data after attaching it, those will not always be reflected in the attached version.
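
If you do attach a dataset, detach it when you are finished so that later code behaves predictably:

# Remove the data from the search path
detach(qog)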

Data cleaning, merging, and appending

Before you edit your data, you should examine it. Two functions you will use often are summary(), which summarizes your data variable by variable, and names(), which lists the variable names in your dataset.

summary(qog)
names(qog)

Creating variables

In the same way that $ lets us reference an existing variable in a dataset, it also allows us to create a new one. In general you will use the syntax

dataset$new.variable <- values

For example, you may want to transform or rescale an existing variable—you do not want to write over the original, but rather want to add a new, rescaled version:

qog$wvs_trust_rsc <- qog$wvs_trust/2

Dropping and keeping variables

You can use similar logic to begin to subset your data to include only variables that are relevant to your research question.

# Select variables to keep
selected <- c("ccode", "cname", "ajr_settmort")

# Subset your data
qog.subset <- qog[selected]

# Select variables to drop
dropped <- names(qog) %in% c("wr_nonautocracy", "wr_regtype")

qog.subset.2 <- qog[!dropped]

Note that you can also subset data using subset().

qog.subset.3 <- subset(qog, wvs_trust==max(wvs_trust, na.rm=T))

Missing values

R handles missing values differently than some other programs, including Stata. Missing values will appear as NA (whereas Stata displays them as . and treats them as larger than any other number). Note, though, that NA is not a string; it is a special symbol. If NA values enter logical tests or calculations, the result will often be NA itself rather than TRUE or FALSE, which can lead to unexpected behavior.

You have several options for dealing with NA values (a short sketch follows this list).

  • na.omit() or na.exclude() will drop rows containing missing values from your dataset
  • na.fail() will return an object only if it contains no missing values, and will throw an error otherwise
  • na.action= is a common option in many functions, for example in a linear model where you might write model <- lm(y~x, data = data, na.action = na.omit).
  • is.na() allows you to logically test for NA values, for example when subsetting
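
Here is a brief sketch of these options using a small made-up vector:

x <- c(1, NA, 3, NA, 5)

# Logical test for missingness
is.na(x)
## [1] FALSE  TRUE FALSE  TRUE FALSE

# Drop missing values by subsetting with is.na()
x[!is.na(x)]
## [1] 1 3 5

# Many functions also accept an na.rm argument
mean(x, na.rm=TRUE)
## [1] 3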

Merging and combining data

You may want to draw on data from multiple sources or at differing levels of aggregation. To do this, you first must know what your data look like and know what format you ultimately want them to take. Whatever you do, do not attempt to merge datasets by hand—this leads to mistakes and frustration.

There are two broad categories of combining datasets:

  1. Appending
    • Adding new rows or observations
    • New data may or may not have the same variables/columns as old data
  2. Merging
    • Adding variables/columns
    • New variables may match current observations or not (level of aggregation)

We already discussed a simple way to append data in R: the rbind() command. Note that to do this, the datasets must have the same variables in the same format, or you must do this cleaning prior to using rbind(). In particular, be careful if a variable is stored as a date in one dataset but as a string in the other. The variables do not need to be arranged in the same order across datasets, however.
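
For example (a hypothetical sketch with two small dataframes), rbind() matches dataframe columns by name:

old.data <- data.frame(id = 1:2, turnout = c(55, 60))
# Same variables, different column order
new.rows <- data.frame(turnout = c(65, 70), id = 3:4)

# Columns are matched by name when appending
appended <- rbind(old.data, new.rows)
dim(appended)
## [1] 4 2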

Merging requires more attention to how the datasets will be matched using identifier variables. Often these identifier variables are clear based on the structure of the data and your research question, but complications can still arise. For example, if I wanted to merge the QoG dataset with a World Bank dataset and align the two sets of data by country, I might find that one lists North Korea as “North Korea” and another as “Korea, North” or “Korea, Democratic People’s Republic.” These differences can make merging tricky.

In general, however, you will be able to use the merge() command to combine datasets.

new.data <- merge(dataset.1, dataset.2, by="IDvariable")

new.data.2 <- merge(dataset.1, dataset.2, by=c("IDvariable", "year"))

These operations default to dropping observations that are not matched in one dataset. Optionally, you can add all=TRUE to keep all of the data, or all.x=TRUE or all.y=TRUE to keep observations only in one of the two datasets.
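
For instance, using the same placeholder names as above, keeping every row of the first dataset even when it has no match in the second would look like:

new.data.3 <- merge(dataset.1, dataset.2, by="IDvariable", all.x=TRUE)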

Returning to the North Korea example above: merge() can match on identifier variables even when they have different names in the two datasets (via the by.x and by.y arguments), but if multiple matches exist, each match will be given its own row in the new dataset. Be careful, because R’s merging is not idiot-proof: if your identifying variables contain duplicates or are not actually unique, merge() will not throw an error, but it will make a mess. You may also run into trouble if your identifying variable is stored in different formats across the datasets (e.g., string in one and date in the other).

Reshaping data

In addressing your research question, an important consideration will be unit-of-analysis. You may have hierarchical data (observations belong in groups or classes, such as cities in states or individuals in households) or panel/longitudinal data (unit of analysis over time, such as financial dividends by year). With hierarchical data, you can choose a unit of analysis that is appropriate to your question. With panel data, you may choose to focus on cross-sectional variation (differences across units) or within-unit variation (differences over time in the same subject). How you want to address your question and at what unit of analysis will impact how you want to structure your data.

In particular, this will influence whether you want to store your data in wide or long format.

  • Wide format: each unit or grouping (e.g., each individual in panel data) is its own row, and within-unit variation (such as repeated measurements over time) is spread across separate variables

  • Long format: each individual observation (e.g., each person-year) is its own row, so more aggregated units occupy multiple rows

Wide format:

idvar  income2008  income2009  income2010
    1         100          99         101
    2          87          88          89
    3          94         110          80

Long format:

idvar  year  income
    1  2008     100
    1  2009      99
    1  2010     101
    2  2008      87
    2  2009      88
    2  2010      89
    3  2008      94
    3  2009     110
    3  2010      80

R will allow us to switch between data of these differing formats.

long.format <- reshape(wide.format, 
                 # time-varying variable names in the wide data
                 varying=c("income2008", "income2009", "income2010"),
                 # new var name for long data
                 v.names="income",
                 # name for year in long data
                 timevar="year", 
                 # possible time values
                 times=c(2008, 2009, 2010), 
                 # unit-year observations
                 new.row.names=1:1000,
                 # direction to reshape
                 direction="long"
                       )

wide.format <- reshape(long.format,
                       # time variable
                       timevar="year",
                       # variables not to change
                       idvar="idvar", 
                       # unit-year observations
                       new.row.names=1:1000,
                       # direction of reshape
                       direction = "wide"
                       )

If this seems cumbersome it’s because it is, but there are some alternatives that are slightly more automated. Check out the melt() function (from the reshape2 package) for more reshaping fun.
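
As a sketch (assuming the reshape2 package is installed and using the wide data from above), melt() needs little more than the identifier variable:

library(reshape2)

# Stack the income columns into a single value column
long.format.2 <- melt(wide.format,
                      id.vars = "idvar",
                      variable.name = "year",
                      value.name = "income")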

Collapsing data

Sometimes you will have data that need to be aggregated to conduct a higher-level analysis. Typically this means you are interested in describing effects related to a group-level variable (e.g., do PhD students who go to the best programs land the best jobs?). This may also arise when your data span different levels of analysis, as in the case where you have precinct-level voter turnout data but county-level demographics.

This kind of “collapsing” of data can be implemented with the doBy package.

library(doBy)

# Create a new dataframe with the mean of the inflation and homicide variables, collapsed to the region level
collapse <- summaryBy(wdi_inflation + wdi_homicide ~ ht_region, FUN=mean, data=qog, na.rm=T)
collapse

Saving transformed data

Once you have transformed your data and edited variables, you may want to save your new dataframe as an external data file that your collaborators or other researchers can use. As with reading in data, you will need the foreign package to export to formats like Stata or SPSS; .csv and other plain-text formats can be written with base R.

Most commonly you will want to save out data as a .csv or a tab-delimited text file.

# Generic exporting uses write.table(). Specify the dataframe or object you
# want to export, the file path to write to (a full path if it is not in
# your working directory), and the separator (here, tab-delimited)
write.table(new.data, "c:/newdata.txt", sep="\t")

# Write to .csv
write.csv(new.data, "newdata.csv")

You can use similar syntax, however, to write out Stata or SPSS files.

# Stata file
write.dta(new.data, "c:/newdata.dta")

# SPSS
write.foreign(new.data, "c:/newdata.txt", "c:/mydata.sps", package="SPSS")