Basic Analysis and Data Visualization

The best way to begin understanding and analyzing your data is to visualize it. Graphics can be powerful and persuasive even without conducting in-depth statistical analyses, and they can also give you necessary information about the structure of your data to help you make modeling choices.

Understanding your data

Each time you begin an analysis with data, address a few key questions:

  • What is the unit of analysis?
  • How many observations are in the data?
  • How much variation exists in the data?
  • What are the units of measurement in the data?
  • What values do variables take?
  • How much of the data are missing?
  • How are the variables related? (Are they correlated?)

Answering these questions typically requires looking at summary statistics—something we perhaps did quietly in the course of manipulating our data but can now return to with more care and attention. Summary statistics will provide several critical pieces of information about our data: measures of central tendency (e.g., mean, median), measures of dispersion (e.g., variance, standard deviation), and measures of distribution (e.g., skew, kurtosis, quantiles).

Use the Animals data in R to practice accessing summary statistics.

library(MASS)
data(Animals)

summary(Animals)
##       body              brain        
##  Min.   :    0.02   Min.   :   0.40  
##  1st Qu.:    3.10   1st Qu.:  22.23  
##  Median :   53.83   Median : 137.00  
##  Mean   : 4278.44   Mean   : 574.52  
##  3rd Qu.:  479.00   3rd Qu.: 420.00  
##  Max.   :87000.00   Max.   :5712.00

You can also use more specific commands to receive more detailed information about your data.

# Find the 80th percentile for animal body weight
quantile(Animals$body, .8)
##   80% 
## 525.8
# What are the units for body weight? Check the package information
?Animals

# Find the number of observations in Animals using the structure command. Note this also gives you information about the class of the object and the types of variables
str(Animals)
## 'data.frame':    28 obs. of  2 variables:
##  $ body : num  1.35 465 36.33 27.66 1.04 ...
##  $ brain: num  8.1 423 119.5 115 5.5 ...

You may also want to explore the relationships between variables in your data. For example, you could investigate the correlation between body and brain weight in Animals.

cor(Animals$brain, Animals$body)
## [1] -0.005341163
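That near-zero correlation is somewhat misleading: it is driven by a few extreme outliers (the dinosaurs' enormous body weights). A quick sketch of one common remedy, correlating on the log scale:

```r
library(MASS)
data(Animals)

# Raw correlation is dominated by a few dinosaurs with enormous body weights
cor(Animals$brain, Animals$body)

# On the log scale, body and brain weight are strongly positively correlated
cor(log(Animals$brain), log(Animals$body))
```

This is one reason to look at summary statistics and plots before trusting a single number.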

For a more detailed look at which values of your data tend to go together, you can also use crosstabulation. There are several ways to do this in R. Use the UScereal data in R to practice crosstabulation.

library(MASS)
data(UScereal)

# One option is to use table
crosstab.1 <- table(UScereal$mfr, UScereal$shelf)

# Another option with slightly different syntax is xtabs
crosstab.2 <- xtabs(~ UScereal$mfr + UScereal$shelf)
ftable(crosstab.2)
##              UScereal$shelf  1  2  3
## UScereal$mfr                        
## G                            6  7  9
## K                            4  7 10
## N                            2  0  1
## P                            2  1  6
## Q                            0  3  2
## R                            4  0  1
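Raw counts can be hard to compare across groups of different sizes. As a sketch, prop.table() converts a table of counts into proportions:

```r
library(MASS)
data(UScereal)

crosstab.1 <- table(UScereal$mfr, UScereal$shelf)

# Proportions of the whole table
prop.table(crosstab.1)

# Row proportions: the share of each manufacturer's cereals on each shelf
prop.table(crosstab.1, margin = 1)
```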

For other options and packages to explore your data, try these resources:

Basic Data Visualization

One of the many advantages of R is its flexible and extensive set of graphical capabilities. However: with great power comes great responsibility. This section will include some common ways of summarizing your data visually, but some graphical choices will be more informative than others, and some methods of visualizing your data will be actively misleading. See this excellent post for a visual representation of exactly how much choice is involved in representing our data. For a concise overview of how graphics can help or hurt your cause, check out Ken Rice’s slides on data summaries.

For each of these examples, use one of R’s built-in datasets and then try with your own data.

Note: If you are using the R console, by default R will open a graphics viewing window when you create your first plot. Each later plot command you run will then overwrite the prior plot. You can save plots to objects just like you can save vectors or dataframes (myplot <- plot()). You can also manually open new viewing windows before running a new plot command (windows() on Windows, quartz() on Mac).

Scatterplots

Scatterplots provide a simple way of examining the relationship between two variables—a more visually compelling kind of crosstabulation, if you will.

data(trees)

plot(trees$Height, trees$Volume)

Histograms

Histograms provide useful information about the distribution of variables.

data(trees)

hist(trees$Height)

Kernel Density Plots

Kernel density plots provide a smoothed visualization of the distribution of a variable.

data(trees)

tree.density <- density(trees$Height)
plot(tree.density)
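One common use is overlaying the kernel density on a histogram drawn on the density scale, as a quick sketch:

```r
data(trees)

# Draw the histogram on the density (rather than frequency) scale
hist(trees$Height, freq = FALSE)

# Overlay the kernel density estimate
lines(density(trees$Height))
```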

Bar Plots

Bar plots help to summarize data values (e.g., count, minimum, maximum, mean, etc.) using bars that can be stacked, grouped, or sorted.

data(UScereal)

# Make a bar chart showing the frequency of display shelf by manufacturer
freq.table <- table(UScereal$shelf, UScereal$mfr)
barplot(freq.table, beside=TRUE)

Dot Plots

Dot plots have a long and storied history, going back to a time before computers. The type still in use today is a simple and classic method of summarizing data popularized by William S. Cleveland. You may hear them referred to as “Cleveland dot plots” as a result.

data(UScereal)

# Calorie content in high-calorie cereals
highcal.cereal <- subset(UScereal, calories > mean(UScereal$calories))

# Here we provide labels for the dotchart using the rownames in the original data, and we provide a size setting for the dots (cex)
dotchart(highcal.cereal$calories, labels=row.names(highcal.cereal), cex=.7)

Line Graphs

Line graphs also aid in displaying the relationship between two variables, for example a variable of interest over time.

# Now we will use time series data
data(LakeHuron)

plot(LakeHuron)

Pie Charts (sigh)

Pie charts are commonly used to visualize the proportion of data belonging to certain categories. While common in news media, pie charts are controversial among data visualizers, statisticians, and data scientists. Essentially, experts argue that humans have a very difficult time visually assessing the differences among categories in a single pie chart (because it requires evaluating area rather than length, as in a bar chart), and that comparing across multiple pie charts is nearly impossible. Great alternatives to a pie chart include bar charts and dot plots. Think carefully about whether the point you want to communicate means that you must use a pie chart, and consider alternatives.

data(iris)

iris.tab <- table(iris$Species)
pie(iris.tab)

Box Plots

Box/whisker plots summarize the distribution of data in a collapsed format, typically representing the median of the data in each of several categories, the quartiles, and any outliers. Violin plots are a variation on boxplots utilizing kernel density to illustrate the density of the data distribution rather than a collapsed summary. Check out this resource for a summary of variations on the basic boxplot.

data(iris)

boxplot(Sepal.Width~Species, data=iris)

Editing Base Graphics

You may have noticed that R includes a significant amount of information in plots by default, often including axis labels and titles or colors, but that whether and what is included varies. To edit these characteristics of plots yourself, you will commonly use the plot() command.

Chapter 12 of the R manual provides an excellent summary of all of the arguments plot() can take, as well as modifications you can make after running plot(). Stat Methods also provides an excellent overview of all your options for lines, points, colors, etc., in base R graphics. Their axes and text overview also covers how to add labels to your graph intuitively. Note that sometimes, you will be running commands in sequence rather than using optional arguments within the plot() command (e.g., with abline).
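As a sketch of these options in action, here is the trees scatterplot from earlier with a title, axis labels, point styling, and a regression line added afterward with abline() (the label text is our own choice; per the dataset documentation, Height is in feet and Volume in cubic feet):

```r
data(trees)

# Labeled scatterplot with filled points in a custom color
plot(trees$Height, trees$Volume,
     main = "Tree Volume by Height",
     xlab = "Height (ft)", ylab = "Volume (cubic ft)",
     pch = 19, col = "steelblue")

# Add a fitted regression line in a separate command, after the plot exists
abline(lm(Volume ~ Height, data = trees))
```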

More Advanced Base Graphics

In addition to having built-in data, R also has a built-in demo for some of its more advanced graphics options. You can walk through these demos as follows:

# Overview of customizing your graphics
demo(graphics)

# Overview of heat maps and shading
demo(image)

# Overview of 3D graphics
demo(persp)

Even More Advanced Graphics: ggplot2

In addition to base graphics, R also has several packages that you can use to make even more advanced graphics. ggplot2 is one popular option. Created by Hadley Wickham (currently chief scientist at RStudio and creator of the tidyverse, which we will explore later), ggplot2 has a different syntax from base plot commands, which can make it challenging to master, but the payoffs are great.

The basics of ggplot2 are contained in the qplot() function. This is ggplot2’s analog to base plot() and includes a ton of options:

library(ggplot2)
?qplot

qplot(x, y = NULL, ..., data, facets = NULL, margins = FALSE,
  geom = "auto", xlim = c(NA, NA), ylim = c(NA, NA), log = "",
  main = NULL, xlab = deparse(substitute(x)),
  ylab = deparse(substitute(y)), asp = NA, stat = NULL, position = NULL)

Test drive qplot using the mtcars dataset:

library(ggplot2)
qplot(mpg, wt, data = mtcars, color = cyl)

qplot(mpg, wt, data = mtcars, size = cyl)

Or use the more standard ggplot() function to create aesthetically pleasing and complex graphics, like this one, which recreates our barplot from earlier.

library(ggplot2)
library(MASS)
data(UScereal)

g <- ggplot(UScereal, aes(x = mfr, fill = factor(shelf)))
g + geom_bar(position = "dodge")

This graphics section from Cookbook for R covers each of the graph types that ggplot2 can handle in more depth, including examples of syntax and options. The index for ggplot2’s documentation also provides a rundown of its main options and functions to create whatever type of plot you need.

Colors

All of your graphics should have informative titles and labels to help the audience understand your message and your data, but colors can also aid in telling your story visually. ggplot2 includes several color scales for you to choose from, or you can specify colors of your choosing, typically using hex codes. Use this guide to find colors you like manually.

In choosing colors for your graphics, there are two key principles to consider: what colors will communicate your underlying message, and what colors will be most legible to the audience. The first can help you choose a general set of colors for your graphics, while the second will serve as a constraint. For example, if you are producing a bar plot, you may want to color bars differently according to group or category. If your underlying message is that the groups significantly differ on several dimensions, you may want your colors to strongly contrast, rather than two different shades of blue.

# Returning to our cereal example from before
data(UScereal)

# Make a bar chart showing the frequency of display shelf by manufacturer
freq.table <- table(UScereal$shelf, UScereal$mfr)
# Using 3 hand-selected hex codes for colors that "strongly contrast"
barplot(freq.table, beside=TRUE, col=c("#3335FF","#33FF3F","#FF3347"))

# Using shades of blue
barplot(freq.table, beside=TRUE, col=c("#33B2FF","#3377FF","#4533FF"))

What problems do you see with using either of these two color options?

  • The “strong contrast” option, in addition to being hideous, is not colorblind-friendly and may create problems if you need to print your graphic in black-and-white or grayscale.

  • The three shades of blue might look different enough on your computer screen, but will they look sufficiently different in print? What if the projector light is not very bright? Will those colors remain stable when published on the web, or will they render differently?

If you are curious about color theory and also generating graphics that are “safe” across multiple formats, check out the following resources:

Saving Graphics

You can save graphics in various formats to later include in LaTeX or other documents. Here are some common ways of saving your visuals out of R.

# Save out to file
pdf("myplot.pdf")
# or png("myplot.png")
# or tiff("myplot.tiff")
# or svg("myplot.svg")
# or jpeg("myplot.jpg")
plot(...)
dev.off()

# You can also manually set dimensions
pdf("myplot.pdf", width=5, height=5)
plot(...)
dev.off()

# For ggplot, you can use a native function
ggsave("myplot.pdf")

Note that you can also arrange several plots on a single page—and thus save them to a single file, rather than manually arranging them in your final document—by calling par() after opening a graphics device.

par(mfrow=c(nrow, ncol))
# mfrow fills plots by row, mfcol fills by column
plot(...)

# For example, to print 6 plots in 2 columns and 3 rows and fill by row
par(mfrow=c(3,2))
plot(...)
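Putting the pieces together, here is a sketch that writes four plots of the trees data to a single PDF (the filename is just an example):

```r
data(trees)

# Open the device, set a 2x2 grid, draw four plots, then close the device
pdf("tree_plots.pdf")
par(mfrow = c(2, 2))
plot(trees$Height, trees$Volume)
hist(trees$Height)
plot(density(trees$Height))
boxplot(trees$Volume)
dev.off()
```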

Tools for Complex Data Analysis

The purpose of this course is not to teach statistics per se, so we will not be covering which models or methods are most appropriate for your data, but this section will give you some basics for how to do regression analysis and deal with common regression-related issues in R.

Because R is object-oriented, model output will be stored as an object that you can subsequently summarize and analyze.

R has many base functions and packages that will allow you to estimate several different types of models. Most syntax for model estimation will be of the form:

function(y ~ x1+x2, options=abc)

For a more detailed overview of regression in R, and a comparison with Stata, refer to Oscar Torres-Reyna’s overview here.

Linear Regression

For standard OLS, you will use the lm() command.

regression.1 <- lm(y ~ x1 + x2, data=my.data)

# See results
summary(regression.1)
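The stored model object can be queried directly for specific pieces of output; a sketch using the built-in trees data:

```r
data(trees)

model <- lm(Volume ~ Height, data = trees)

# Extract pieces of the fitted model
coef(model)               # coefficient estimates
confint(model)            # 95% confidence intervals
summary(model)$r.squared  # R-squared
```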

Within-Model Options

R is flexible in allowing you to transform variables in place or use a subset of your data within the regression command alone, rather than doing these transformations beforehand.

For example, suppose you want to log all your variables.

regression.2 <- lm(log(y)~log(x), data=my.data)
summary(regression.2)
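One wrinkle: arithmetic operators like ^ and * have special meanings inside a formula, so in-place transformations such as a squared term must be wrapped in I(). A sketch with the trees data:

```r
data(trees)

# I() tells R to treat Height^2 as arithmetic, not formula syntax
quad.model <- lm(Volume ~ Height + I(Height^2), data = trees)
summary(quad.model)
```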

Alternatively, you may want to use a subset of your data for a regression. For that you can index the data with which() inside both lm() and glm().

data(iris)

# Estimate a regression with the full data

model.1 <- lm(Sepal.Length ~ Petal.Width, data=iris)
summary(model.1)
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.38822 -0.29358 -0.04393  0.26429  1.34521 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.77763    0.07293   65.51   <2e-16 ***
## Petal.Width  0.88858    0.05137   17.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.478 on 148 degrees of freedom
## Multiple R-squared:  0.669,  Adjusted R-squared:  0.6668 
## F-statistic: 299.2 on 1 and 148 DF,  p-value: < 2.2e-16
# Estimate the regression only for the versicolor species
model.2 <- lm(Sepal.Length~Petal.Width, data=iris[which(iris$Species=="versicolor"),])
summary(model.2)
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Width, data = iris[which(iris$Species == 
##     "versicolor"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84155 -0.29891  0.00109  0.31213  0.95845 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.0446     0.4229   9.564 1.07e-12 ***
## Petal.Width   1.4264     0.3155   4.521 4.04e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4368 on 48 degrees of freedom
## Multiple R-squared:  0.2986, Adjusted R-squared:  0.284 
## F-statistic: 20.44 on 1 and 48 DF,  p-value: 4.035e-05
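lm() and glm() also accept a subset argument, which is equivalent to indexing the data with which(). A sketch reproducing the versicolor-only model:

```r
data(iris)

# The subset argument restricts estimation to rows where the condition holds
model.3 <- lm(Sepal.Length ~ Petal.Width, data = iris,
              subset = Species == "versicolor")
summary(model.3)
```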

Interaction Terms

Another common technique you may want to use is including interaction terms in your regression. You almost always want to include both the interaction term and the base, uninteracted variables.

To include an interaction term in R, you will use *. For example:

interaction.model <- lm(y~x1*x2, data=my.data)

Using this syntax will automatically include the base terms, so if you run summary(interaction.model) you will see coefficients for x1, x2, and x1:x2, which is your interaction term.
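A concrete sketch with the built-in iris data:

```r
data(iris)

# The * operator expands to both base terms plus the interaction:
# (Intercept), Petal.Width, Petal.Length, and Petal.Width:Petal.Length
interaction.model <- lm(Sepal.Length ~ Petal.Width * Petal.Length, data = iris)
coef(interaction.model)

# The : operator alone would include ONLY the interaction term (rarely advisable):
# lm(Sepal.Length ~ Petal.Width:Petal.Length, data = iris)
```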

Generalized Linear Models

R has many commands for executing the most common types of models. For many of these, you will use glm(). Below are some examples of common model types and syntax.

# Logit
logit.1 <- glm(binary.y ~ x1+x2+x3, data=my.data, family="binomial")

# Probit
probit.1 <- glm(binary.y ~ x1+x2+x3, data=my.data, family=binomial(link="probit"))

# Poisson
poisson.1 <- glm(count.y ~ x1+x2+x3, data=my.data, family="poisson")

Take care as you estimate models with glm()—the model may be simple to construct in R, but interpretation can be challenging. There are also often several different ways to specify models like these, so be aware that your results may vary depending on the function or package you use to estimate your models.
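For example, logit coefficients are on the log-odds scale; to get predicted probabilities you can use predict() with type = "response". A sketch using an illustrative binary outcome constructed from the iris data (the virginica variable is our own construction, purely for demonstration):

```r
data(iris)

# Illustrative binary outcome: is the flower virginica?
iris$virginica <- as.numeric(iris$Species == "virginica")

logit.model <- glm(virginica ~ Sepal.Length, data = iris, family = "binomial")

# Coefficients are log-odds; type = "response" converts to probabilities
head(predict(logit.model, type = "response"))
```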

Interpreting Model Output

Interpretation for your model will depend on the method you use to estimate it. Remember that nonlinear models or models with interaction terms may not have straightforward interpretations of effects. Even so, there are a few useful packages within R that will let you evaluate your model.

Marginal Effects

The effects package can be instrumental for calculating and evaluating marginal effects. It also helpfully includes the British Election Panel Study data and some data on marijuana-related arrests.

For a more detailed explanation of calculating marginal effects for logits and probits, check out this blog post. You will learn more about how to implement this in PS818: Maximum Likelihood Estimation.

Zelig

Courtesy of Gary King, we also have several functions for model estimation and evaluation in the Zelig package. In particular, the package includes imputation and missing data routines, matching, and counterfactual analysis for a wide range of model types. Check out the documentation here.

Post-Estimation Diagnostics

Often after model estimation you will want to evaluate whether the model you chose was appropriate, whether outliers overly influence your results, and whether your residuals are in fact normally distributed. To do this, you will use the fitted values and residuals that R stores in the model object during estimation.

# Estimate model
my.model <- lm(y~x, data=my.data)
summary(my.model)

# Store fitted values
yhat <- fitted(my.model)

# Store residuals
resid <- residuals(my.model)

# Plot residuals versus an explanatory variable
plot(my.data$x, resid)

# Plot residuals versus fitted values
plot(yhat, resid)

In addition to visually assessing your output, there are a few options built into R that can help you assess outliers, test for nonnormality, etc. All of these are summarized in this explainer on diagnostics.
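One of those built-in options is worth knowing immediately: calling plot() on a fitted lm object cycles through four standard diagnostic plots. A sketch:

```r
data(trees)

diag.model <- lm(Volume ~ Height, data = trees)

# plot() on an lm object produces four diagnostics: residuals vs. fitted,
# normal Q-Q, scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(diag.model)
```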

Outputting Results

As with your graphics, it is possible to output summary statistics, model results, and information related to your analysis into formats that can be used in Word or LaTeX. (Note: There are also methods to create graphics that are dynamic, for web publications, but we will not cover those here.)

This Stack Overflow post summarizes the main packages for outputting results. Each package has its own syntax and options.

The stargazer package is a newer option that includes a panoply of formatting choices. apsrtable also provides formatting options analogous to the presentation requirements of the #1 political science journal, the American Political Science Review.

Plotting Results

You can plot results of your regression models, rather than displaying them as tables, using many of the techniques discussed above. Commonly, however, researchers in the social sciences will use a different kind of plot—a coefficient plot, or coefplot—to directly translate tabular model output into a visual display. Coefplots will include your coefficient estimate as a point, with (typically) your 95% confidence interval as lines on either side. This allows your audience to see whether, for example, confidence intervals overlap 0.

There are several packages to do this in R, each with its own dependencies. A coefplot() function exists in the arm package, a standalone coefplot package is also available (note ggplot2 is a dependency!), and Ben Bolker has created coefplot2 (which you still have to install from source rather than from CRAN), which incorporates several more visual options. You can also create coefplots “by hand” using ggplot2. This Stack Overflow post discusses constructing a coefplot by hand, and this vignette by Ben Bolker provides hands-on examples with both base graphics and ggplot2.
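As a sketch of the “by hand” approach, here is a minimal coefplot in base graphics (the model and variable choices are ours, purely for illustration):

```r
data(iris)

cp.model <- lm(Sepal.Length ~ Petal.Width + Petal.Length, data = iris)

est <- coef(cp.model)[-1]       # drop the intercept
ci  <- confint(cp.model)[-1, ]  # matching 95% confidence intervals

# Points for the estimates, horizontal segments for the confidence intervals
plot(est, seq_along(est), xlim = range(ci), yaxt = "n",
     xlab = "Estimate", ylab = "")
segments(ci[, 1], seq_along(est), ci[, 2], seq_along(est))
axis(2, at = seq_along(est), labels = names(est), las = 1)
abline(v = 0, lty = 2)  # reference line at zero
```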