DC 3: Plotting Data

Packages

These packages are used in this lesson:

library(ggplot2)
library(dplyr)
library(ggthemes) # install.packages("ggthemes")
library(hflights) # install.packages("hflights")

Base Graphics

R bills itself as a language and environment for statistical computing and graphics, and base R has some pretty decent capabilities. However, the ggplot2 package is ultimately more versatile and easier to use. If you should end up using ggplot2 anyway, then we will spend more time on this. There are some very useful base R plots, and I’ll cover those, but we will not spend much time customizing them. (I admit, I have actually forgotten most of what I did at one time know about base R graphics).

As I mentioned, when your data grows very large (think: 30 columns and millions of rows), taking a look at the data is not going to be very informative. This is why we want functions like mean() or summary(). Sometimes though, a picture is worth 1000 words and a simple plot can help us understand what we are looking at. Most useful among these are histograms, density plots, and scatterplots. For example a histogram or a density plot can tell us if the data is normally distributed, which is an assumption of many statistical tests.

hist(iris$Sepal.Length[iris$Species == "setosa"])

Naturally there are many options for hist() but it is most useful for just letting the analyst visualize the data.

hist(iris$Sepal.Length[iris$Species == "setosa"], breaks = 10, col = "blue", xlab = "Sepal Length", 
    main = "Histogram for Setosa")

plot(density(iris$Petal.Length[iris$Species == "virginica"]))

As we saw, scatterplots can give us a quick visualization of relationships.

plot(MTcars$disp, MTcars$hp)

We can even produce a scatterplot matrix to look at multiple relationships.

We can even create a scatterplot matrix to see multiple relationships in the data.

pairs(~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,data=iris, main = "Simple Scatterplot Matrix")

We can see that there is a more clear relationship between petal length and width than there is between sepal length and width.

plot(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species, pch = 2)
legend(6.8, 4.4, unique(iris$Species), col = 1:length(iris$Species), pch = 2)

Adding some colour helps to see what is going on here. In short, you can do a lot with ‘base’ graphics.

GGplot2

The ‘GG’ in ggplot2 stands for ‘grammar of graphics’. The general idea is that there should be rules for hierarchically organizing information into a data visualization. This is actually quite complex and mostly beyond the scope of what we want to do today. The essential idea is that in a simple plot there are x and y coordinates and then additional information layers are added on (like the mapping of colour onto iris$Species, above). I will argue that this creates a simple, transparent, and powerful approach to creating visualizations. You can still end up with chunks of code like below to create a single plot, but the important thing is that it will (eventually - likely not today) be clear what the code is doing. And normally, you can get what you want with much less code.

theme_tufteNL <- function (base_size = 20, base_family = "serif", ticks = TRUE) 
{
    ret <- theme_bw(base_family = base_family, base_size = base_size) + 
        theme(legend.position="right", legend.background = element_blank(), legend.key = element_blank(),
            panel.background = element_blank(), panel.border = element_blank(), 
            strip.background = element_blank(), plot.background = element_blank(), 
            axis.line = element_blank(), panel.grid = element_blank())
    if (!ticks) {
        ret <- ret + theme(axis.ticks = element_blank())
    }
    ret
}


AllSta <- ggplot(AllIA1GrandAvgS, aes(x = (IDX*20), y = mean, color = IA)) + 
    geom_line() + 
    geom_errorbar(aes(ymin = mean - se, ymax = mean + se), alpha = .4, width = 0.3) + 
    geom_vline(xintercept = vlineStcueMeans, alpha = 0.5) + 
    geom_vline(xintercept = vlineStcueSD, linetype = "longdash", alpha = 0.5) + 
    scale_x_continuous(breaks = seq(0, 6000, 1000)) + scale_y_continuous(breaks = seq(0, .8, .2)) + 
    labs(title = "Statements", y = NULL, x = "Time") + 
    facet_grid(ModGroup ~.) + 
    theme_tufteNL2()

The code above creates this plot.

Simple Plot (Scatter)

# library(ggplot2)
aa <- ggplot(MTcars, aes(x = hp, y = qsec)) # The data we want to plot. "ggplot(MTcars, aes(hp, qsec))" would also work.

We have the data that we are going to plot. aes() means ‘aesthetic’.

Now to our data layer, we can add a ‘geom’ that maps the numerical data onto points.

aa + geom_point()

bb <- aa + geom_point()

We called the plot in the first line, and then assigned the plot to bb with the next. Let’s add a line that shows the relationship.

bb + geom_smooth()

## `geom_smooth()` using method = 'loess'

Constructing the plot this way is the same as writing it into one call like below.

ggplot(MTcars, aes(hp, qsec)) +
    geom_point() +
    geom_smooth()

We can also add a colour' layer into theaes()` function.

ggplot(MTcars, aes(hp, qsec, colour = cyl)) +
    geom_point() +
    geom_smooth()

## `geom_smooth()` using method = 'loess'

Is this the way we want our plot to look? How can we fix it? Think about what I said is often a source of problems. (We’ll get to labels later.)

MTcars$cyl <- as.factor(MTcars$cyl)
ggplot(MTcars, aes(hp, qsec, colour = cyl)) +
    geom_point() +
    geom_smooth()

## `geom_smooth()` using method = 'loess'

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 104.65

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 18.35

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 5.5839e-17

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4270.6

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 104.65

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 18.35

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 5.5839e-17

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 4270.6

This is not a great plot, but that has most to do with how hp is in explaining qsec times. Creating a new column that is a ratio of hp in relation to wt (weight) would be better. We have (at least) two ways of creating this column.

MTcars <- MTcars %>% # This uses the dplyr package
    mutate(hp2wt = hp/wt)

MTcars$hp2wtB <- MTcars$hp/MTcars$wt
identical(MTcars$hp2wt, MTcars$hp2wtB)

## [1] TRUE

As we can see, these did exactly the same thing. Let’s get rid of MTcars$hp2wtB.

MTcars$hp2wtB <- NULL

Now we will use the new column hp2wt as the new x variable.

ggplot(MTcars, aes(hp2wt, qsec, colour = cyl)) +
    geom_point() +
    geom_smooth()

## `geom_smooth()` using method = 'loess'

Our confidence intervals are not very informative because of the overlap. We can fix this by setting the fill (not colour) argument in the aes() of geom_smooth().

ggplot(MTcars, aes(hp2wt, qsec, colour = cyl)) +
    geom_point() +
    geom_smooth(aes(fill = cyl))

## `geom_smooth()` using method = 'loess'

Barplots

To make some barplots, we will first use the hflights dataset from the package of the same name. This dataset shows all the flights that left Houston, Texas’s two airports in 2011. Let’s take a look at the dataset.

head(hflights)
glimpse(hflights)
summary(hflights)

Running the line…

sort(table(hflights$Dest))

## 
##  AGS  BPT  GRK  GUC  PSP  BKG  EGE  HDN  ANC  ASE  MTJ  RNO  MLU  HOB  AVL 
##    1    3   42   86  106  110  110  110  125  125  164  243  292  309  350 
##  CRW  LCH  SJU  HNL  GJT  CID  DAY  BFL  CAE  LEX  GSO  DSM  GRR  OAK  JFK 
##  357  364  391  402  403  410  451  504  561  584  630  647  677  690  695 
##  ORF  AEX  ECP  SHV  SAV  VPS  SJC  RIC  HSV  RSW  ONT  SMF  GSP  MFE  XNA 
##  717  724  729  787  863  880  885  900  923  948  952 1014 1123 1128 1172 
##  LRD  CHS  TYS  PDX  PBI  SDF  AMA  LBB  CMH  ICT  CVG  PNS  TUS  LIT  MKE 
## 1188 1200 1210 1235 1253 1279 1297 1333 1348 1517 1535 1539 1565 1579 1588 
##  GPT  COS  SNA  PIT  MOB  BRO  RDU  IND  BOS  BTR  IAD  MSP  JAN  SLC  OMA 
## 1618 1657 1661 1664 1674 1692 1740 1750 1752 1762 1980 2010 2011 2033 2044 
##  MDW  JAX  CLE  MAF  LFT  PHL  MEM  FLL  MIA  STL  BWI  DTW  SEA  DCA  LGA 
## 2094 2135 2140 2306 2313 2367 2399 2462 2463 2509 2551 2601 2615 2699 2730 
##  BHM  ABQ  SFO  TUL  SAN  ELP  TPA  OKC  MCI  BNA  MCO  HRL  LAS  EWR  CLT 
## 2736 2812 2818 2924 2936 3036 3085 3170 3174 3481 3687 3983 4082 4314 4735 
##  CRP  SAT  AUS  PHX  ORD  DEN  LAX  DFW  MSY  ATL  DAL 
## 4813 4893 5022 5096 5748 5920 6064 6653 6823 7886 9820

… we can see that the most frequent destination for these flights was a place called ‘DAL’, which means Dallas, Texas, but these were not the only flights to Dallas, as ‘DFW’ means Dallas/Fort Worth International Airport. If we wanted to know which carriers operated the most flights to the Dallas area, what would we do?

# Let's use 'filter()' and the pipe operator.

What does the plot below show us? How would we recreate it?

DALhfli <- hflights %>%
    ...
DALhfli <- within(DALhfli, UniqueCarrier <- factor(UniqueCarrier, levels = names(sort(table(UniqueCarrier), decreasing = TRUE)))) # Let's not worry about what this does at the moment.
ggplot(DALhfli, ...

Obviously, we want to make visualizations of our data so that we can communicate relevant insights with our audiences. However, making visualizations can tell us a lot about the data that we work with. Previously, I showed you a barplot of the LeisureAtt data that looked like this.

The plots are a similar format, but there is something drastically different about the format of the datasets. What is it? The code below should create (roughly) the plot above, but it does not. We will work around this in the next section.

ggplot(LeisureAttend, aes(REPORT_PERIOD, MONTHLY_ATTENDANCE)) + geom_bar()

Setting Labels, etc.

Let’s use the LeisureAttend dataset to make a polished barplot.

aa <- ggplot(LeisureAttend, aes(REPORT_PERIOD, MONTHLY_ATTENDANCE)) + geom_bar(stat = "identity")
aa

This rough plot has multiple problems. First of all, there appears to be a lot of variation by both month, and year. Maybe it is good that the dataset reflects the timeseries nature of the data, but this layout makes it hard to make comparisons directly. It should be no surprise that there is always a big drop in attendance between August and September. Maybe it would be better to group the data by months rather than the sequentially arranged variable, REPORT_PERIOD. We can do this like this.

aa <- ggplot(LeisureAttend, aes(x = MONTH, y = MONTHLY_ATTENDANCE, fill = as.factor(YEAR)))
aa + geom_bar(stat = "identity")

Let’s re-arrange the bars for better comparison.

bb <- aa + geom_bar(stat = "identity", position = "dodge")
bb

How did we change the ’LeisureAttdataset so thatMONTHwas arranged by time rather than alphabetically? Let's do that toLeisureAttend`, as well.

LeisureAttend$MONTH <- factor(LeisureAttend$MONTH, levels = toupper(month.name))
# Now go back and recreate 'aa' and 'bb'

Yes, we will finally deal with the text on the x-axis now.

cc <- bb + theme(axis.text.x  = element_text(angle = 60, hjust = 1, vjust = 1, size = 8))
cc

Better, but how readable is that y-axis?

dd <- cc + scale_y_continuous("ATTENDANCE", labels = scales::comma)
dd

How about the legend title?

ee <- dd + guides(fill = guide_legend(title = "Year")) # There are at least five different ways to do this. This isn't my favourite.
ee

And I would be lying to say I like the colours in this plot. Let’s try something else.

ff <- ee + scale_fill_brewer(palette = "YlGnBu")
ff

This produces the same result as below (which is how I would normally do it).

ggplot(LeisureAttend, aes(x = MONTH, y = MONTHLY_ATTENDANCE, fill = as.factor(YEAR))) + 
    geom_bar(stat = "identity", position = "dodge") + 
    theme(axis.text.x  = element_text(angle=60, hjust = 1, vjust=1, size=8)) + 
    scale_y_continuous("ATTENDANCE", labels = scales::comma) +
    guides(fill = guide_legend(title = "Year")) +
    scale_fill_brewer(palette="YlGnBu")

To see some other ways we could change our plot we can use the ggthemes package in two ways. First, we can check out some of the themes. Have a look at the vignette by running: vignette("ggthemes").

ff + theme_tufte()

ff + theme_fivethirtyeight()

ff + theme_classic() # * from ggplot2

Naturally, they change the theme of the plot, but even more valuable than that is running them without the ‘()’. Running them this way, we can see some additional options that we can alter.

theme_tufte

Exercise

Try changing one aspect of the plot ff (this will be easier if you use the large code block above). Try doing a google search for adding a title, (add title ggplot2), changing the y-axis label (change y axis label ggplot2), or making the legen title bold… or whatever you want.

Bad Barplots

Read in the dataset and take a look.

dat <- read_csv("dat.csv")

## Parsed with column specification:
## cols(
##   Key = col_character(),
##   Value = col_double()
## )

datSum <- dat %>%
    group_by(Key) %>% 
    summarise(mean = mean(Value), upper = mean(Value) - sd(Value), lower = mean(Value) + sd(Value))
datSum

## # A tibble: 2 × 4
##     Key     mean    upper    lower
##   <chr>    <dbl>    <dbl>    <dbl>
## 1     A 8.673356 7.567418 9.779293
## 2     B 8.774080 7.681715 9.866445

Let’s make a plot from the summary data.

ggplot(data = datSum, aes(x = Key, y = mean)) + geom_bar(stat = "identity") + geom_errorbar(data = datSum, aes(x = Key, ymin = upper, ymax = lower), width = 0.2, size = 1, color = "blue")

But, does this plot tell the truth?

ggplot(dat, aes(Value, fill = Key)) + geom_density(alpha = .5)

This is what the data distributions actually look like, even though they have the same mean and same standard deviation.

And, in fact, there’s a statistically significant difference between the two distributions.

wilcox.test(dat$Value[dat$Key == "A"], dat$Value[dat$Key == "B"])

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  dat$Value[dat$Key == "A"] and dat$Value[dat$Key == "B"]
## W = 110040, p-value = 0.001049
## alternative hypothesis: true location shift is not equal to 0

# t.test(dat$Value[dat$Key == "A"], dat$Value[dat$Key == "B"])

If you are ever tempted to make a bar plot with error bars, use a boxplot instead.

ggplot(dat, aes(Key, Value)) + geom_boxplot()