DC 2: Reading In and Cleaning Data

Reading in and Cleaning “Flatfiles”

I think when most new R users (and most R users in general) think of datasets, they think in terms of a spreadsheet format. This conceptual layout is often referred to as a flatfile: a two-dimensional layout with no hierarchical relationships to other data, unlike the linked tables of a relational database. We will only work with this spreadsheet-style data in this workshop.

When a lot of people think ‘spreadsheet’, they think of an .xlsx file. The .xlsx format works very well for some types of data like budgets or class grades, but it really is not suitable for larger datasets. How attractive does opening an .xlsx file with several million rows and a hundred columns sound? And some of Excel’s ‘smart’ features can have very stupid consequences. (Check out this nightmare.)

In R, it is more common to use plain text files. These could have the extensions .csv or just .txt. Either way they are just text. Copy and paste the text below into a plain text editor (You can do this in RStudio, too), and save it in your project/working directory as “MTcars.txt”.

"mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"
"Mazda RX4", 21, 6, 160, 110, 3.9, 2.62, 16.46, 0, 1, 4, 4
"Mazda RX4 Wag", 21, 6, 160, 110, 3.9, 2.875, 17.02, 0, 1, 4, 4
"Datsun 710", 22.8, 4, 108, 93, 3.85, 2.32, 18.61, 1, 1, 4, 1
"Hornet 4 Drive", 21.4, 6, 258, 110, 3.08, 3.215, 19.44, 1, 0, 3, 1
"Hornet Sportabout", 18.7, 8, 360, 175, 3.15, 3.44, 17.02, 0, 0, 3, 2
"Valiant", 18.1, 6, 225, 105, 2.76, 3.46, 20.22, 1, 0, 3, 1
"Duster 360", 14.3, 8, 360, 245, 3.21, 3.57, 15.84, 0, 0, 3, 4
"Merc 240D", 24.4, 4, 146.7, 62, 3.69, 3.19, 20, 1, 0, 4, 2
"Merc 230", 22.8, 4, 140.8, 95, 3.92, 3.15, 22.9, 1, 0, 4, 2
"Merc 280", 19.2, 6, 167.6, 123, 3.92, 3.44, 18.3, 1, 0, 4, 4
"Merc 280C", 17.8, 6, 167.6, 123, 3.92, 3.44, 18.9, 1, 0, 4, 4
"Merc 450SE", 16.4, 8, 275.8, 180, 3.07, 4.07, 17.4, 0, 0, 3, 3
"Merc 450SL", 17.3, 8, 275.8, 180, 3.07, 3.73, 17.6, 0, 0, 3, 3
"Merc 450SLC", 15.2, 8, 275.8, 180, 3.07, 3.78, 18, 0, 0, 3, 3
"Cadillac Fleetwood", 10.4, 8, 472, 205, 2.93, 5.25, 17.98, 0, 0, 3, 4
"Lincoln Continental", 10.4, 8, 460, 215, 3, 5.424, 17.82, 0, 0, 3, 4
"Chrysler Imperial", 14.7, 8, 440, 230, 3.23, 5.345, 17.42, 0, 0, 3, 4
"Fiat 128", 32.4, 4, 78.7, 66, 4.08, 2.2, 19.47, 1, 1, 4, 1
"Honda Civic", 30.4, 4, 75.7, 52, 4.93, 1.615, 18.52, 1, 1, 4, 2
"Toyota Corolla", 33.9, 4, 71.1, 65, 4.22, 1.835, 19.9, 1, 1, 4, 1
"Toyota Corona", 21.5, 4, 120.1, 97, 3.7, 2.465, 20.01, 1, 0, 3, 1
"Dodge Challenger", 15.5, 8, 318, 150, 2.76, 3.52, 16.87, 0, 0, 3, 2
"AMC Javelin", 15.2, 8, 304, 150, 3.15, 3.435, 17.3, 0, 0, 3, 2
"Camaro Z28", 13.3, 8, 350, 245, 3.73, 3.84, 15.41, 0, 0, 3, 4
"Pontiac Firebird", 19.2, 8, 400, 175, 3.08, 3.845, 17.05, 0, 0, 3, 2
"Fiat X1-9", 27.3, 4, 79, 66, 4.08, 1.935, 18.9, 1, 1, 4, 1
"Porsche 914-2", 26, 4, 120.3, 91, 4.43, 2.14, 16.7, 0, 1, 5, 2
"Lotus Europa", 30.4, 4, 95.1, 113, 3.77, 1.513, 16.9, 1, 1, 5, 2
"Ford Pantera L", 15.8, 8, 351, 264, 4.22, 3.17, 14.5, 0, 1, 5, 4
"Ferrari Dino", 19.7, 6, 145, 175, 3.62, 2.77, 15.5, 0, 1, 5, 6
"Maserati Bora", 15, 8, 301, 335, 3.54, 3.57, 14.6, 0, 1, 5, 8
"Volvo 142E", 21.4, 4, 121, 109, 4.11, 2.78, 18.6, 1, 1, 4, 2

Now run this code.

MT <- read.table("MTcars.txt", sep = ",", header = TRUE) 
# The 'sep =' could be anything, but whitespace (" "), semicolon, and Tab ("\t") are also common.
head(MT)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Because we knew that our values were comma-separated, we could have just used read.csv() which is just the read.table() function with the separator pre-set. (If you did not know previously, or have not guessed yet, the .csv extension means “Comma-Separated Values”.)

MT2 <- read.csv("MTcars.txt") # We could change the .txt to .csv, but it's unnecessary.
identical(MT, MT2)

## [1] TRUE
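Why did the car names end up as row names rather than as a column? According to the read.table() help page, when the header line has one fewer field than the data rows, the first column is used for the row names. The sketch below shows this on a two-row excerpt of the file, passed in through the text argument so that no file is needed. The object names txt and demo are just for illustration.

```r
# A small excerpt in the same shape as MTcars.txt:
# the header names 2 columns, but each data row has 3 fields.
txt <- "mpg,cyl
Mazda RX4,21,6
Datsun 710,22.8,4"

# 'text =' lets read.table() read from a string instead of a file.
demo <- read.table(text = txt, sep = ",", header = TRUE)

rownames(demo)  # the first field of each data row
names(demo)     # only the two columns named in the header
str(demo)
```

If the header had named all three fields, the car names would have been read as an ordinary column instead.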

We can see that read.csv() is just a wrapper function for read.table() by entering read.csv (note the lack of parentheses).

read.csv

## function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", 
##     fill = TRUE, comment.char = "", ...) 
## read.table(file = file, header = header, sep = sep, quote = quote, 
##     dec = dec, fill = fill, comment.char = comment.char, ...)
## <bytecode: 0x7ff4f20af8a0>
## <environment: namespace:utils>

read.table() has many more options/arguments as you can see from the output above. We can find out what these arguments are for any function by entering ?<function_name>.

?read.table

Most of these arguments have defaults and/or are optional so they do not need to be explicitly entered. See that the defaults for strip.white and blank.lines.skip are FALSE and TRUE, respectively.

Note also that at the top of the Help page it reads ‘read.table {utils}’. This means that the function read.table() is part of the ‘utils’ package. There are several packages that are part of every R install, the most important of which are ‘base’, ‘utils’, ‘stats’, and ‘graphics’. Additional packages can be installed from CRAN using the command install.packages("<package_name>"), or by clicking the Packages tab in RStudio. Packages still need to be loaded to be used, however. This is done using library(<package_name>) (quotes optional).

The reason we are talking about packages here is that RStudio likes to use the ‘readr’ package to read in flatfiles. Let’s use the ‘Import Dataset’ button on the Environment pane. Choose ‘From CSV’ and then ‘Browse…’. Select and open the ‘MTcars.txt’ file.

RStudio gives us a preview of what is going to be read in. The readr package is an excellent package… but here, we have a problem. Where read.csv() worked perfectly, read_csv() is not going to. What are some possible solutions?

library(readr)
MTcars <- read_csv("<filepath>/MTcars.txt")
head(MTcars)
str(MTcars)

Running str() on both MTcars and MT (or MT2), we can see there are some differences (besides our row-names workaround). What are they?

We had problems because of row names. Row names were a bad feature; the car names should be an additional data column instead. (In the code that follows, MTcars holds them in a first column called model.) Packages like readr and readxl are more consistent and often much faster than some of the base functionality in R. That said, I use both read.csv() and read_csv() to read in data. Loading and using different packages is a necessary thing to do.
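One possible workaround, sketched here with base R only, is to copy the row names into an ordinary column before doing anything else. The example uses the mtcars data frame that ships with R, which holds the same Motor Trend data. The names cars_df and model are my own choices, not something the import dialog produces.

```r
# mtcars ships with R and stores the car names as row names.
cars_df <- data.frame(
  model = rownames(mtcars),  # copy the row names into a real column
  mtcars,
  row.names = NULL,          # replace the row names with plain 1, 2, 3, ...
  stringsAsFactors = FALSE   # keep 'model' as character in older R versions
)

head(cars_df[, 1:4], 3)
str(cars_df$model)
```

Once the names live in a column, they can be filtered, sorted, and joined like any other variable.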

Selecting and Subsetting Data

So we have a dataset MTcars. This dataset is tiny. We could probably get a reasonable grasp of the values that are contained within simply by eyeballing them. Our ability to do this rapidly diminishes as the dataset grows, however. To look at specific values we need to be able to find them in our dataset. The most basic way is to select a row and column numerically. Let’s look at the model column of MTcars.

MTcars[1,1] # This selects the first row and first column of MTcars

## # A tibble: 1 × 1
##       model
##       <chr>
## 1 Mazda RX4

MTcars[3,1] # The third value from the first column.

## # A tibble: 1 × 1
##        model
##        <chr>
## 1 Datsun 710

MTcars[1:3,1] # The first three.

## # A tibble: 3 × 1
##           model
##           <chr>
## 1     Mazda RX4
## 2 Mazda RX4 Wag
## 3    Datsun 710

MTcars[c(3, 6, 10, 20:23, 28),1] # A specific selection

## # A tibble: 8 × 1
##              model
##              <chr>
## 1       Datsun 710
## 2          Valiant
## 3         Merc 280
## 4   Toyota Corolla
## 5    Toyota Corona
## 6 Dodge Challenger
## 7      AMC Javelin
## 8     Lotus Europa

MTcars[seq(from = 1, to = nrow(MTcars), by = 2),1] # Can we tell what this did?

## # A tibble: 16 × 1
##                 model
##                 <chr>
## 1           Mazda RX4
## 2          Datsun 710
## 3   Hornet Sportabout
## 4          Duster 360
## 5            Merc 230
## 6           Merc 280C
## 7          Merc 450SL
## 8  Cadillac Fleetwood
## 9   Chrysler Imperial
## 10        Honda Civic
## 11      Toyota Corona
## 12        AMC Javelin
## 13   Pontiac Firebird
## 14      Porsche 914-2
## 15     Ford Pantera L
## 16      Maserati Bora

We can do the same with columns.

MTcars[1,2] # The MPG value for the Mazda RX4

## # A tibble: 1 × 1
##     mpg
##   <dbl>
## 1    21

Does this return a single value? What happens when we str() it? Hold this thought.

We can get all the values in the mpg column just by leaving the row index blank. (Leaving the column index blank instead returns every column for the chosen rows, of course.)

head(MTcars[,2])

## # A tibble: 6 × 1
##     mpg
##   <dbl>
## 1  21.0
## 2  21.0
## 3  22.8
## 4  21.4
## 5  18.7
## 6  18.1

We have another way of selecting values, and this way will just return the value. We use the dollar sign $ which we can think of as meaning ‘select’.

head(MTcars$mpg)

## [1] 21.0 21.0 22.8 21.4 18.7 18.1

Notice the difference in the output. This is a vector of values, not a dataframe (or tibble).

MTcars$mpg[1] # This gives us the first value in the MPG column.

## [1] 21

str(MTcars$mpg[1])

##  num 21
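The difference between these selection styles can be checked directly. The sketch below uses the built-in mtcars data frame rather than our tibble, so it runs without any packages. One caveat: a base data frame drops a single selected column to a vector in df[, j], whereas a tibble, as in the output above, keeps it as a one-column tibble.

```r
one_col_df <- mtcars["mpg"]      # single brackets, one index: still a data frame
as_vector1 <- mtcars[["mpg"]]    # double brackets: the column itself
as_vector2 <- mtcars$mpg         # dollar sign: the same as [["mpg"]]
dropped    <- mtcars[, "mpg"]    # base data frames drop to a vector here

class(one_col_df)
class(as_vector1)
identical(as_vector1, as_vector2)
identical(as_vector1, dropped)

# drop = FALSE keeps the data frame structure, as a tibble does by default.
class(mtcars[, "mpg", drop = FALSE])
```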

With this knowledge we can get all sorts of info about our columns.

mean(MTcars$hp, na.rm = TRUE) # The extra argument here removes NAs from the data. We don't have any, but I want you to see it.

## [1] 146.6875

sd(MTcars$mpg) # Standard Deviation

## [1] 6.026948

min(MTcars$qsec) # Seconds for a quarter mile. The MT stands for 'Motor Trend' magazine, after all.

## [1] 14.5

range(MTcars$qsec)

## [1] 14.5 22.9

plot(MTcars$hp, MTcars$qsec) # Yep, more horsepower correlates with faster times

# cor(MTcars$hp, MTcars$qsec) # Gives a negative correlation of -0.7082234

What if we only wanted to compare the tested cars that had manual transmissions? The column am means ‘automatic/manual’, with automatic coded as 0 and manual as 1. We could make a separate dataset. Here we have some options.

Subset with a Logical Vector

MTcars_man1 <- MTcars[MTcars$am == 1, ] # Note that we use a double "="
head(MTcars_man1)

## # A tibble: 6 × 12
##            model   mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##            <chr> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int>
## 1      Mazda RX4  21.0     6 160.0   110  3.90 2.620 16.46     0     1
## 2  Mazda RX4 Wag  21.0     6 160.0   110  3.90 2.875 17.02     0     1
## 3     Datsun 710  22.8     4 108.0    93  3.85 2.320 18.61     1     1
## 4       Fiat 128  32.4     4  78.7    66  4.08 2.200 19.47     1     1
## 5    Honda Civic  30.4     4  75.7    52  4.93 1.615 18.52     1     1
## 6 Toyota Corolla  33.9     4  71.1    65  4.22 1.835 19.90     1     1
## # ... with 2 more variables: gear <int>, carb <int>

Let’s think about what this did. MTcars$am == 1 creates a logical vector of TRUE and FALSE. For all of the TRUE values, a row is kept in the new dataframe.
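To see the logical vector itself, it can help to store it under a name first. This is a minimal sketch on the built-in mtcars data frame; is_manual and manual_rows are illustrative names.

```r
is_manual <- mtcars$am == 1   # one TRUE/FALSE per row

head(is_manual)
sum(is_manual)                # TRUE counts as 1, so this counts the manual cars
which(is_manual)              # the row numbers that would be kept

manual_rows <- mtcars[is_manual, ]
nrow(manual_rows) == sum(is_manual)
```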

Subset with the Function subset()

MTcars_man2 <- subset(MTcars, am == 1)
identical(MTcars_man1, MTcars_man2)

## [1] TRUE

Two methods; same result.

Subset with filter()

A third method is to use the dplyr package’s filter().

library(dplyr)
MTcars_man3 <- filter(MTcars, am == 1)
identical(MTcars_man1, MTcars_man3)

## [1] FALSE

Really?!

table(MTcars_man1 == MTcars_man3)

## 
## TRUE 
##  156

If we run str() on MTcars_man3 we see that there is some additional metadata in the dataframe. However, if we use the table() function, we can see that all the individual values in each cell are the same. So, the dataframes are the same.

Now we can make a plot with just the manual transmission cars.

plot(MTcars_man1$hp, MTcars_man1$qsec)

# cor(MTcars_man1$hp, MTcars_man1$qsec) # An even stronger correlation here: -0.8494566

We do not need to make entirely new dataframes to get simple statistics. For example, we can get the mean of hp simply by subsetting within the function call.

mean(MTcars$hp[MTcars$am == 1], na.rm = T)

## [1] 126.8462

# mean(MTcars$hp[MTcars$am == 1], na.rm = T) == mean(MTcars_man1$hp, na.rm = T) # [1] TRUE

We can even put in a number of logical expressions. Can you figure out what happens in these function calls?

mean(MTcars$hp[MTcars$am == 1 & MTcars$cyl > 4], na.rm = T)

## [1] 198.8

median(MTcars$hp[MTcars$am == 0 & MTcars$mpg <= 16], na.rm = T)

## [1] 210

range(MTcars$hp[MTcars$disp >= median(MTcars$disp) | MTcars$cyl >= 8], na.rm = T) # I'm using two criteria to select my 'big' engine cars.

## [1] 105 335

Some of this code becomes very difficult even to read, let alone to construct oneself. We never need to do things all in one line. If we ran range(MTcars$hp[MTcars$disp >= median(MTcars$disp) | MTcars$cyl >= 8], na.rm = T) over multiple lines, starting from the most embedded function call, how could we break this down more simply?

I’ll start us out.

aa <- median(MTcars$disp)
bb <- 
cc <- 
dd <- 
ee <- 
range(ee, na.rm = T) # Needs to output: [1] 105 335
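To show the idea without giving away the exercise, here is the same kind of step-by-step breakdown applied to the earlier mean() call for manual cars with more than four cylinders. It uses the built-in mtcars data frame, and the intermediate names are only suggestions.

```r
is_manual   <- mtcars$am == 1            # logical vector: manual transmission?
over_4_cyl  <- mtcars$cyl > 4            # logical vector: more than 4 cylinders?
keep        <- is_manual & over_4_cyl    # TRUE only where both are TRUE
hp_selected <- mtcars$hp[keep]           # the horsepower values we keep
mean(hp_selected, na.rm = TRUE)          # matches the one-line version: 198.8
```

Each intermediate object can be printed and inspected on its own, which makes mistakes much easier to spot.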

Tidy Data

Let’s read in some YEG Open Data (that I’ve messed with), and take a look at it.

LeisureAtt <- read.csv("LeisureAtt.csv") # Note we didn't use readr's read_csv()
head(LeisureAtt)

##      MONTH  X2011  X2012  X2013  X2014  X2015  X2016
## 1    APRIL 323137 355250 416776 389360 496258 507983
## 2   AUGUST 291265 453891 384024 321728 513189 466961
## 3 DECEMBER 263062 270756 268112 371324 402371 424747
## 4 FEBRUARY 271124 329109 339873 365523 501420 509315
## 5  JANUARY 230367 282755 353591 351448 503447 510308
## 6     JULY 320698 342683 369214 375803 523272 487693

This is the complete data set. What do you think of the layout? In some ways this data is fine (in fact it could even be better than the format we are about to learn). However, it violates the principles of ‘tidy data’. Think about how this data should look to plot it most easily. If we have an x and a y axis to plot what would they be, and can we match a single variable to x and another to y? If we wanted to perform a single operation, like rounding to the nearest hundred, to all the important values (the attendance at each observation) how could the data be laid out to make this (and everything else we do to our data) easier?

The principles of tidy data were described by Hadley Wickham and you can learn more about them here. The main principle of tidy data is that each variable is a column and each observation is a row. Does the data above follow this principle? Even though the data above is easy to look at, and would probably be how you would format the data in a presentation, it is not tidy.

names(LeisureAtt) <- gsub("X", "", names(LeisureAtt)) # First let's get rid of that pesky "X"
library(tidyr) # A package with tidying functions like gather(), and its reverse, spread()
LeisureAtt <- gather(LeisureAtt, YEAR, ATTENDANCE, 2:7)
head(LeisureAtt, 3)

##      MONTH YEAR ATTENDANCE
## 1    APRIL 2011     323137
## 2   AUGUST 2011     291265
## 3 DECEMBER 2011     263062
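To make the reshaping less mysterious, the sketch below does by hand, in base R, roughly what gather() did for us, using only the first two months and two years from the wide table. The names wide and long are illustrative. (Newer versions of tidyr also provide pivot_longer() as the successor to gather().)

```r
# Two rows and two year columns copied from the wide table above.
wide <- data.frame(
  MONTH = c("APRIL", "AUGUST"),
  X2011 = c(323137, 291265),
  X2012 = c(355250, 453891)
)

# Stack the year columns: one row per MONTH/YEAR combination.
long <- data.frame(
  MONTH      = rep(wide$MONTH, times = 2),
  YEAR       = rep(c("2011", "2012"), each = nrow(wide)),
  ATTENDANCE = c(wide$X2011, wide$X2012)
)
long

# With one ATTENDANCE column, a single call covers every observation.
round(long$ATTENDANCE, -2)   # round to the nearest hundred
```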

Did this do what we want? Any problems? Let’s take a look at a simple plot. (We’ll cover plotting later).

library(ggplot2) # function ggplot()
ggplot(LeisureAtt, aes(x = MONTH, y = ATTENDANCE)) + geom_bar(stat = "identity") + theme(axis.text.x  = element_text(angle=60, hjust = 1, vjust=1, size=8)) + facet_wrap(~ YEAR, ncol = 3)

The months are plotted in alphabetical order rather than calendar order. Of course, we can fix this by changing the order of the factor levels we have for MONTH.

str(LeisureAtt$MONTH) # If we had used read_csv() this would be a character vector, but we wouldn't have had the "X" on the year names.

##  Factor w/ 12 levels "APRIL","AUGUST",..: 1 2 3 4 5 6 7 8 9 10 ...

levels(LeisureAtt$MONTH)

##  [1] "APRIL"     "AUGUST"    "DECEMBER"  "FEBRUARY"  "JANUARY"  
##  [6] "JULY"      "JUNE"      "MARCH"     "MAY"       "NOVEMBER" 
## [11] "OCTOBER"   "SEPTEMBER"

LeisureAtt$MONTH <- factor(LeisureAtt$MONTH, levels = toupper(month.name)) # month.name is a built-in vector of month names; toupper() upper-cases them to match our data
levels(LeisureAtt$MONTH)

##  [1] "JANUARY"   "FEBRUARY"  "MARCH"     "APRIL"     "MAY"      
##  [6] "JUNE"      "JULY"      "AUGUST"    "SEPTEMBER" "OCTOBER"  
## [11] "NOVEMBER"  "DECEMBER"
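The effect of the levels argument is easier to see on a tiny vector. The sketch below uses three month names (months_chr is an illustrative name) and compares the default alphabetical levels with calendar-ordered ones. Note that since R 4.0.0, read.csv() no longer converts strings to factors by default, so MONTH may arrive as a character vector; the factor() call works the same way in either case.

```r
months_chr <- c("APRIL", "AUGUST", "JANUARY")

alphabetical <- factor(months_chr)   # default: levels sorted alphabetically
levels(alphabetical)

# month.name is a built-in vector, "January" to "December".
calendar <- factor(months_chr, levels = toupper(month.name))
levels(calendar)[1:4]

# Sorting (and plotting) follows the level order, not the alphabet.
as.character(sort(calendar))
```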

head(LeisureAtt)

##      MONTH YEAR ATTENDANCE
## 1    APRIL 2011     323137
## 2   AUGUST 2011     291265
## 3 DECEMBER 2011     263062
## 4 FEBRUARY 2011     271124
## 5  JANUARY 2011     230367
## 6     JULY 2011     320698

Still looks the same, but if we plot it again, we see that we get what we want.

ggplot(LeisureAtt, aes(x = MONTH, y = ATTENDANCE)) + geom_bar(stat = "identity") + theme(axis.text.x  = element_text(angle=60, hjust = 1, vjust=1, size=8)) + facet_wrap(~ YEAR, ncol = 3)

Because our data was tidy(ish), plotting is much simpler. The data for x and y are consolidated into separate columns. This might be harder for us to look at, but when we start dealing with larger datasets, ‘looking’ is not going to benefit us. Rather, we will want to use plots or functions like range(), mean(), summary(), etc., to get a sense of what our dataset contains. These operations are much easier to run, and good for us conceptually, when variables are represented in single columns.

Reading in Files from the Web

Edmonton has a great open data portal, which you can find here: https://data.edmonton.ca/. Let’s take a look at what the Leisure Centre Attendance dataset looked like when I found it.

LeisureAttend <- read.csv(url("https://dashboard.edmonton.ca/api/views/iaa7-x8kk/rows.csv"))
head(LeisureAttend, 5)[,c(2:5,7)] # I'm omitting some columns for aesthetic reasons

##                 DateTime MONTH_NUMBER    MONTH YEAR MONTHLY_ATTENDANCE
## 1 01/01/2011 12:00:00 AM            1  JANUARY 2011                  0
## 2 01/31/2011 12:00:00 AM            1  JANUARY 2011             230367
## 3 02/28/2011 12:00:00 AM            2 FEBRUARY 2011             271124
## 4 03/31/2011 12:00:00 AM            3    MARCH 2011             337191
## 5 04/30/2011 12:00:00 AM            4    APRIL 2011             323137

This dataset, unlike the one I created, mostly conforms with the principles of tidy data, but there is something else odd here. What is it and what can we do about it? Without thinking of functions, what are some ways we could modify this dataframe?

str(LeisureAttend)

## 'data.frame':    84 obs. of  8 variables:
##  $ ID                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DateTime          : Factor w/ 83 levels "01/01/2011 12:00:00 AM",..: 1 7 14 21 28 36 42 48 54 60 ...
##  $ MONTH_NUMBER      : int  1 1 2 3 4 5 6 7 8 9 ...
##  $ MONTH             : Factor w/ 12 levels "APRIL","AUGUST",..: 5 5 4 8 1 9 7 6 2 12 ...
##  $ YEAR              : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ REPORT_PERIOD     : Factor w/ 77 levels "11-Apr","11-Aug",..: 5 5 4 8 1 9 7 6 2 12 ...
##  $ MONTHLY_ATTENDANCE: int  0 230367 271124 337191 323137 314676 304455 320698 291265 243688 ...
##  $ TARGET            : int  NA NA NA NA NA NA NA NA NA NA ...

range(LeisureAttend$MONTHLY_ATTENDANCE)

## [1]      0 592611

LeisureAttend <- filter(LeisureAttend, MONTHLY_ATTENDANCE != 0)

ggplot(LeisureAttend, aes(x = REPORT_PERIOD, y = MONTHLY_ATTENDANCE)) + geom_bar(stat = "identity") + theme(axis.text.x  = element_text(angle=60, hjust = 1, vjust=1, size=8))

We will do more with this data when we get to plotting.

Renaming Values and Dealing with Factors

What we are about to do is, in this case, completely unnecessary, but is useful to know. In the MTcars dataset some of the values are not very intuitive. Transmission type is denoted by a numeric value where it might be easier to read if those values were abbreviations like ‘auto’ and ‘man’. (I always forget what vs means; it is the engine's cylinder arrangement, with 0 for V-shaped and 1 for straight, i.e. inline.) Let’s change the numeric values in MTcars$am to the strings ‘auto’ and ‘man’. And let’s do this a slightly safer way.

xx <- MTcars$am
xx[xx == 0] <- "auto"
xx[MTcars$am == 1] <- "man" # This is a bit odd to do, but I did it to illustrate a point. Please ask, why?
MTcars$amChar <- xx # What did this do?
rm(xx)
identical((MTcars$am == 0), (MTcars$amChar == "auto"))

## [1] TRUE

table((MTcars$am == 0) == (MTcars$amChar == "auto"))

## 
## TRUE 
##   32
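Two things are going on in the code above that are easy to miss, sketched below on a tiny made-up vector (toy is hypothetical, not part of our data). First, assigning a string into a numeric vector silently converts the whole vector to character. Second, base R's factor() can do the relabelling in one step through its labels argument, shown here on the built-in mtcars.

```r
toy <- c(0, 1, 0)
toy[toy == 0] <- "auto"   # the whole vector is coerced to character
toy
class(toy)

# One-step alternative: map the codes 0/1 to labels with factor().
am_fac <- factor(mtcars$am, levels = c(0, 1), labels = c("auto", "man"))
table(am_fac)
```

Testing against the untouched original column, as the code above does with MTcars$am == 1, avoids relying on how the partly converted vector compares with a number.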

What if we wanted to create a factor variable for engine size called eng_size, where engines with a displacement more than one standard deviation above the mean are classed “big”, those more than one standard deviation below it “small”, and those in between “avr” for ‘average’? We could do this with a ‘for’ loop and some ‘if/else’ statements like below.

xx <- character()
for (i in 1:nrow(MTcars)) {
    if (MTcars$disp[i] > (mean(MTcars$disp) + sd(MTcars$disp))) {
        xx[i] <- "big"
    } else if (MTcars$disp[i] < (mean(MTcars$disp) - sd(MTcars$disp))) {
        xx[i] <- "small"
    } else {
        xx[i] <- "avr"
    }
}
MTcars$eng_size <- factor(xx, levels = c("small", "avr", "big"), ordered = TRUE)
str(MTcars$eng_size)

##  Ord.factor w/ 3 levels "small"<"avr"<..: 2 2 2 2 3 2 3 2 2 2 ...

It is good to know how to write ‘for’, ‘while’ and ‘if’ statements when you learn about programming in general, but I want to teach you how to use R. The way to use R properly is to avoid using these whenever possible. An experienced R programmer would know there is a function that does this for you (one that I should have known about, but did not, for the first two years I used R). It is called ifelse().

# ifelse("Is this TRUE?", "Yes, so = ", "No, so = ")
xx <- ifelse(MTcars$disp > (mean(MTcars$disp) + sd(MTcars$disp)), "big",
             ifelse(MTcars$disp < (mean(MTcars$disp) - sd(MTcars$disp)), "small", "avr"))
xx <- factor(xx, levels = c("small", "avr", "big"), ordered = TRUE)
identical(xx, MTcars$eng_size)

## [1] TRUE

Not only is ifelse() easier to write and read (i.e., elegant), it is also much, much, much, much faster than using a ‘for’ loop. R is not a fast programming language by any means. While ifelse() runs within R, many functions like those in the dplyr package actually run outside of R, in this case in C++. Finding a specific function in another R package can sometimes turn an hour of computational time into seconds, literally. (You can wrap a function call in system.time() to compare computation speeds).
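A rough way to check the speed claim on your own machine is sketched below. It repeats the built-in mtcars displacement values to get a longer vector (the repetition count is arbitrary), computes the two cut-offs once, and times both approaches with system.time(). Timings depend on the machine and R version, so no numbers are shown.

```r
disp <- rep(mtcars$disp, times = 2000)   # 64,000 values; the size is arbitrary
hi <- mean(disp) + sd(disp)              # compute the cut-offs once
lo <- mean(disp) - sd(disp)

loop_time <- system.time({
  size_loop <- character(length(disp))   # pre-allocate the result
  for (i in seq_along(disp)) {
    if (disp[i] > hi) {
      size_loop[i] <- "big"
    } else if (disp[i] < lo) {
      size_loop[i] <- "small"
    } else {
      size_loop[i] <- "avr"
    }
  }
})

vec_time <- system.time(
  size_vec <- ifelse(disp > hi, "big", ifelse(disp < lo, "small", "avr"))
)

identical(size_loop, size_vec)   # both approaches give the same labels
loop_time["elapsed"]
vec_time["elapsed"]
```

Storing the cut-offs in hi and lo also means mean() and sd() are not recomputed on every pass through the loop, which is a fairer comparison than the loop shown earlier.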

When we changed the zeros to “auto”, the vector xx was changed into a character vector. If we try to do this with factors we will have trouble.

xx <- MTcars$eng_size
xx[xx == "big"] <- "large"

## Warning in `[<-.factor`(`*tmp*`, xx == "big", value = "large"): invalid
## factor level, NA generated

The solution here is to use a line of code like this:

xx <- MTcars$eng_size
levels(xx)[levels(xx) == "big"] <- "large"

Of course, we could also just turn xx into a character vector and modify it that way too.
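Both routes can be compared on a small made-up factor (sizes is hypothetical, with the same three levels as eng_size). This is only a sketch of the two approaches described above.

```r
sizes <- factor(c("small", "big", "avr", "big"),
                levels = c("small", "avr", "big"))

# Route 1: rename the level itself; every "big" value follows along.
renamed <- sizes
levels(renamed)[levels(renamed) == "big"] <- "large"
levels(renamed)

# Route 2: go through a character vector, then rebuild the factor.
as_chr <- as.character(sizes)
as_chr[as_chr == "big"] <- "large"
rebuilt <- factor(as_chr, levels = c("small", "avr", "large"))

identical(as.character(renamed), as.character(rebuilt))
```

Route 1 is shorter and keeps the level order automatically; with route 2 the levels have to be listed again when the factor is rebuilt.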

Factors also have another annoying habit. Their levels will not go away even after every value with that level has been removed. Let’s pretend we only want the summer months from LeisureAttend.

LeisAttSummer <- filter(LeisureAttend, MONTH %in% c("JULY", "AUGUST", "SEPTEMBER"))
str(LeisAttSummer$MONTH)

##  Factor w/ 12 levels "APRIL","AUGUST",..: 6 2 12 6 2 12 2 12 6 6 ...

We still have a factor with 12 levels. The solution is to use droplevels(). We could have just added this onto the tail of the original filter() call.

LeisAttSummer <- droplevels(filter(LeisureAttend, MONTH %in% c("JULY", "AUGUST", "SEPTEMBER")))
str(LeisAttSummer$MONTH)

##  Factor w/ 3 levels "AUGUST","JULY",..: 2 1 3 2 1 3 1 3 2 2 ...
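Why unused levels matter is easiest to see in a table. The sketch below uses a small made-up factor (months and summer are illustrative names) and only base R.

```r
months <- factor(c("JULY", "AUGUST", "MAY", "JULY"))
summer <- months[months != "MAY"]   # drop the MAY values

levels(summer)               # "MAY" is still listed as a level
table(summer)                # ...and shows up with a count of zero

summer_clean <- droplevels(summer)
levels(summer_clean)
```

The same zero-count categories would otherwise appear as empty groups in summaries and plots.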

The Pipe: `%>%`

The pipe was an innovation originally from the magrittr package, but is now used in many packages, especially dplyr and tidyr. Code can be hard to read as it becomes more and more nested, meaning that the first thing your code does is usually in the centre, within the innermost “()”. With the pipe operator, you can mostly read your code as “Take these data, and then (%>%) do this, and then (%>%) do that…”. How would we re-write the code above with the pipe?

LeisAttSummer <-
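As a hint rather than an answer, here is the same pattern on the built-in mtcars data frame: a nested call and its piped equivalent. The sketch assumes the magrittr package is installed (loading dplyr also makes %>% available); the object names are my own.

```r
library(magrittr)   # provides %>%

# Nested: read from the inside out.
n_nested <- nrow(subset(mtcars, am == 1))

# Piped: read from left to right, "take mtcars, and then..., and then...".
n_piped <- mtcars %>%
  subset(am == 1) %>%
  nrow()

n_nested == n_piped
```

The pipe passes whatever is on its left as the first argument of the function on its right, which is why subset() and nrow() need no explicit data argument here.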

DC 2: Reading In and Cleaning Data

Brian Rusk

July 12th & 13th, 2017

