library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
I think when most new R users (and most R users in general) think of datasets, they think in terms of a spreadsheet format. This conceptual layout is often referred to as a flat file: a two-dimensional layout with no hierarchical relationships to other data, unlike the tables in a relational database. We will only work with this spreadsheet-style data in this workshop.
When a lot of people think ‘spreadsheet’, they think of an .xlsx file. The .xlsx format works very well for some types of data, like budgets or class grades, but it really is not suitable for larger datasets. How attractive does opening an .xlsx file with several million rows and a hundred columns sound? And some of Excel’s ‘smart’ features can have very stupid consequences. (Check out this nightmare.)
In R, it is more common to use plain text files. These could have the extensions .csv or just .txt. Either way they are just text. Copy and paste the text below into a plain text editor (You can do this in RStudio, too), and save it in your project/working directory as “MTcars.txt”.
"mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"
"Mazda RX4", 21, 6, 160, 110, 3.9, 2.62, 16.46, 0, 1, 4, 4
"Mazda RX4 Wag", 21, 6, 160, 110, 3.9, 2.875, 17.02, 0, 1, 4, 4
"Datsun 710", 22.8, 4, 108, 93, 3.85, 2.32, 18.61, 1, 1, 4, 1
"Hornet 4 Drive", 21.4, 6, 258, 110, 3.08, 3.215, 19.44, 1, 0, 3, 1
"Hornet Sportabout", 18.7, 8, 360, 175, 3.15, 3.44, 17.02, 0, 0, 3, 2
"Valiant", 18.1, 6, 225, 105, 2.76, 3.46, 20.22, 1, 0, 3, 1
"Duster 360", 14.3, 8, 360, 245, 3.21, 3.57, 15.84, 0, 0, 3, 4
"Merc 240D", 24.4, 4, 146.7, 62, 3.69, 3.19, 20, 1, 0, 4, 2
"Merc 230", 22.8, 4, 140.8, 95, 3.92, 3.15, 22.9, 1, 0, 4, 2
"Merc 280", 19.2, 6, 167.6, 123, 3.92, 3.44, 18.3, 1, 0, 4, 4
"Merc 280C", 17.8, 6, 167.6, 123, 3.92, 3.44, 18.9, 1, 0, 4, 4
"Merc 450SE", 16.4, 8, 275.8, 180, 3.07, 4.07, 17.4, 0, 0, 3, 3
"Merc 450SL", 17.3, 8, 275.8, 180, 3.07, 3.73, 17.6, 0, 0, 3, 3
"Merc 450SLC", 15.2, 8, 275.8, 180, 3.07, 3.78, 18, 0, 0, 3, 3
"Cadillac Fleetwood", 10.4, 8, 472, 205, 2.93, 5.25, 17.98, 0, 0, 3, 4
"Lincoln Continental", 10.4, 8, 460, 215, 3, 5.424, 17.82, 0, 0, 3, 4
"Chrysler Imperial", 14.7, 8, 440, 230, 3.23, 5.345, 17.42, 0, 0, 3, 4
"Fiat 128", 32.4, 4, 78.7, 66, 4.08, 2.2, 19.47, 1, 1, 4, 1
"Honda Civic", 30.4, 4, 75.7, 52, 4.93, 1.615, 18.52, 1, 1, 4, 2
"Toyota Corolla", 33.9, 4, 71.1, 65, 4.22, 1.835, 19.9, 1, 1, 4, 1
"Toyota Corona", 21.5, 4, 120.1, 97, 3.7, 2.465, 20.01, 1, 0, 3, 1
"Dodge Challenger", 15.5, 8, 318, 150, 2.76, 3.52, 16.87, 0, 0, 3, 2
"AMC Javelin", 15.2, 8, 304, 150, 3.15, 3.435, 17.3, 0, 0, 3, 2
"Camaro Z28", 13.3, 8, 350, 245, 3.73, 3.84, 15.41, 0, 0, 3, 4
"Pontiac Firebird", 19.2, 8, 400, 175, 3.08, 3.845, 17.05, 0, 0, 3, 2
"Fiat X1-9", 27.3, 4, 79, 66, 4.08, 1.935, 18.9, 1, 1, 4, 1
"Porsche 914-2", 26, 4, 120.3, 91, 4.43, 2.14, 16.7, 0, 1, 5, 2
"Lotus Europa", 30.4, 4, 95.1, 113, 3.77, 1.513, 16.9, 1, 1, 5, 2
"Ford Pantera L", 15.8, 8, 351, 264, 4.22, 3.17, 14.5, 0, 1, 5, 4
"Ferrari Dino", 19.7, 6, 145, 175, 3.62, 2.77, 15.5, 0, 1, 5, 6
"Maserati Bora", 15, 8, 301, 335, 3.54, 3.57, 14.6, 0, 1, 5, 8
"Volvo 142E", 21.4, 4, 121, 109, 4.11, 2.78, 18.6, 1, 1, 4, 2
Now run this code.
MT <- read.table("MTcars.txt", sep = ",", header = TRUE)
# The 'sep =' value could be any character; whitespace (" "), semicolon (";"), and tab ("\t") are also common.
head(MT)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Because we knew that our values were comma-separated, we could have just used read.csv(), which is just the read.table() function with the separator (and a few other defaults, such as header = TRUE) pre-set. (If you did not know previously, or have not guessed yet, the .csv extension means “Comma-Separated Values”.)
MT2 <- read.csv("MTcars.txt") # We could change the .txt to .csv, but it's unnecessary.
identical(MT, MT2)
## [1] TRUE
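As a minimal sketch of how the separator argument works, the snippet below reads the same two-row table from a comma-separated string and a tab-separated string. The tiny dataset and the object names are made up for illustration; the text = argument of read.table() is used only so that no files need to be created.

```r
# Hypothetical two-row dataset, written once with commas and once with tabs
csv_text <- "name,score\nAda,90\nGrace,95"
tsv_text <- "name\tscore\nAda\t90\nGrace\t95"

# Same function, different separator
from_csv <- read.table(text = csv_text, sep = ",", header = TRUE)
from_tsv <- read.table(text = tsv_text, sep = "\t", header = TRUE)

# read.csv() already assumes sep = "," and header = TRUE
from_csv2 <- read.csv(text = csv_text)

identical(from_csv, from_tsv)
identical(from_csv, from_csv2)
```

Both comparisons should come back TRUE, since the separator only affects how the text is split, not the values that end up in the dataframe.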
We can see that read.csv() is just a wrapper function for read.table() by entering read.csv (note the lack of parentheses).
read.csv
## function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
## fill = TRUE, comment.char = "", ...)
## read.table(file = file, header = header, sep = sep, quote = quote,
## dec = dec, fill = fill, comment.char = comment.char, ...)
## <bytecode: 0x7ff4f20af8a0>
## <environment: namespace:utils>
read.table() has many more options/arguments, as you can see from the output above. We can find out what these arguments are for any function by entering ?<function_name>.
?read.table
Most of these arguments have defaults and/or are optional, so they do not need to be explicitly entered. See that the defaults for strip.white and blank.lines.skip are FALSE and TRUE, respectively.
Note also that at the top of the Help page it reads ‘read.table {utils}’. This means that the function read.table() is part of the ‘utils’ package. Several packages are part of every R install, the most important of which are ‘base’, ‘utils’, ‘stats’, and ‘graphics’. Additional packages can be installed from CRAN using the command install.packages("<package_name>"), or by clicking the Packages tab in RStudio. Packages still need to be loaded to be used, however. This is done using library(<package_name>) (quotes optional).
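A small sketch of the difference between installed and loaded packages is shown below. It only checks and reports; it does not install anything. The vector name default_pkgs is made up for illustration.

```r
# Packages that ship with R are attached automatically when R starts.
default_pkgs <- c("package:base", "package:utils", "package:stats", "package:graphics")
default_pkgs %in% search()   # search() lists everything currently attached

# An add-on package must be installed once, then loaded in every session.
if (requireNamespace("readr", quietly = TRUE)) {
  library(readr)             # quotes are optional: library("readr") also works
} else {
  message("readr is not installed; run install.packages(\"readr\") first")
}
```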
The reason we are talking about packages here is that RStudio likes to use the ‘readr’ package to read in flatfiles. Let’s use the ‘Import Dataset’ button on the Environment pane. Choose ‘From CSV’ and then ‘Browse…’. Select and open the ‘MTcars.txt’ file.
RStudio gives us a preview of what is going to be read in. The readr package is an excellent package… but here, we have a problem. Where read.csv() worked perfectly, read_csv() is not going to. What are some possible solutions?
library(readr)
MTcars <- read_csv("<filepath>/MTcars.txt")
head(MTcars)
str(MTcars)
Running str() on both MTcars and MT (or MT2), we can see there are some differences (besides our row-names workaround). What are they?
We had problems because of row names. Row names were a bad feature. They should be an additional data column instead. Packages like readr and readxl are more consistent and often much faster than some of the base functionality in R. That said, I use both ‘read.csv()’ and ‘read_csv()’ to read in data. Loading and using different packages is a necessary thing to do.
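One way to turn row names into an ordinary column is sketched below. It uses the mtcars dataset that ships with R (the same data, stored with row names) rather than our text file, and the names cars_fixed and model are my own choices, so treat it as one possible approach rather than the only one.

```r
# mtcars is built into R and keeps the car names as row names.
head(rownames(mtcars), 3)

# Build a new dataframe with the names as a regular first column,
# and reset the row names to plain 1, 2, 3, ...
cars_fixed <- data.frame(model = rownames(mtcars), mtcars, row.names = NULL)

head(cars_fixed[, 1:3], 3)
str(cars_fixed$model)
```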
So we have a dataset, MTcars. This dataset is tiny. We could probably get a reasonable grasp of the values contained within it simply by eyeballing them. Our ability to do this rapidly diminishes as the dataset grows, however. To look at specific values we need to be able to find them in our dataset. The most basic way is to select a row and column numerically. Let’s look at the model column of MTcars.
MTcars[1,1] # This selects the first row and first column of MTcars
## # A tibble: 1 × 1
## model
## <chr>
## 1 Mazda RX4
MTcars[3,1] # The third value from the first column.
## # A tibble: 1 × 1
## model
## <chr>
## 1 Datsun 710
MTcars[1:3,1] # The first three.
## # A tibble: 3 × 1
## model
## <chr>
## 1 Mazda RX4
## 2 Mazda RX4 Wag
## 3 Datsun 710
MTcars[c(3, 6, 10, 20:23, 28),1] # A specific selection
## # A tibble: 8 × 1
## model
## <chr>
## 1 Datsun 710
## 2 Valiant
## 3 Merc 280
## 4 Toyota Corolla
## 5 Toyota Corona
## 6 Dodge Challenger
## 7 AMC Javelin
## 8 Lotus Europa
MTcars[seq(from = 1, to = nrow(MTcars), by = 2),1] # Can we tell what this did?
## # A tibble: 16 × 1
## model
## <chr>
## 1 Mazda RX4
## 2 Datsun 710
## 3 Hornet Sportabout
## 4 Duster 360
## 5 Merc 230
## 6 Merc 280C
## 7 Merc 450SL
## 8 Cadillac Fleetwood
## 9 Chrysler Imperial
## 10 Honda Civic
## 11 Toyota Corona
## 12 AMC Javelin
## 13 Pontiac Firebird
## 14 Porsche 914-2
## 15 Ford Pantera L
## 16 Maserati Bora
We can do the same with columns.
MTcars[1,2] # The MPG value for the Mazda RX4
## # A tibble: 1 × 1
## mpg
## <dbl>
## 1 21
Does this return a single value? What happens when we str() it? Hold this thought.
We can get all the values in the mpg column just by leaving rows blank. (Leaving the columns blank works the same way for rows, of course.)
head(MTcars[,2])
## # A tibble: 6 × 1
## mpg
## <dbl>
## 1 21.0
## 2 21.0
## 3 22.8
## 4 21.4
## 5 18.7
## 6 18.1
We have another way of selecting values, and this way will just return the values. We use the dollar sign $, which we can think of as meaning ‘select’.
head(MTcars$mpg)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1
Notice the difference in the output. This is a vector of values, not a dataframe (or tibble).
MTcars$mpg[1] # This gives us the first value in the MPG column.
## [1] 21
str(MTcars$mpg[1])
## num 21
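The difference between the two kinds of selection can be sketched side by side. This example assumes the tibble package is available (it is installed as a dependency of readr and dplyr) and uses the built-in mtcars data; df and tb are illustrative names.

```r
library(tibble)

df <- mtcars               # a base dataframe that ships with R
tb <- as_tibble(mtcars)    # the same data as a tibble

class(df[1, 1])            # base dataframes simplify a single cell to a bare value
class(tb[1, 1])            # a tibble stays a tibble, even for one cell

tb$mpg[1]                  # $ pulls the column out as a vector, then [1] picks a value
tb[["mpg"]][1]             # [[ ]] does the same job as $
```

This is why MTcars[1,2] printed a 1 × 1 tibble earlier, while MTcars$mpg[1] printed a plain number.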
With this knowledge we can get all sorts of info about our columns.
mean(MTcars$hp, na.rm = TRUE) # The extra argument here removes NAs from the data. We don't have any, but I want you to see it.
## [1] 146.6875
sd(MTcars$mpg) # Standard Deviation
## [1] 6.026948
min(MTcars$qsec) # Seconds for a quarter mile. The MT stands for 'Motor Trend' magazine, after all.
## [1] 14.5
range(MTcars$qsec)
## [1] 14.5 22.9
plot(MTcars$hp, MTcars$qsec) # Yep, more horsepower correlates with faster times
# cor(MTcars$hp, MTcars$qsec) # Gives a negative correlation of -0.7082234
What if we only wanted to compare the tested cars that had manual transmissions? The column am means ‘automatic/manual’, with automatic assigned the value 0 and manual the value 1. We could make a separate dataset. Here we have some options.
Subset with a Logical Vector
MTcars_man1 <- MTcars[MTcars$am == 1, ] # Note that we use a double "="
head(MTcars_man1)
## # A tibble: 6 × 12
## model mpg cyl disp hp drat wt qsec vs am
## <chr> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int>
## 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1
## 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1
## 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1
## 4 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1
## 5 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1
## 6 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1
## # ... with 2 more variables: gear <int>, carb <int>
Let’s think about what this did. MTcars$am == 1 creates a logical vector of TRUE and FALSE. For all of the TRUE values, a row is kept in the new dataframe.
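To make the logical vector visible, here is a short sketch using the built-in mtcars data (which has the same am column). The names is_manual and manual_cars are illustrative.

```r
is_manual <- mtcars$am == 1          # one TRUE/FALSE per row
head(is_manual)

sum(is_manual)                       # TRUE counts as 1, so this counts the manual cars

manual_cars <- mtcars[is_manual, ]   # keep only the rows where the vector is TRUE
nrow(manual_cars) == sum(is_manual)
```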
Subset with the Function subset()
MTcars_man2 <- subset(MTcars, am == 1)
identical(MTcars_man1, MTcars_man2)
## [1] TRUE
Two methods; same result.
Subset with filter()
A third method is to use the dplyr package’s filter().
library(dplyr)
MTcars_man3 <- filter(MTcars, am == 1)
identical(MTcars_man1, MTcars_man3)
## [1] FALSE
Really?!
table(MTcars_man1 == MTcars_man3)
##
## TRUE
## 156
If we run str() on MTcars_man3 we see that there is some additional metadata in the dataframe. However, if we use the table() function, we can see that all the individual values in each cell are the same. So, the dataframes contain the same values.
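The same effect can be reproduced in a small sketch: two dataframes with identical cell values but different attributes. Here the differing attribute is the row names, which is my own stand-in for whatever metadata differs in the tibbles above; a and b are illustrative names.

```r
a <- mtcars[mtcars$am == 1, ]
b <- a
rownames(b) <- NULL          # change only an attribute (the row names), not the values

identical(a, b)              # strict comparison: attributes have to match too
all(a == b)                  # cell-by-cell comparison of the values only

# all.equal() can be told to ignore attributes
isTRUE(all.equal(a, b, check.attributes = FALSE))
```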
Now we can make a plot with just the manual transmission cars.
plot(MTcars_man1$hp, MTcars_man1$qsec)
# cor(MTcars_man1$hp, MTcars_man1$qsec) # An even stronger correlation here: -0.8494566
We do not need to make entirely new dataframes to get simple statistics. For example, we can get the mean of hp simply by subsetting within the function call.
mean(MTcars$hp[MTcars$am == 1], na.rm = T)
## [1] 126.8462
# mean(MTcars$hp[MTcars$am == 1], na.rm = T) == mean(MTcars_man1$hp, na.rm = T) # [1] TRUE
We can even put in a number of logical expressions. Can you figure out what happens in these function calls?
mean(MTcars$hp[MTcars$am == 1 & MTcars$cyl > 4], na.rm = T)
## [1] 198.8
median(MTcars$hp[MTcars$am == 0 & MTcars$mpg <= 16], na.rm = T)
## [1] 210
range(MTcars$hp[MTcars$disp >= median(MTcars$disp) | MTcars$cyl >= 8], na.rm = T) # I'm using two criteria to select my 'big' engine cars.
## [1] 105 335
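It can help to build the logical conditions separately before combining them. The sketch below uses the built-in mtcars data, and the names manual and big_cyl are made up for illustration.

```r
manual  <- mtcars$am == 1
big_cyl <- mtcars$cyl > 4

sum(manual & big_cyl)    # & keeps rows where BOTH conditions are TRUE
sum(manual | big_cyl)    # | keeps rows where AT LEAST ONE condition is TRUE

# The combined vector can then be used as a subset, exactly as before
mean(mtcars$hp[manual & big_cyl])
```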
Some of this code becomes very difficult even to read, let alone to construct oneself. We never need to do things all in one line. If we ran range(MTcars$hp[MTcars$disp >= median(MTcars$disp) | MTcars$cyl >= 8], na.rm = T) over multiple lines, starting from the most embedded function call, how could we break this down more simply?
I’ll start us out.
aa <- median(MTcars$disp)
bb <-
cc <-
dd <-
ee <-
range(ee, na.rm = T) # Needs to output: [1] 105 335
Let’s read in some YEG Open Data (that I’ve messed with), and take a look at it.
LeisureAtt <- read.csv("LeisureAtt.csv") # Note we didn't use readr's read_csv()
head(LeisureAtt)
## MONTH X2011 X2012 X2013 X2014 X2015 X2016
## 1 APRIL 323137 355250 416776 389360 496258 507983
## 2 AUGUST 291265 453891 384024 321728 513189 466961
## 3 DECEMBER 263062 270756 268112 371324 402371 424747
## 4 FEBRUARY 271124 329109 339873 365523 501420 509315
## 5 JANUARY 230367 282755 353591 351448 503447 510308
## 6 JULY 320698 342683 369214 375803 523272 487693
This is the complete data set. What do you think of the layout? In some ways this data is fine (in fact it could even be better than the format we are about to learn). However, it violates the principles of ‘tidy data’. Think about how this data should look to plot it most easily. If we have an x and a y axis to plot, what would they be, and can we match a single variable to x and another to y? If we wanted to perform a single operation, like rounding to the nearest hundred, on all the important values (the attendance at each observation), how could the data be laid out to make this (and everything else we do to our data) easier?
The principles of tidy data were described by Hadley Wickham and you can learn more about them here. The main principle of tidy data is that each variable is a column and each observation is a row. Does the data above follow this principle? Even though the data above is easy to look at, and is probably how you would format the data in a presentation, it is not tidy.
names(LeisureAtt) <- sub("^X", "", names(LeisureAtt)) # First let's get rid of that pesky leading "X" (and only the leading one)
library(tidyr) # A package with tidying functions like gather(), and its reverse, spread()
LeisureAtt <- gather(LeisureAtt, YEAR, ATTENDANCE, 2:7)
head(LeisureAtt, 3)
## MONTH YEAR ATTENDANCE
## 1 APRIL 2011 323137
## 2 AUGUST 2011 291265
## 3 DECEMBER 2011 263062
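gather() still works, but the tidyr documentation now describes it as superseded by pivot_longer(). A rough equivalent is sketched below on a tiny made-up table (the attendance numbers are invented, and wide and long are illustrative names), assuming tidyr 1.0.0 or later.

```r
library(tidyr)

# Hypothetical wide table: one column per year
wide <- data.frame(
  MONTH  = c("JANUARY", "FEBRUARY"),
  `2011` = c(10, 20),
  `2012` = c(30, 40),
  check.names = FALSE      # keep the column names as "2011", not "X2011"
)

# Every column except MONTH becomes a (YEAR, ATTENDANCE) pair
long <- pivot_longer(wide, cols = -MONTH,
                     names_to = "YEAR", values_to = "ATTENDANCE")
long
```

Note that pivot_longer() works through the table row by row, so the rows come out in a different order than with gather(); the contents are the same.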
Did this do what we want? Any problems? Let’s take a look at a simple plot. (We’ll cover plotting later).
library(ggplot2) # function ggplot()
ggplot(LeisureAtt, aes(x = MONTH, y = ATTENDANCE)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1, size = 8)) +
  facet_wrap(~ YEAR, ncol = 3)
The months are plotted in alphabetical order rather than calendar order. Of course, we can fix this by changing the order of the levels of the MONTH factor.
str(LeisureAtt$MONTH) # If we had used read_csv() this would be a character vector, but we wouldn't have had the "X" on the year names.
## Factor w/ 12 levels "APRIL","AUGUST",..: 1 2 3 4 5 6 7 8 9 10 ...
levels(LeisureAtt$MONTH)
## [1] "APRIL" "AUGUST" "DECEMBER" "FEBRUARY" "JANUARY"
## [6] "JULY" "JUNE" "MARCH" "MAY" "NOVEMBER"
## [11] "OCTOBER" "SEPTEMBER"
LeisureAtt$MONTH <- factor(LeisureAtt$MONTH, levels = toupper(month.name)) # month.name is R's built-in vector of month names; toupper() makes them match our upper-case values
levels(LeisureAtt$MONTH)
## [1] "JANUARY" "FEBRUARY" "MARCH" "APRIL" "MAY"
## [6] "JUNE" "JULY" "AUGUST" "SEPTEMBER" "OCTOBER"
## [11] "NOVEMBER" "DECEMBER"
head(LeisureAtt)
## MONTH YEAR ATTENDANCE
## 1 APRIL 2011 323137
## 2 AUGUST 2011 291265
## 3 DECEMBER 2011 263062
## 4 FEBRUARY 2011 271124
## 5 JANUARY 2011 230367
## 6 JULY 2011 320698
Still looks the same, but if we plot it again, we see that we get what we want.
ggplot(LeisureAtt, aes(x = MONTH, y = ATTENDANCE)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1, size = 8)) +
  facet_wrap(~ YEAR, ncol = 3)
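The reordering of factor levels used above can be seen in isolation on a three-value vector. This is a minimal sketch; m and m2 are illustrative names.

```r
m <- factor(c("MARCH", "JANUARY", "FEBRUARY"))
levels(m)                                   # levels default to alphabetical order

m2 <- factor(m, levels = toupper(month.name))
levels(m2)[1:3]                             # now in calendar order
as.character(sort(m2))                      # sorting (and plotting) follows the level order
```

The values themselves do not change, only the order in which R considers the levels, which is why head() looked the same before and after.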
Because our data was tidy(ish), plotting is much simpler. The data for x and y are consolidated into separate columns. This might be harder for us to look at, but when we start dealing with larger datasets, ‘looking’ is not going to benefit us. Rather, we will want to use plots or functions like range(), summary(), mean(), etc., to get a sense of what our dataset contains. These operations are much easier to run, and good for us conceptually, when variables are represented in single columns.
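As a small illustration of why single columns are convenient, the sketch below takes four attendance values copied from the table shown earlier (April and August 2011, August and December 2012) and applies one operation to all of them at once. The dataframe name long and the ROUNDED column are my own.

```r
long <- data.frame(
  YEAR       = c("2011", "2011", "2012", "2012"),
  ATTENDANCE = c(323137, 291265, 453891, 270756)
)

# One call rounds every attendance value to the nearest hundred
long$ROUNDED <- round(long$ATTENDANCE, -2)
long

range(long$ATTENDANCE)                      # smallest and largest value
tapply(long$ATTENDANCE, long$YEAR, mean)    # mean attendance per year
```

With the wide layout, each of these steps would have to be repeated for every year column.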
Edmonton has a great open data portal, which you can find here: https://data.edmonton.ca/. Let’s take a look at what the Leisure Centre Attendance dataset looked like when I found it.
LeisureAttend <- read.csv(url("https://dashboard.edmonton.ca/api/views/iaa7-x8kk/rows.csv"))
head(LeisureAttend, 5)[,c(2:5,7)] # I'm omitting some columns for aesthetic reasons
## DateTime MONTH_NUMBER MONTH YEAR MONTHLY_ATTENDANCE
## 1 01/01/2011 12:00:00 AM 1 JANUARY 2011 0
## 2 01/31/2011 12:00:00 AM 1 JANUARY 2011 230367
## 3 02/28/2011 12:00:00 AM 2 FEBRUARY 2011 271124
## 4 03/31/2011 12:00:00 AM 3 MARCH 2011 337191
## 5 04/30/2011 12:00:00 AM 4 APRIL 2011 323137
This dataset, unlike the one I created, mostly conforms with the principles of tidy data, but there is something else odd here. What is it and what can we do about it? Without thinking of functions, what are some ways we could modify this dataframe?
str(LeisureAttend)
## 'data.frame': 84 obs. of 8 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DateTime : Factor w/ 83 levels "01/01/2011 12:00:00 AM",..: 1 7 14 21 28 36 42 48 54 60 ...
## $ MONTH_NUMBER : int 1 1 2 3 4 5 6 7 8 9 ...
## $ MONTH : Factor w/ 12 levels "APRIL","AUGUST",..: 5 5 4 8 1 9 7 6 2 12 ...
## $ YEAR : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ REPORT_PERIOD : Factor w/ 77 levels "11-Apr","11-Aug",..: 5 5 4 8 1 9 7 6 2 12 ...
## $ MONTHLY_ATTENDANCE: int 0 230367 271124 337191 323137 314676 304455 320698 291265 243688 ...
## $ TARGET : int NA NA NA NA NA NA NA NA NA NA ...
range(LeisureAttend$MONTHLY_ATTENDANCE)
## [1] 0 592611
LeisureAttend <- filter(LeisureAttend, MONTHLY_ATTENDANCE != 0)
ggplot(LeisureAttend, aes(x = REPORT_PERIOD, y = MONTHLY_ATTENDANCE)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1, size = 8))
We will do more with this data when we get to plotting.
What we are about to do is, in this case, completely unnecessary, but is useful to know. In the MTcars dataset some of the values are not very intuitive. Transmission type is denoted by a numeric value where it might be easier to read if those values were abbreviations like ‘auto’ and ‘man’. (I always forget what vs means; I think it is the cylinder arrangement: V-shaped or straight/inline.) Let’s change the numeric values in MTcars$am to the strings ‘auto’ and ‘man’. And let’s do this a slightly safer way.
xx <- MTcars$am
xx[xx == 0] <- "auto"
xx[MTcars$am == 1] <- "man" # This is a bit odd to do, but I did it to illustrate a point. Please ask, why?
MTcars$amChar <- xx # What did this do?
rm(xx)
identical((MTcars$am == 0), (MTcars$amChar == "auto"))
## [1] TRUE
table((MTcars$am == 0) == (MTcars$amChar == "auto"))
##
## TRUE
## 32
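An alternative to the approach above, sketched here on the built-in mtcars data, is to let factor() do the relabelling in one step. This is not the method used in this section, just another common option; am_label is an illustrative name.

```r
# Map the value 0 to "auto" and 1 to "man" in a single call
am_label <- factor(mtcars$am, levels = c(0, 1), labels = c("auto", "man"))

table(am_label)                 # how many of each
table(mtcars$am, am_label)      # cross-check: every 0 is "auto", every 1 is "man"
```

The result is a factor rather than a character vector, which matters for the discussion of factors further on.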
What if we wanted to create a factor variable for engine size called eng_size, where engines with a displacement more than one standard deviation above the mean are classed “big”, more than one below “small”, and everything in between “avr” for ‘average’? We could do this with a ‘for’ loop and some ‘if/else’ statements like below.
xx <- character()
for (i in 1:nrow(MTcars)) {
  if (MTcars$disp[i] > (mean(MTcars$disp) + sd(MTcars$disp))) {
    xx[i] <- "big"
  } else if (MTcars$disp[i] < (mean(MTcars$disp) - sd(MTcars$disp))) {
    xx[i] <- "small"
  } else {
    xx[i] <- "avr"
  }
}
MTcars$eng_size <- factor(xx, levels = c("small", "avr", "big"), ordered = TRUE)
str(MTcars$eng_size)
## Ord.factor w/ 3 levels "small"<"avr"<..: 2 2 2 2 3 2 3 2 2 2 ...
It is good to know how to write ‘for’, ‘while’ and ‘if’ statements when you learn about programming in general, but I want to teach you how to use R. The way to use R properly is to avoid using these whenever possible. An experienced R programmer would know there is a function that does this for you (one that I should have known about, but did not, for the first two years I used R). It is called ifelse().
# ifelse("Is this TRUE?", "Yes, so = ", "No, so = ")
xx <- ifelse(MTcars$disp > (mean(MTcars$disp) + sd(MTcars$disp)), "big",
             ifelse(MTcars$disp < (mean(MTcars$disp) - sd(MTcars$disp)), "small", "avr"))
xx <- factor(xx, levels = c("small", "avr", "big"), ordered = TRUE)
identical(xx, MTcars$eng_size)
## [1] TRUE
Not only is ifelse() easier to write and read (i.e., elegant), it is also typically much faster than a ‘for’ loop on large vectors, because it works on the whole vector at once. R is not a fast programming language by any means. While ifelse() runs within R, many functions like those in the dplyr package actually run outside of R, in this case in C++. Finding a specific function in another R package can sometimes turn an hour of computational time into seconds, literally. (You can wrap a function call in system.time() to compare computation speeds.)
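A rough way to try the timing comparison yourself is sketched below. It uses a made-up vector of 100,000 random numbers instead of MTcars, because 32 rows are too few to time meaningfully, and it computes the two thresholds once up front (the loop in the text recomputes mean() and sd() on every pass). All object names are illustrative, and the timings will differ from machine to machine.

```r
set.seed(42)
x  <- rnorm(1e5)              # hypothetical data: 100,000 random values
hi <- mean(x) + sd(x)         # upper threshold, computed once
lo <- mean(x) - sd(x)         # lower threshold, computed once

# Version 1: explicit loop with if/else
loop_time <- system.time({
  out_loop <- character(length(x))
  for (i in seq_along(x)) {
    if (x[i] > hi) {
      out_loop[i] <- "big"
    } else if (x[i] < lo) {
      out_loop[i] <- "small"
    } else {
      out_loop[i] <- "avr"
    }
  }
})

# Version 2: vectorised ifelse()
vec_time <- system.time(
  out_vec <- ifelse(x > hi, "big", ifelse(x < lo, "small", "avr"))
)

identical(out_loop, out_vec)  # both versions should classify every value the same way
loop_time["elapsed"]
vec_time["elapsed"]
```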
When we changed the zeros to “auto”, the vector xx was changed into a character vector. If we try to do this with factors we will have trouble.
xx <- MTcars$eng_size
xx[xx == "big"] <- "large"
## Warning in `[<-.factor`(`*tmp*`, xx == "big", value = "large"): invalid
## factor level, NA generated
The solution here is to use a line of code like this:
xx <- MTcars$eng_size
levels(xx)[levels(xx) == "big"] <- "large"
Of course, we could also just turn xx into a character vector and modify it that way too.
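Both routes can be compared on a small made-up factor. This is a minimal sketch; sizes, renamed, as_text and refactored are illustrative names.

```r
sizes <- factor(c("small", "big", "avr", "big"))

# Route 1: rename the level itself
renamed <- sizes
levels(renamed)[levels(renamed) == "big"] <- "large"
as.character(renamed)

# Route 2: convert to character, edit the values, convert back
as_text <- as.character(sizes)
as_text[as_text == "big"] <- "large"
refactored <- factor(as_text)
as.character(refactored)
```

Both routes give the same values. Note that route 2 rebuilds the factor from scratch, so any custom level order (such as the ordered levels of eng_size) would need to be supplied again.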
Factors also have another annoying habit. Their levels will not go away even after the values are removed. Let’s pretend we only want the summer months from LeisureAttend.
LeisAttSummer <- filter(LeisureAttend, MONTH %in% c("JULY", "AUGUST", "SEPTEMBER"))
str(LeisAttSummer$MONTH)
## Factor w/ 12 levels "APRIL","AUGUST",..: 6 2 12 6 2 12 2 12 6 6 ...
We still have a factor with 12 levels. The solution is to use droplevels(). We could have just wrapped it around the original filter() call.
LeisAttSummer <- droplevels(filter(LeisureAttend, MONTH %in% c("JULY", "AUGUST", "SEPTEMBER")))
str(LeisAttSummer$MONTH)
## Factor w/ 3 levels "AUGUST","JULY",..: 2 1 3 2 1 3 1 3 2 2 ...
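The same behaviour shows up with plain base-R subsetting, as in this small sketch on a made-up factor (months and summer are illustrative names).

```r
months <- factor(c("JULY", "MAY", "AUGUST", "MAY"))

summer <- months[months != "MAY"]   # the MAY values are gone...
nlevels(summer)                     # ...but the MAY level is still recorded

levels(droplevels(summer))          # droplevels() removes the unused level
```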
%>%
The pipe was an innovation originally from the magrittr package, but is now frequently used in many packages, especially dplyr and tidyr. Code can be conceptually hard to read as it typically becomes more and more embedded, meaning that the first thing your code is doing is usually in the centre, within the innermost “()”. With the pipe operator, you can mostly think of your code as meaning “Take these data, and then (%>%) do this, and then (%>%) do that…”. How would we re-write the code above with the pipe?
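Before attempting the rewrite, here is the same idea on a different dataset. This sketch uses the built-in mtcars data and dplyr’s filter() and arrange(); nested and piped are illustrative names.

```r
library(dplyr)

# Nested: read from the inside out (filter, then arrange, then head)
nested <- head(arrange(filter(mtcars, am == 1), desc(mpg)), 3)

# Piped: read from top to bottom, "and then... and then..."
piped <- mtcars %>%
  filter(am == 1) %>%
  arrange(desc(mpg)) %>%
  head(3)

identical(nested, piped)
```

The two versions perform the same steps in the same order; the pipe simply passes the result of each step in as the first argument of the next.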
LeisAttSummer <-