Introduction
As a social psychologist first trained in SPSS, I am used to collecting and organizing my data in wide format. When data is in wide format, a subject’s responses are in a single row, and each response is in a separate column. However, many R functions expect long format. We could melt and cast with reshape2 to reshape from wide to long format, but is there a way to reshape using even less code? Luckily for us, Hadley Wickham has created the easy-to-use tidyr!
tidyr allows us to quickly and easily tidy and reorganize our data for all sorts of analyses. This is particularly helpful with a disorganized dataset. tidyr is built specifically for this purpose, and thus does less than reshape2. In particular, tidyr can only be used with existing data frames, and it cannot aggregate.
In this chapter, I will go over the hallmark functions of tidyr: gather(), separate(), unite(), and spread().
First let’s install and load the tidyr package. We will also need to use the dplyr package.
#install.packages("tidyr") # I have used "#" to "comment out" this line for this tutorial. Just take away the first "#" and you are good to go!
#install.packages("dplyr")
library(tidyr)
library(dplyr)
%>%
Why do we need dplyr? dplyr is a grammar of data manipulation. Loading dplyr gives us the pipe operator, %>%, which it re-exports from magrittr. %>% is not required to use tidyr, but it does make things easier!
%>% allows you to pipe a value forward into an expression or function call, so that you write x %>% f instead of f(x). This shorthand was created by Stefan Milton Bache in the magrittr package. To read more about this operator, see the magrittr documentation.
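To make the two equivalent forms concrete, here is a minimal sketch. The vector rt and its values are invented for this illustration and are not part of the study data.

```r
library(dplyr)  # attaches the pipe operator %>% (re-exported from magrittr)

# Hypothetical response times, made up for this example
rt <- c(411, 723, 325)

# Ordinary nested call: read from the inside out
nested <- round(mean(rt), 1)

# Piped version: read from left to right
piped <- rt %>% mean() %>% round(1)

# Both forms compute the same value
identical(nested, piped)
```

The piped version pays off once several steps are chained, because each step reads in the order it is carried out.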
The Dataframe
Here I have created a messy wide dataset. Feel free to use it to follow along!
In this example study, participants were asked to categorize three faces by clicking various buttons that represent three different categories. The time it took to click a button is in milliseconds.
n <- 10
wide <- data.frame(
  ID = 1:n,
  Face.1 = c(411, 723, 325, 456, 579, 612, 709, 513, 527, 379),
  Face.2 = c(123, 300, 400, 500, 600, 654, 789, 906, 413, 567),
  Face.3 = c(1457, 1000, 569, 896, 956, 2345, 780, 599, 1023, 678)
)
The dataset I created is messy. As you can see below, only ID has its own column; response time is split across three columns, so responses are spread over both rows and columns (by ID and by Face.1, Face.2, and Face.3).
What we want instead is one column for the condition (Face.1, Face.2, or Face.3) and one column for response time, with each row being a single observation for a participant. Participant IDs should repeat because this is a within-subject design (each participant saw each face).
## ID Face.1 Face.2 Face.3
## 1 1 411 123 1457
## 2 2 723 300 1000
## 3 3 325 400 569
## 4 4 456 500 896
## 5 5 579 600 956
## 6 6 612 654 2345
## 7 7 709 789 780
## 8 8 513 906 599
## 9 9 527 413 1023
## 10 10 379 567 678
Gather()
By using the gather() function, we can transform the data from wide to long. Here is the generic code for gather():
#gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
Gather() Arguments
Whoa! What does this all mean? Let’s find out more about the arguments of gather():
data: Your data frame.
key, value: The unquoted names of the new key and value columns to create in the output. The key will become the name of the condition/IV column, and value will become the name of the response/DV column.
...: The columns to gather. Use the existing variable names. Select a range of variables with : (e.g. if you have variables a, b, c, and d and want to select all of them, write a:d). To exclude a variable, use - (e.g. exclude y with -y).
na.rm: If TRUE, rows where the value is missing are removed from the output.
convert: If TRUE, the key column is automatically converted to logical, integer, numeric, complex, or factor as appropriate. This is useful if the column names are actually numeric, integer, or logical.
factor_key: If FALSE (the default), the key values are stored as a character vector. If TRUE, they are stored as a factor, which preserves the original ordering of the columns.
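To see a few of these arguments in action, here is a minimal sketch. The data frame toy is a hypothetical three-participant example with one missing response; it is not the study data.

```r
library(tidyr)

# Hypothetical data with one missing response (not the study data)
toy <- data.frame(
  ID = 1:3,
  Face.1 = c(411, 723, NA),
  Face.2 = c(123, 300, 400)
)

# Gather every column except ID, drop the row with the missing response,
# and store the key as a factor that keeps the original column order
toy_long <- gather(toy, Face, ResponseTime, -ID,
                   na.rm = TRUE, factor_key = TRUE)

nrow(toy_long)         # 5: six cells minus the one NA
levels(toy_long$Face)  # "Face.1" "Face.2"
```

Excluding ID with -ID gives the same selection here as writing Face.1:Face.2, and it is often shorter when only one or two columns should stay put.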
Using Gather()
Now that we have a better understanding of the arguments, let’s make our dataset long using gather()!
long <- wide %>% gather(Face, ResponseTime, Face.1:Face.3)
## ID Face ResponseTime
## 1 1 Face.1 411
## 2 2 Face.1 723
## 3 3 Face.1 325
## 4 4 Face.1 456
## 5 5 Face.1 579
## 6 6 Face.1 612
## 7 7 Face.1 709
## 8 8 Face.1 513
## 9 9 Face.1 527
## 10 10 Face.1 379
## 11 1 Face.2 123
## 12 2 Face.2 300
## 13 3 Face.2 400
## 14 4 Face.2 500
## 15 5 Face.2 600
## 16 6 Face.2 654
## 17 7 Face.2 789
## 18 8 Face.2 906
## 19 9 Face.2 413
## 20 10 Face.2 567
## 21 1 Face.3 1457
## 22 2 Face.3 1000
## 23 3 Face.3 569
## 24 4 Face.3 896
## 25 5 Face.3 956
## 26 6 Face.3 2345
## 27 7 Face.3 780
## 28 8 Face.3 599
## 29 9 Face.3 1023
## 30 10 Face.3 678
As you can see, we now have two columns in place of the three Face columns: one for the faces and one for response time. Each participant saw each face, so each ID appears three times.
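A quick way to confirm the new shape is to check the dimensions and count how often each ID appears. The sketch below rebuilds the example data so it runs on its own. The last part is an aside that goes beyond this chapter: it assumes tidyr 1.0.0 or later, where pivot_longer() offers the same reshaping with a different row order.

```r
library(tidyr)
library(dplyr)

wide <- data.frame(
  ID = 1:10,
  Face.1 = c(411, 723, 325, 456, 579, 612, 709, 513, 527, 379),
  Face.2 = c(123, 300, 400, 500, 600, 654, 789, 906, 413, 567),
  Face.3 = c(1457, 1000, 569, 896, 956, 2345, 780, 599, 1023, 678)
)

long <- wide %>% gather(Face, ResponseTime, Face.1:Face.3)

dim(long)                 # 30 rows, 3 columns
all(table(long$ID) == 3)  # every participant appears three times

# Aside (assumes tidyr >= 1.0.0): the same reshaping with pivot_longer()
long2 <- wide %>%
  pivot_longer(Face.1:Face.3, names_to = "Face", values_to = "ResponseTime")

# Rows come out in a different order, so sort both before comparing values
a <- long %>% arrange(ID, Face)
b <- long2 %>% arrange(ID, Face)
all(a$ResponseTime == b$ResponseTime)
```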
Separate()
Although the long dataset we created using gather() is acceptable for use, we can break down the Face variable even further using separate().
Here is the generic code for separate():
#separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn")
Separate() Arguments
What are the arguments unique to separate()?
col: Unquoted name of the column to be separated.
into: Names for the new variables that you are separating out from the column, given as a character vector.
sep: Separator between columns. If sep is a character string, it is interpreted as a regular expression. The default is a regular expression that matches any sequence of non-alphanumeric characters. In the example, each face is indicated by a number that follows a period (.). I do not need to specify sep because the default already matches the period in each level of Face. (Because sep is a regular expression, a literal period would have to be written as "\\.".) If sep is numeric, it is interpreted as the position to split at. Positive values start at 1 at the far left of the string; negative values start at -1 at the far right of the string. The length of sep should be one less than the length of into.
remove: If TRUE, the input column is removed from the output data frame.
extra: If sep is a character string, this controls what happens when there are too many pieces (e.g. Face.1.A, rather than Face.1). There are three valid options: "warn" (the default) emits a warning and drops extra values; "drop" drops any extra values without a warning; "merge" splits at most length(into) times.
fill: If sep is a character string, this controls what happens when there are not enough pieces (e.g. Face1, rather than Face.1). There are three valid options: "warn" (the default) emits a warning and fills from the right; "right" fills with missing values on the right; "left" fills with missing values on the left.
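The extra, fill, and numeric sep options are easier to follow with a small example. The labels below are hypothetical and deliberately irregular; they do not come from the study data.

```r
library(tidyr)

# Hypothetical labels: one regular, one with too many pieces, one with too few
labels <- data.frame(
  Face = c("Face.1", "Face.2.A", "Face3"),
  stringsAsFactors = FALSE
)

# Choose the behaviour explicitly instead of relying on the "warn" defaults
pieces <- separate(labels, Face, c("Target", "Number"),
                   extra = "merge", fill = "right")
pieces$Target  # "Face" "Face" "Face3"
pieces$Number  # "1" "2.A" NA

# A numeric sep splits by position: here, after the fourth character
by_position <- separate(labels[1, , drop = FALSE], Face,
                        c("Target", "Rest"), sep = 4)
by_position$Rest  # ".1"
```

With extra = "merge", the surplus piece stays attached to the last new column rather than being dropped, which is usually the safer choice when you are not sure how regular the labels are.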
Using Separate()
Each face is indicated by a number after a period. This naming convention allows us to separate the Face variable into two. By using the separate() function of tidyr, we can tease apart single variables that capture multiple variables (or sometimes redundant information).
In this case, I want to split the Face from the number attached to it, which in this example represents the race of the face.
long_separate <- long %>% separate(Face, c("Target", "Number"))
## ID Target Number ResponseTime
## 1 1 Face 1 411
## 2 2 Face 1 723
## 3 3 Face 1 325
## 4 4 Face 1 456
## 5 5 Face 1 579
## 6 6 Face 1 612
## 7 7 Face 1 709
## 8 8 Face 1 513
## 9 9 Face 1 527
## 10 10 Face 1 379
## 11 1 Face 2 123
## 12 2 Face 2 300
## 13 3 Face 2 400
## 14 4 Face 2 500
## 15 5 Face 2 600
## 16 6 Face 2 654
## 17 7 Face 2 789
## 18 8 Face 2 906
## 19 9 Face 2 413
## 20 10 Face 2 567
## 21 1 Face 3 1457
## 22 2 Face 3 1000
## 23 3 Face 3 569
## 24 4 Face 3 896
## 25 5 Face 3 956
## 26 6 Face 3 2345
## 27 7 Face 3 780
## 28 8 Face 3 599
## 29 9 Face 3 1023
## 30 10 Face 3 678
Now we have two columns in place of Face: one for Target, whose values are all “Face”, and one for Number, which indicates which of the three faces it is.
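Note that separate() leaves the new columns as character vectors unless told otherwise. The sketch below uses a small hypothetical data frame to show how convert = TRUE changes the type of the Number column.

```r
library(tidyr)

# Hypothetical three-row long data set (not the full study data)
toy_long <- data.frame(
  ID = 1:3,
  Face = c("Face.1", "Face.2", "Face.3"),
  ResponseTime = c(411, 123, 1457),
  stringsAsFactors = FALSE
)

as_text <- separate(toy_long, Face, c("Target", "Number"))
as_number <- separate(toy_long, Face, c("Target", "Number"), convert = TRUE)

class(as_text$Number)    # "character"
class(as_number$Number)  # "integer"
```

Whether to convert depends on the analysis: if the number is only a label for a condition, leaving it as text (or making it a factor) may be more appropriate.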
Unite()
To undo separate(), we can use unite(), which merges two variables into one.
Here is the generic code for unite():
#unite(data, col, ..., sep = "_", remove = TRUE)
Unite() Arguments
Here are the arguments unique to unite():
sep: The separator placed between the values being joined. The default is an underscore (_); in this case, we pass sep = "." so that the values are joined with a period, as in the original labels.
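The effect of sep and remove can be seen in a short sketch. The data frame parts is a hypothetical two-row example, not the study data.

```r
library(tidyr)

# Hypothetical pieces to be joined back together
parts <- data.frame(
  ID = 1:2,
  Target = c("Face", "Face"),
  Number = c("1", "2"),
  stringsAsFactors = FALSE
)

# Default separator is an underscore
with_default <- unite(parts, Face, Target, Number)
with_default$Face  # "Face_1" "Face_2"

# Use a period instead and keep the input columns alongside the new one
with_period <- unite(parts, Face, Target, Number, sep = ".", remove = FALSE)
with_period$Face   # "Face.1" "Face.2"
names(with_period) # "ID" "Face" "Target" "Number"
```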
Using Unite()
long_unite <- long_separate %>% unite(Face, Target, Number, sep = ".")
## ID Face ResponseTime
## 1 1 Face.1 411
## 2 2 Face.1 723
## 3 3 Face.1 325
## 4 4 Face.1 456
## 5 5 Face.1 579
## 6 6 Face.1 612
## 7 7 Face.1 709
## 8 8 Face.1 513
## 9 9 Face.1 527
## 10 10 Face.1 379
## 11 1 Face.2 123
## 12 2 Face.2 300
## 13 3 Face.2 400
## 14 4 Face.2 500
## 15 5 Face.2 600
## 16 6 Face.2 654
## 17 7 Face.2 789
## 18 8 Face.2 906
## 19 9 Face.2 413
## 20 10 Face.2 567
## 21 1 Face.3 1457
## 22 2 Face.3 1000
## 23 3 Face.3 569
## 24 4 Face.3 896
## 25 5 Face.3 956
## 26 6 Face.3 2345
## 27 7 Face.3 780
## 28 8 Face.3 599
## 29 9 Face.3 1023
## 30 10 Face.3 678
As you can see, the data now looks like it did when we first transformed it from wide to long using gather()!
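Rather than comparing the printed output by eye, we can check the round trip in code. This is a minimal sketch that rebuilds the example data so it runs on its own; the name round_trip is introduced here only for illustration.

```r
library(tidyr)
library(dplyr)

wide <- data.frame(
  ID = 1:10,
  Face.1 = c(411, 723, 325, 456, 579, 612, 709, 513, 527, 379),
  Face.2 = c(123, 300, 400, 500, 600, 654, 789, 906, 413, 567),
  Face.3 = c(1457, 1000, 569, 896, 956, 2345, 780, 599, 1023, 678)
)

long <- wide %>% gather(Face, ResponseTime, Face.1:Face.3)

# separate() followed by unite() should give back the same labels
round_trip <- long %>%
  separate(Face, c("Target", "Number")) %>%
  unite(Face, Target, Number, sep = ".")

identical(names(long), names(round_trip))
identical(long$Face, round_trip$Face)
identical(long$ResponseTime, round_trip$ResponseTime)
```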
Spread()
Finally, we will transform the data from long back to wide with the spread() function.
Here is the generic code for spread():
#spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
Spread() Arguments
The arguments unique to spread():
key: The unquoted name of the column whose values will be used as column headings.
value: The unquoted name of the column whose values will populate the cells.
fill: If set, missing values will be replaced with this value. There are two types of missing values in the input: explicit missing values (i.e. NA) and implicit missing values, which are rows that simply aren’t present. Both types will be replaced by fill.
convert: If TRUE, the new columns are automatically converted to logical, integer, numeric, complex, or factor as appropriate.
drop: If FALSE, factor levels that don’t appear in the data are kept, and missing combinations are filled with fill.
sep: If NULL, the column names are taken from the values of the key variable. If non-NULL, the column names are created by pasting together the key column’s name, the separator, and the value.
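The fill and sep arguments are easiest to understand on data with a gap. In the sketch below, toy_long is a hypothetical data set in which participant 2 has no row for face 2. The fill value of 0 is used only to show the mechanics; for real response times, NA is usually the honest choice.

```r
library(tidyr)

# Hypothetical long data: participant 2 has no row for face "2"
toy_long <- data.frame(
  ID = c(1, 1, 2),
  Number = c("1", "2", "1"),
  ResponseTime = c(411, 123, 723),
  stringsAsFactors = FALSE
)

# Default: the implicit missing cell becomes NA
with_na <- spread(toy_long, Number, ResponseTime)
is.na(with_na[2, "2"])  # TRUE

# fill replaces the missing cell; sep prefixes the key column's name
filled <- spread(toy_long, Number, ResponseTime, fill = 0, sep = "_")
names(filled)    # "ID" "Number_1" "Number_2"
filled$Number_2  # 123 0
```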
back_to_wide <- long_unite %>% spread(Face, ResponseTime)
## ID Face.1 Face.2 Face.3
## 1 1 411 123 1457
## 2 2 723 300 1000
## 3 3 325 400 569
## 4 4 456 500 896
## 5 5 579 600 956
## 6 6 612 654 2345
## 7 7 709 789 780
## 8 8 513 906 599
## 9 9 527 413 1023
## 10 10 379 567 678
And there we have it! We have come full circle back into wide.
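As a final check, the sketch below runs the whole round trip on the example data and confirms that the values we end with match the ones we started with. It repeats the earlier steps so that it can be run on its own.

```r
library(tidyr)
library(dplyr)

wide <- data.frame(
  ID = 1:10,
  Face.1 = c(411, 723, 325, 456, 579, 612, 709, 513, 527, 379),
  Face.2 = c(123, 300, 400, 500, 600, 654, 789, 906, 413, 567),
  Face.3 = c(1457, 1000, 569, 896, 956, 2345, 780, 599, 1023, 678)
)

# Wide to long and back again
back_to_wide <- wide %>%
  gather(Face, ResponseTime, Face.1:Face.3) %>%
  spread(Face, ResponseTime)

# Same column names and the same values in every Face column
identical(names(wide), names(back_to_wide))
all(back_to_wide$Face.1 == wide$Face.1)
all(back_to_wide$Face.2 == wide$Face.2)
all(back_to_wide$Face.3 == wide$Face.3)
```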
