1 Introduction

As a Social Psychologist first trained in SPSS, I am used to collecting and organizing my data in wide format. When data is in wide format, a subject’s responses are in a single row, and each response is in a separate column. However, R prefers long format. We could melt and cast with reshape2 to reshape from wide to long format, but is there a way to reshape using even less code? Luckily for us, Hadley Wickham has created the easy-to-use tidyr!

tidyr allows us to quickly and easily tidy and reorganize our data for all sorts of analyses. This is particularly helpful with a disorganized dataset. tidyr is built specifically for this purpose, and thus does less than reshape2. Specifically, tidyr works only with existing data frames and cannot aggregate.

In this chapter, I will go over the hallmark functions of tidyr: gather(), separate(), unite(), and spread().

First let’s install and call up the tidyr package. We will also need to use the dplyr package.

#install.packages("tidyr") # I have used "#" to "comment out" this line for this tutorial. Just take away the first "#" and you are good to go!
#install.packages("dplyr")

library(tidyr)
library(dplyr)

1.1 %>%

Why do we need dplyr? dplyr is a grammar of data manipulation. We need dplyr to use the pipe operator, %>%, in our code. %>% is not required to use tidyr, but it does make things easier!

%>% allows you to pipe a value forward into an expression or function call, so that you write x %>% f instead of f(x). This shorthand was created by Stefan Milton Bache in the magrittr package. To read more about this operator, see the magrittr vignette listed in the References.
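
As a minimal sketch of the idea (the vector x below is made up for illustration and is not part of this tutorial's data), the piped and nested forms give the same result:

```r
library(dplyr)  # attaching dplyr makes %>% available

x <- c(4, 9, 16)  # hypothetical example values

nested <- sqrt(x)       # the usual f(x) form
piped  <- x %>% sqrt()  # the piped x %>% f form

identical(nested, piped)  # both forms compute the same thing

# Pipes read left to right when chained: take x, then sqrt(), then sum()
total <- x %>% sqrt() %>% sum()  # same as sum(sqrt(x))
```

The chained version is where the pipe tends to pay off: each step is read in the order it happens, rather than from the inside out.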

1.2 The Dataframe

Here I have created a messy wide dataset. Feel free to use it to follow along!

In this example study, participants were asked to categorize three faces by clicking various buttons that represent three different categories. The time it took to click a button is in milliseconds.

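n <- 10  # number of participants
wide <- data.frame(
  ID = 1:n,  # participant identifier
  # response times in milliseconds, one column per face
  Face.1 = c(411,723,325,456,579,612,709,513,527,379),
  Face.2 = c(123,300,400,500,600,654,789,906,413,567),
  Face.3 = c(1457,1000,569,896,956,2345,780,599,1023,678)
)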

This dataset is messy. As you can see below, only ID has its own column; response time is split across three columns, so responses are spread across both rows and columns (by ID and by Face.1, Face.2, and Face.3).

What we want instead is one column for the condition (Face.1, Face.2, or Face.3) and one column for response time, with each row being a single observation from one participant. Participant IDs should repeat because this is a within-subject design (each participant saw each face).

##    ID Face.1 Face.2 Face.3
## 1   1    411    123   1457
## 2   2    723    300   1000
## 3   3    325    400    569
## 4   4    456    500    896
## 5   5    579    600    956
## 6   6    612    654   2345
## 7   7    709    789    780
## 8   8    513    906    599
## 9   9    527    413   1023
## 10 10    379    567    678

2 Gather()

By using the gather() function, we can transform the data from wide to long. Here is the generic code for gather():

#gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)

2.1 Gather() Arguments

Whoa! What does this all mean? Let’s find out more about the arguments of gather():

  • data: Your data frame.

  • key, value: The unquoted new names of key and value columns to create in the output. The key will become the name of the condition/IV column, and value will become the name of the response/DV column.

  • ...: The columns to gather. Use the existing variable names. Select a range of variables with : (e.g. if you have variables a, b, c, and d, and want to select all of these variables, you indicate this with a:d). If you want to exclude a variable, use - (e.g. exclude y with -y).

  • na.rm: If you indicate that na.rm=TRUE, this will remove rows from the output where the value is missing.

  • convert: If TRUE this will automatically convert the key column to a logical, integer, numeric, complex, or factor as appropriate. This is useful if the column names are actually numeric, integer, or logical.

  • factor_key: If FALSE, the default, the key values will be stored as a character vector. If TRUE, will be stored as a factor, which preserves the original ordering of the columns.
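
To make these arguments concrete, here is a small sketch on a made-up data frame. The names toy, Condition, and Score are my own and are used only for illustration; they are not part of the tutorial's dataset.

```r
library(tidyr)

# Hypothetical toy data: two participants, two conditions, one missing score
toy <- data.frame(ID = 1:2, a = c(10, NA), b = c(30, 40))

# -ID gathers every column except ID
g_all <- gather(toy, key = Condition, value = Score, -ID)

# a:b selects the same columns as a range; na.rm = TRUE drops the row
# whose Score is missing
g_complete <- gather(toy, Condition, Score, a:b, na.rm = TRUE)

# factor_key = TRUE stores the key as a factor whose levels follow the
# original column order
g_factor <- gather(toy, Condition, Score, a:b, factor_key = TRUE)
levels(g_factor$Condition)
```

With the default factor_key = FALSE, Condition is a plain character column, which is usually fine until you need the conditions to keep a specific order in plots or models.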

2.2 Using Gather()

Now that we have a better understanding of the arguments, let’s make our dataset long using gather()!

long <- wide %>% gather(Face, ResponseTime, Face.1:Face.3)
##    ID   Face ResponseTime
## 1   1 Face.1          411
## 2   2 Face.1          723
## 3   3 Face.1          325
## 4   4 Face.1          456
## 5   5 Face.1          579
## 6   6 Face.1          612
## 7   7 Face.1          709
## 8   8 Face.1          513
## 9   9 Face.1          527
## 10 10 Face.1          379
## 11  1 Face.2          123
## 12  2 Face.2          300
## 13  3 Face.2          400
## 14  4 Face.2          500
## 15  5 Face.2          600
## 16  6 Face.2          654
## 17  7 Face.2          789
## 18  8 Face.2          906
## 19  9 Face.2          413
## 20 10 Face.2          567
## 21  1 Face.3         1457
## 22  2 Face.3         1000
## 23  3 Face.3          569
## 24  4 Face.3          896
## 25  5 Face.3          956
## 26  6 Face.3         2345
## 27  7 Face.3          780
## 28  8 Face.3          599
## 29  9 Face.3         1023
## 30 10 Face.3          678

As you can see, we now have two new columns: one for the face and one for response time. Each participant saw each face, so each ID appears three times.
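
One payoff of the long format is that grouped summaries become a single pipeline. The sketch below rebuilds the example data so that it runs on its own, then computes the mean response time per face. The names means and MeanRT are my own choices for illustration.

```r
library(tidyr)
library(dplyr)

# Same data as the tutorial's wide data frame
wide <- data.frame(
  ID = 1:10,
  Face.1 = c(411, 723, 325, 456, 579, 612, 709, 513, 527, 379),
  Face.2 = c(123, 300, 400, 500, 600, 654, 789, 906, 413, 567),
  Face.3 = c(1457, 1000, 569, 896, 956, 2345, 780, 599, 1023, 678)
)

long <- wide %>% gather(Face, ResponseTime, Face.1:Face.3)

nrow(long)      # 10 participants x 3 faces = 30 rows
table(long$ID)  # each ID should appear three times

# Mean response time for each face, one row per face
means <- long %>%
  group_by(Face) %>%
  summarise(MeanRT = mean(ResponseTime))
means
```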

3 Separate()

Although the long dataset we created using gather() is acceptable for use, we can break down the face variable even further using separate().

Here is the generic code for separate():

#separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn")

3.1 Separate() Arguments

What are the arguments unique to separate()?

  • col: Unquoted name of the column to be separated.

  • into: Names for the new variables that you are separating out from the column.

  • sep: Separator between columns. If the separator is a character string, it is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric characters. In the example, each face is indicated by a number that follows a period (.). I do not need to specify sep here because the default pattern already matches the period in every level of Face. If sep is numeric, it is interpreted as the position to split at. Positive values start at 1 at the far left of the string; negative values start at -1 at the far right of the string. The length of sep should be one less than the length of into.

  • remove: If this is TRUE, it removes the input column from output data frame.

  • extra: If sep is a character vector (like .), this controls what happens when there are too many pieces (e.g. Face.1.A, rather than Face.1). There are three valid options: “warn” (the default) emits a warning and drops extra values; “drop” drops any extra values without a warning; “merge” splits at most length(into) times.

  • fill: If sep is a character vector (like .), this controls what happens when there are not enough pieces (e.g. Face1, rather than Face.1). There are three valid options: “warn” (the default) emits a warning and fills from the right; “right” fills with missing values on the right; “left” fills with missing values on the left.
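
The extra, fill, numeric sep, and convert arguments are easier to see on deliberately irregular labels. This is only a sketch: the data frames messy and fixed_width and their Label column are hypothetical, made up to trigger each case.

```r
library(tidyr)

# Hypothetical labels: one well-formed, one with an extra piece, one with no separator
messy <- data.frame(Label = c("Face.1", "Face.2.A", "Face3"),
                    stringsAsFactors = FALSE)

# extra = "merge" keeps "2.A" together; fill = "right" puts NA in Number for "Face3"
merged <- separate(messy, Label, c("Target", "Number"),
                   extra = "merge", fill = "right")

# extra = "drop" silently discards the extra "A" piece instead
dropped <- separate(messy, Label, c("Target", "Number"),
                    extra = "drop", fill = "right")

# A numeric sep splits by position: after the 4th character here.
# convert = TRUE then turns the new Number column into an integer.
fixed_width <- data.frame(Label = c("Face1", "Face2"), stringsAsFactors = FALSE)
by_position <- separate(fixed_width, Label, c("Target", "Number"),
                        sep = 4, convert = TRUE)
```

Setting extra and fill explicitly, as above, also avoids the warnings that the default "warn" setting would emit for the irregular rows.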

3.2 Using Separate()

Each face is indicated by a number after a period. This naming scheme allows us to separate the Face variable into two. By using the separate() function of tidyr, we can tease apart single variables that capture multiple variables (or sometimes redundant information).

In this case, I want to split the Face from the number attached to it, which in this example represents the race of the face.

long_separate <- long %>% separate(Face, c("Target", "Number"))
##    ID Target Number ResponseTime
## 1   1   Face      1          411
## 2   2   Face      1          723
## 3   3   Face      1          325
## 4   4   Face      1          456
## 5   5   Face      1          579
## 6   6   Face      1          612
## 7   7   Face      1          709
## 8   8   Face      1          513
## 9   9   Face      1          527
## 10 10   Face      1          379
## 11  1   Face      2          123
## 12  2   Face      2          300
## 13  3   Face      2          400
## 14  4   Face      2          500
## 15  5   Face      2          600
## 16  6   Face      2          654
## 17  7   Face      2          789
## 18  8   Face      2          906
## 19  9   Face      2          413
## 20 10   Face      2          567
## 21  1   Face      3         1457
## 22  2   Face      3         1000
## 23  3   Face      3          569
## 24  4   Face      3          896
## 25  5   Face      3          956
## 26  6   Face      3         2345
## 27  7   Face      3          780
## 28  8   Face      3          599
## 29  9   Face      3         1023
## 30 10   Face      3          678

Now we have two columns: one for Target, the values of which are all “Face”, and one for Number, which indicates which of the three faces it is.

4 Unite()

To undo separate(), we can use unite(), which merges two variables into one.

Here is the generic code for unite():

#unite(data, col, ..., sep = "_", remove = TRUE)

4.1 Unite() Arguments

Here are the arguments unique to unite():

  • sep: The separator placed between the values being joined. The default is an underscore (_). In this case, we pass sep = "." so that the united values match the original labels.

4.2 Using Unite()

long_unite <- long_separate %>% unite(Face, Target, Number, sep = ".")
##    ID   Face ResponseTime
## 1   1 Face.1          411
## 2   2 Face.1          723
## 3   3 Face.1          325
## 4   4 Face.1          456
## 5   5 Face.1          579
## 6   6 Face.1          612
## 7   7 Face.1          709
## 8   8 Face.1          513
## 9   9 Face.1          527
## 10 10 Face.1          379
## 11  1 Face.2          123
## 12  2 Face.2          300
## 13  3 Face.2          400
## 14  4 Face.2          500
## 15  5 Face.2          600
## 16  6 Face.2          654
## 17  7 Face.2          789
## 18  8 Face.2          906
## 19  9 Face.2          413
## 20 10 Face.2          567
## 21  1 Face.3         1457
## 22  2 Face.3         1000
## 23  3 Face.3          569
## 24  4 Face.3          896
## 25  5 Face.3          956
## 26  6 Face.3         2345
## 27  7 Face.3          780
## 28  8 Face.3          599
## 29  9 Face.3         1023
## 30 10 Face.3          678

As you can see, the data now looks like it did when we first transformed it from wide to long using gather()!
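
As a small sketch of how sep and remove behave, here is unite() on a made-up two-row data frame (parts is a hypothetical name used only for illustration):

```r
library(tidyr)

# Hypothetical pieces to join back together
parts <- data.frame(Target = c("Face", "Face"),
                    Number = c("1", "2"),
                    stringsAsFactors = FALSE)

# With no sep supplied, unite() joins the values with an underscore
with_default <- unite(parts, Face, Target, Number)
with_default$Face

# sep = "." rebuilds the original labels; remove = FALSE keeps the
# input columns alongside the new one
with_period <- unite(parts, Face, Target, Number, sep = ".", remove = FALSE)
names(with_period)
```

This is why the tutorial's call spells out sep = ".": leaving it out would produce labels with underscores, which would not match the original column names.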

5 Spread()

Finally, we will transform the data from long back to wide with the spread() function.

Here is the generic code for spread():

#spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

5.1 Spread() Arguments

Here are the arguments unique to spread():

  • key: The unquoted name of the column whose values will be used as column headings.

  • value: The unquoted name of the column whose values will populate the cells.

  • fill: If used, missing values will be replaced with this value. There are two types of missing in the input: explicit missing values (i.e. NA), and implicit missings, rows that simply aren’t present. Both types of missing value will be replaced by fill.

  • convert: If TRUE, this will automatically convert the new columns to a logical, integer, numeric, complex, or factor as appropriate.

  • drop: If FALSE, will keep factor levels that don’t appear in the data, filling in missing combinations with fill.

  • sep: If NULL, the column names will be taken from the values of key variable. If non-NULL, the column names will be created by stringing together the name, separator, and value.
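
The fill and sep arguments are easiest to see when a combination is missing from the data. The sketch below uses a made-up data frame (toy, with hypothetical columns Face and RT) in which participant 2 has no row for face B:

```r
library(tidyr)

# Hypothetical long data with an implicit missing value: no row for ID 2, face B
toy <- data.frame(ID = c(1, 1, 2),
                  Face = c("A", "B", "A"),
                  RT = c(400, 500, 450),
                  stringsAsFactors = FALSE)

# By default the missing combination becomes NA
with_na <- spread(toy, Face, RT)

# fill = 0 replaces the missing combination with 0 instead
with_zero <- spread(toy, Face, RT, fill = 0)

# A non-NULL sep builds column names from the key name, separator, and value
with_prefix <- spread(toy, Face, RT, sep = "_")
names(with_prefix)
```

Whether 0 is a sensible fill value depends on the measure; for response times, NA is usually the more honest choice.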

5.2 Using Spread()

back_to_wide <- long_unite %>% spread(Face, ResponseTime)
##    ID Face.1 Face.2 Face.3
## 1   1    411    123   1457
## 2   2    723    300   1000
## 3   3    325    400    569
## 4   4    456    500    896
## 5   5    579    600    956
## 6   6    612    654   2345
## 7   7    709    789    780
## 8   8    513    906    599
## 9   9    527    413   1023
## 10 10    379    567    678

And there we have it! We have come full circle back into wide.
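
If you would like to check the round trip yourself, here is one possible sketch that chains all four functions and compares the result with the starting data. To keep it short, it uses only the first three participants; the names wide_small, round_trip, same_names, and same_values are my own.

```r
library(tidyr)
library(dplyr)

# First three participants from the tutorial's data
wide_small <- data.frame(
  ID = 1:3,
  Face.1 = c(411, 723, 325),
  Face.2 = c(123, 300, 400),
  Face.3 = c(1457, 1000, 569)
)

round_trip <- wide_small %>%
  gather(Face, ResponseTime, Face.1:Face.3) %>%  # wide to long
  separate(Face, c("Target", "Number")) %>%      # "Face.1" becomes "Face" and "1"
  unite(Face, Target, Number, sep = ".") %>%     # rebuild "Face.1"
  spread(Face, ResponseTime)                     # long back to wide

# Compare column names, then values column by column
same_names <- identical(names(round_trip), names(wide_small))
same_values <- all(sapply(names(wide_small), function(v) {
  all(round_trip[[v]] == wide_small[[v]])
}))
```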

6 References

https://priceonomics.com/hadley-wickham-the-man-who-revolutionized-r/ # Who is this Hadley Wickham guy?

https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html # Further reading about %>%

https://blog.rstudio.org/2014/07/22/introducing-tidyr/ # Helpful overview of tidyr

http://ademos.people.uic.edu/Chapter8.html # Tim Carsel’s chapter on reshape2

https://cran.r-project.org/web/packages/tidyr/tidyr.pdf # A full guide to tidyr and all the arguments for each function of the package

http://garrettgman.github.io/tidying/ # Different types of messy data and how to fix with tidyr

