1 What is plyr?

Plyr is an R package that makes it easy to split data apart, manipulate the pieces, and put them back together.

1.1 Why is it better than split and apply?

Plyr is easier to use than the base split and apply functions because all of its functions share one consistent syntax, and it is quicker to write because the input and output formats are stated in the function name, so less code is needed. Additionally, few of the apply functions accept a dataframe as input and return a dataframe as output, whereas plyr is mainly used to manipulate dataframes. Plyr can also summarize dataframes into new dataframes, which is useful when extracting values from large datasets. If you are not familiar with the split and apply method, please refer to Chapter 4.
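
As a rough illustration of the difference in code length, the sketch below computes a group mean twice, once with base R and once with plyr. The `scores` dataframe and its column names are hypothetical and exist only for this comparison.

```r
library(plyr)

# Hypothetical toy data: two groups with three values each
scores <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 10, 20, 30)
)

# Base R: split the data, apply a function to each piece, combine by hand
pieces <- split(scores, scores$group)
means_list <- lapply(pieces, function(piece) {
  data.frame(group = piece$group[1], mean.value = mean(piece$value))
})
base_result <- do.call(rbind, means_list)
rownames(base_result) <- NULL

# plyr: the same three steps in a single call
plyr_result <- ddply(scores, .(group), summarise, mean.value = mean(value))
```

Both results hold one row per group with the same means; the plyr version simply hides the split and combine steps.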

2 Plyr basics

2.1 Plyr setup and install

To start using plyr, install and load the plyr package.
All examples in this chapter are based on the Weather dataframe below. This dataset consists of 5 columns and 364 rows; the first three columns are the day of the month, the month, and the season that the day falls in. The last two columns are the average daily temperature and precipitation for each date (the precipitation column is named Percipitation in the code). If you are not familiar with generating dataframes, please refer to Chapter 1.

# install.packages("plyr")  # run once if plyr is not yet installed
  
library(plyr) 

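# Building the Weather dataframe
# Days per month as used in this example. They total 364, and the values for
# August through December each differ from the calendar by one day.
NJanuary <- 31
NFebruary <- 28
NMarch <- 31
NApril <- 30
NMay <- 31
NJune <- 30
NJuly <- 31
NAugust <- 30
NSeptember <- 31
NOctober <- 30
NNovember <- 31
NDecember <- 30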

# Day of the month, restarting at 1 for each month
Day <- c(seq_len(NJanuary), seq_len(NFebruary), seq_len(NMarch),
         seq_len(NApril), seq_len(NMay), seq_len(NJune),
         seq_len(NJuly), seq_len(NAugust), seq_len(NSeptember),
         seq_len(NOctober), seq_len(NNovember), seq_len(NDecember))

# Month name repeated once per day of that month
Month <- c(rep("January", NJanuary), rep("February", NFebruary),
           rep("March", NMarch), rep("April", NApril),
           rep("May", NMay), rep("June", NJune),
           rep("July", NJuly), rep("August", NAugust),
           rep("September", NSeptember), rep("October", NOctober),
           rep("November", NNovember), rep("December", NDecember))

# Season assigned by counting days from January 1; winter covers both ends of the year
Season <- c(rep("Winter", NJanuary + NFebruary + 20),
            rep("Spring", NMarch + NApril + NMay),
            rep("Summer", NJune + NJuly + NAugust + 2),
            rep("Fall", NSeptember + NOctober + NNovember - 1),
            rep("Winter", NDecember - 21))

set.seed(4)
Temperature<-c(rnorm(NJanuary,mean=21,sd=10.2),c(rnorm(NFebruary,mean=25,sd=8.6)),c(rnorm(NMarch,mean=37,sd=12.2)),c(rnorm(NApril,mean=50,sd=10.1)),c(rnorm(NMay,mean=59,sd=7)),c(rnorm(NJune,mean=70,sd=8.6)),c(rnorm(NJuly,mean=73,sd=11.4)),c(rnorm(NAugust,mean=72,sd=12.3)),c(rnorm(NSeptember,mean=64,sd=11.6)),c(rnorm(NOctober,mean=54,sd=7.8)),c(rnorm(NNovember,mean=42,sd=13.4)),c(rnorm(NDecember,mean=27,sd=8.6)))

set.seed(4)
Percipitation<-c(rnorm(NJanuary,mean=1.73,sd=.3),c(rnorm(NFebruary,mean=1.79,sd=.3)),c(rnorm(NMarch,mean=2.5,sd=1.4)),c(rnorm(NApril,mean=3.38,sd=.5)),c(rnorm(NMay,mean=3.68,sd=.9)),c(rnorm(NJune,mean=3.45,sd=1.2)),c(rnorm(NJuly,mean=3.7,sd=.8)),c(rnorm(NAugust,mean=4.9,sd=2.7)),c(rnorm(NSeptember,mean=3.21,sd=1.4)),c(rnorm(NOctober,mean=3.15,sd=.9)),c(rnorm(NNovember,mean=3.15,sd=1.3)),c(rnorm(NDecember,mean=2.24,sd=1.1)))

Weather<-data.frame(
  Day = Day,
  Month = Month,
  Season = Season,
  Temperature = Temperature,
  Percipitation = Percipitation
  )

head(Weather)
##   Day   Month Season Temperature Percipitation
## 1   1 January Winter    23.21090      1.795026
## 2   2 January Winter    15.46658      1.567252
## 3   3 January Winter    30.08968      1.997343
## 4   4 January Winter    27.07900      1.908794
## 5   5 January Winter    37.68330      2.220685
## 6   6 January Winter    28.03061      1.936783

2.2 Basic syntax

Every plyr function name ends in the base ply, preceded by two letters that indicate the format of the data. The first letter denotes the input format and the second letter specifies the output format. A plyr call then typically takes the data, the variable(s) to split by, and the function to apply to each piece. Below are the most common data formats for plyr:

  • d = dataframe
  • a = array/matrix
  • l = list

ddply is the most commonly used function. As the first two letters indicate, ddply takes an existing dataframe, splits it apart, extracts some information from each piece, and returns a dataframe. The result can be stored as a new dataframe or assigned back to the original name so that the original includes the new information.
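
To make the two-letter naming scheme concrete, here is a minimal sketch using three members of the family. The `scores` dataframe and the list passed to ldply are hypothetical inputs chosen for illustration.

```r
library(plyr)

# Hypothetical toy data
scores <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 30)
)

# d -> d: dataframe in, dataframe out
dd <- ddply(scores, .(group), summarise, total = sum(value))

# d -> l: dataframe in, list out (one element per group)
dl <- dlply(scores, .(group), function(piece) sum(piece$value))

# l -> d: list in, dataframe out (one row per list element)
ld <- ldply(list(a = 1:3, b = 4:6), function(x) data.frame(total = sum(x)))
```

Only the function name changes between the three calls; the order of the arguments stays the same.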

2.3 Example

Called without a function, ddply can be used to sort a dataframe by a specific column while keeping each row's values in the other columns together. The example below takes the Weather dataframe and generates a new dataframe, Temperature.dd, that is sorted by increasing temperature. This works because the function argument defaults to NULL, in which case the pieces are returned unchanged and combined in order of the splitting variable. The original Weather dataframe could also have been re-sorted by assigning the result to Weather instead of Temperature.dd.

Temperature.dd<-ddply(Weather, .(Temperature))

head(Temperature.dd)
##   Day    Month Season Temperature Percipitation
## 1   7  January Winter    7.931284     1.3456260
## 2  19 February Winter    9.542515     1.2507854
## 3  13 February Winter   10.482782     1.2835854
## 4  11 November   Fall   11.523199     0.1932954
## 5  29  January Winter   11.534113     1.4515916
## 6   9 December   Fall   11.986117     0.3196197
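
For comparison, the same kind of sort can be sketched with base R's order() and with plyr's arrange(). The `temps` dataframe here is a hypothetical stand-in for Weather, kept small so the result is easy to check by eye.

```r
library(plyr)

# Hypothetical toy data
temps <- data.frame(
  day = 1:4,
  temperature = c(30.5, 12.1, 25.0, 18.3)
)

# ddply with no function: rows come back ordered by the splitting variable
sorted_ddply <- ddply(temps, .(temperature))

# Base R equivalent: reorder the rows with order()
sorted_order <- temps[order(temps$temperature), ]

# plyr's arrange() does the same without repeating the dataframe name
sorted_arrange <- arrange(temps, temperature)
```

All three keep each day paired with its own temperature; when sorting is the only goal, order() or arrange() states the intent more directly.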

3 Common functions

The three functions most commonly used with plyr are transform, summarise, and mutate. transform comes with base R, while summarise and mutate are provided by plyr. Each is passed as the function argument of the plyr call.

3.1 Transform

Transform modifies an existing dataframe. This can be useful if you want to compute values for an additional column in the dataframe based on information already available.

3.1.1 Transform example

In this example we will add the average monthly temperature to the Weather dataframe as an additional variable. To do this, transform is passed as the function argument, followed by the new column name and the calculation that produces its values. Note that the new column will be called mean.month.

Weather.1<-ddply(Weather,.(Month), transform,
               mean.month=mean(Temperature))

head(Weather.1)
##   Day Month Season Temperature Percipitation mean.month
## 1   1 April Spring    37.09563      2.741170   50.66127
## 2   2 April Spring    41.94007      2.980994   50.66127
## 3   3 April Spring    51.60673      3.459541   50.66127
## 4   4 April Spring    56.20946      3.687399   50.66127
## 5   5 April Spring    56.94827      3.723974   50.66127
## 6   6 April Spring    49.52478      3.356474   50.66127
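
A new column can also combine row-level values with a group-level statistic. The sketch below, on a hypothetical `readings` dataframe, stores each reading's deviation from its own month's mean.

```r
library(plyr)

# Hypothetical toy data: two readings per month
readings <- data.frame(
  month = c("Jan", "Jan", "Feb", "Feb"),
  temperature = c(20, 30, 40, 60)
)

# mean(temperature) is computed within each month, then subtracted row by row
with_dev <- ddply(readings, .(month), transform,
                  deviation = temperature - mean(temperature))
```

The groups come back in alphabetical order of the splitting variable, so Feb precedes Jan here, just as April is the first month in the output above.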

3.2 Summarise

Summarise creates a new condensed dataframe that can summarise information about the existing dataframe. This is a similar tool to pivot tables in Excel except it allows you to summarise a wider range of variables. Please note that summarise is spelt with an "s".

3.2.1 Summarise example 1

In this example we will summarise the average monthly temperature, the average monthly precipitation, and the standard deviation of both into new columns. The format is very similar to transform: summarise is passed as the function argument, followed by the names of the new columns and the formulas needed to generate them.

Weather.summary<-ddply(Weather, .(Month), summarise,
                       avg.temp=mean(Temperature),
                       sd.temp=sd(Temperature),
                       avg.perc=mean(Percipitation),
                       sd.perc=sd(Percipitation)
                       )

head(Weather.summary)
##      Month avg.temp   sd.temp avg.perc   sd.perc
## 1    April 50.66127  9.717198 3.412736 0.4810494
## 2   August 71.65274 10.372217 4.823772 2.2768282
## 3 December 27.09597  6.941381 2.252275 0.8878510
## 4 February 24.20208  8.082064 1.762166 0.2819325
## 5  January 25.85950  8.122852 1.872927 0.2389074
## 6     July 72.94086 11.973996 3.695850 0.8402804

3.2.2 Summarise example 2

The next example summarises the average temperature and precipitation per season. The code is very similar to the first summarise example, except that the variable the rows are grouped by is changed to Season.

Weather.summary2<-ddply(Weather, .(Season), summarise,
                       avg.temp=mean(Temperature),
                       avg.perc=mean(Percipitation)
                       )

head(Weather.summary2)
##   Season avg.temp avg.perc
## 1   Fall 44.35078 2.912997
## 2 Spring 55.08098 3.285695
## 3 Summer 69.66207 3.858623
## 4 Winter 28.25241 2.050919
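
The splitting variable does not have to be a single column. As a sketch, assuming a hypothetical `obs` dataframe with season and month columns, listing both inside .() produces one summary row per combination that occurs in the data.

```r
library(plyr)

# Hypothetical toy data
obs <- data.frame(
  season = c("Winter", "Winter", "Winter", "Spring", "Spring"),
  month = c("January", "January", "February", "March", "March"),
  temperature = c(20, 24, 30, 40, 50)
)

# One row per season/month combination, with a count and a mean
by_both <- ddply(obs, .(season, month), summarise,
                 n.days = length(temperature),
                 avg.temp = mean(temperature))
```

Applied to Weather, the same pattern would separate, for example, the March days that fall in winter from those that fall in spring.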

3.3 Mutate

Mutate is a very powerful tool that is similar to transform and summarise, but it allows variables computed within the command to be used immediately to compute other variables, all of which can be incorporated back into the original data frame.

3.3.1 Mutate example

This example creates columns holding the average monthly temperature and precipitation and the standard deviation of each. The newly calculated standard deviations are then used to generate columns holding the standard errors.

Weather<-ddply(Weather, .(Month), mutate,
                      avg.temp=mean(Temperature),
                      sd.temp=sd(Temperature),
                      se.temp=sd.temp/sqrt(length(Month)),
                      avg.perc=mean(Percipitation),
                      sd.perc=sd(Percipitation),
                      se.perc=sd.perc/sqrt(length(Month))
                       )

head(Weather)
##   Day Month Season Temperature Percipitation avg.temp  sd.temp se.temp
## 1   1 April Spring    37.09563      2.741170 50.66127 9.717198 1.77411
## 2   2 April Spring    41.94007      2.980994 50.66127 9.717198 1.77411
## 3   3 April Spring    51.60673      3.459541 50.66127 9.717198 1.77411
## 4   4 April Spring    56.20946      3.687399 50.66127 9.717198 1.77411
## 5   5 April Spring    56.94827      3.723974 50.66127 9.717198 1.77411
## 6   6 April Spring    49.52478      3.356474 50.66127 9.717198 1.77411
##   avg.perc   sd.perc    se.perc
## 1 3.412736 0.4810494 0.08782721
## 2 3.412736 0.4810494 0.08782721
## 3 3.412736 0.4810494 0.08782721
## 4 3.412736 0.4810494 0.08782721
## 5 3.412736 0.4810494 0.08782721
## 6 3.412736 0.4810494 0.08782721
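
The practical difference between mutate and transform shows up when one new column depends on another. The sketch below uses a hypothetical `readings` dataframe and wraps the transform call in tryCatch, on the expectation that it fails because grp.mean does not exist yet when dev is evaluated.

```r
library(plyr)

# Hypothetical toy data: two readings per month
readings <- data.frame(
  month = c("Jan", "Jan", "Feb", "Feb"),
  temperature = c(20, 30, 40, 60)
)

# mutate evaluates the new columns in order, so dev can refer to grp.mean
with_mutate <- ddply(readings, .(month), mutate,
                     grp.mean = mean(temperature),
                     dev = temperature - grp.mean)

# transform evaluates every new column against the original columns only,
# so the same call is expected to stop with an "object not found" error;
# tryCatch stores the error message instead of halting the script
transform_attempt <- tryCatch(
  ddply(readings, .(month), transform,
        grp.mean = mean(temperature),
        dev = temperature - grp.mean),
  error = function(e) conditionMessage(e)
)
```

With transform, the usual workaround is two separate calls, one per dependent column, which is the extra step mutate saves.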


