Plyr is an R package that makes it easy to split data apart, manipulate it, and put it back together.
Compared with the base apply functions, plyr is easier to use because all of its functions share a common syntax, and it is quicker to write because the input and output formats are stated in the function name, which requires less code. Additionally, few of the apply functions accept a dataframe as input and return one as output, whereas plyr is used mainly to manipulate dataframes. Plyr can also summarise dataframes into new dataframes, which is useful when extracting values from large datasets. If you are not familiar with the split and apply method, please refer to Chapter 4.
To start using plyr, install the package and load it with library().
All examples in this chapter are based on the Weather dataframe built below. The dataset has 5 columns and 364 rows. The first three columns give the day of the month, the month, and the season that the day falls in. The last two columns give the average daily temperature and precipitation for each date (the precipitation column is spelled Percipitation in the code). If you are not familiar with generating dataframes, please refer to Chapter 1.
## install.packages("plyr")
library(plyr)
#Building the Weather dataframe
#Number of days assigned to each month (364 in total; some lengths differ from the calendar)
NJanuary = 31
NFebruary = 28
NMarch = 31
NApril = 30
NMay = 31
NJune = 30
NJuly = 31
NAugust = 30
NSeptember = 31
NOctober = 30
NNovember = 31
NDecember = 30
#Day of the month, restarting at 1 for each month
Day<-c(seq_len(NJanuary),seq_len(NFebruary),seq_len(NMarch),seq_len(NApril),
       seq_len(NMay),seq_len(NJune),seq_len(NJuly),seq_len(NAugust),
       seq_len(NSeptember),seq_len(NOctober),seq_len(NNovember),seq_len(NDecember))
#Month name repeated once per day
Month<-c(rep("January",NJanuary),rep("February",NFebruary),rep("March",NMarch),
         rep("April",NApril),rep("May",NMay),rep("June",NJune),
         rep("July",NJuly),rep("August",NAugust),rep("September",NSeptember),
         rep("October",NOctober),rep("November",NNovember),rep("December",NDecember))
#Seasons are assigned by day counts, so they change partway through March, June, September, and December
Season<-c(rep("Winter",NJanuary+NFebruary+20),
          rep("Spring",NMarch+NApril+NMay),
          rep("Summer",NJune+NJuly+NAugust+2),
          rep("Fall",NSeptember+NOctober+NNovember-1),
          rep("Winter",NDecember-21))
set.seed(4)
Temperature<-c(rnorm(NJanuary,mean=21,sd=10.2),c(rnorm(NFebruary,mean=25,sd=8.6)),c(rnorm(NMarch,mean=37,sd=12.2)),c(rnorm(NApril,mean=50,sd=10.1)),c(rnorm(NMay,mean=59,sd=7)),c(rnorm(NJune,mean=70,sd=8.6)),c(rnorm(NJuly,mean=73,sd=11.4)),c(rnorm(NAugust,mean=72,sd=12.3)),c(rnorm(NSeptember,mean=64,sd=11.6)),c(rnorm(NOctober,mean=54,sd=7.8)),c(rnorm(NNovember,mean=42,sd=13.4)),c(rnorm(NDecember,mean=27,sd=8.6)))
set.seed(4)
Percipitation<-c(rnorm(NJanuary,mean=1.73,sd=.3),c(rnorm(NFebruary,mean=1.79,sd=.3)),c(rnorm(NMarch,mean=2.5,sd=1.4)),c(rnorm(NApril,mean=3.38,sd=.5)),c(rnorm(NMay,mean=3.68,sd=.9)),c(rnorm(NJune,mean=3.45,sd=1.2)),c(rnorm(NJuly,mean=3.7,sd=.8)),c(rnorm(NAugust,mean=4.9,sd=2.7)),c(rnorm(NSeptember,mean=3.21,sd=1.4)),c(rnorm(NOctober,mean=3.15,sd=.9)),c(rnorm(NNovember,mean=3.15,sd=1.3)),c(rnorm(NDecember,mean=2.24,sd=1.1)))
Weather<-data.frame(
Day = Day,
Month = Month,
Season = Season,
Temperature = Temperature,
Percipitation = Percipitation
)
head(Weather)
## Day Month Season Temperature Percipitation
## 1 1 January Winter 23.21090 1.795026
## 2 2 January Winter 15.46658 1.567252
## 3 3 January Winter 30.08968 1.997343
## 4 4 January Winter 27.07900 1.908794
## 5 5 January Winter 37.68330 2.220685
## 6 6 January Winter 28.03061 1.936783
Plyr functions share the base name ply, preceded by two letters that indicate the format of the data. The first letter denotes the input format and the second letter denotes the output format. After the function name, each call specifies the data, the variable(s) to split by, and the function to apply. The most common data formats for plyr are d (dataframe), l (list), and a (array), so ddply, for example, takes a dataframe and returns a dataframe.
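The naming convention can be illustrated with a short sketch. The toy dataframe and its column names (group, value) are made up for illustration and are not part of the Weather example; the sketch only assumes that plyr is installed.

```r
library(plyr)

# Hypothetical toy data, not the Weather dataframe used in this chapter
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# d -> d: dataframe in, dataframe out
dd <- ddply(toy, .(group), summarise, total = sum(value))

# d -> l: dataframe in, list out (one element per group)
dl <- dlply(toy, .(group), function(piece) sum(piece$value))

# l -> d: list in, dataframe out
ld <- ldply(list(a = 1:3, b = 4:6), function(x) data.frame(total = sum(x)))

is.data.frame(dd)
is.list(dl)
is.data.frame(ld)
```

Only the two letters in front of ply change between the three calls; the order of the arguments (data, splitting variable where one applies, function) stays the same.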
ddply is the most commonly used format. As the first two letters indicate, ddply takes an existing dataframe, splits it apart, extracts some information from each piece, and returns a new dataframe. The result can be stored under a new name or assigned back to the original name so that the original dataframe includes the new information.
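To make the split, apply, and combine steps concrete, the sketch below performs them by hand with base R and then repeats the calculation with a single ddply call. The toy dataframe is hypothetical and is used only to keep the output small.

```r
library(plyr)

# Hypothetical toy data, not the Weather dataframe used in this chapter
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# Split: one dataframe per group
pieces <- split(toy, toy$group)

# Apply: compute a one-row summary for each piece
summaries <- lapply(pieces, function(piece) {
  data.frame(group = piece$group[1], avg = mean(piece$value))
})

# Combine: stack the summaries back into a single dataframe
by.hand <- do.call(rbind, summaries)

# The same three steps in one ddply call
with.plyr <- ddply(toy, .(group), summarise, avg = mean(value))

by.hand
with.plyr
```

Both results hold one row per group with the same averages; ddply simply hides the intermediate list of pieces.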
ddply on its own can be used to sort a dataframe by a specific column without losing the corresponding values in the other columns. The example below takes the Weather dataframe and generates a new dataframe, Temperature.dd, that is sorted by increasing temperature. No function is supplied, so the default (NULL) is used and the pieces are put back together unchanged, ordered by the splitting variable. The original Weather dataframe could instead have been re-sorted by assigning the result to Weather rather than to Temperature.dd.
Temperature.dd<-ddply(Weather, .(Temperature))
head(Temperature.dd)
## Day Month Season Temperature Percipitation
## 1 7 January Winter 7.931284 1.3456260
## 2 19 February Winter 9.542515 1.2507854
## 3 13 February Winter 10.482782 1.2835854
## 4 11 November Fall 11.523199 0.1932954
## 5 29 January Winter 11.534113 1.4515916
## 6 9 December Fall 11.986117 0.3196197
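Sorting through ddply is a side effect of splitting. As a point of comparison, the sketch below sorts a hypothetical toy dataframe three ways: with ddply and no function, with plyr's arrange(), and with base R's order(). The toy data and the object names are assumptions made for illustration.

```r
library(plyr)

# Hypothetical toy data, not the Weather dataframe used in this chapter
toy <- data.frame(
  id = c("x", "y", "z"),
  score = c(2.5, 0.5, 1.5)
)

# Split on score with no function: the pieces come back in sorted order
sorted.ddply <- ddply(toy, .(score))

# Explicit sort with plyr's arrange()
sorted.arrange <- arrange(toy, score)

# Explicit sort with base R
sorted.base <- toy[order(toy$score), ]

sorted.ddply
sorted.arrange
```

All three give the rows in order of increasing score. The explicit forms state the intent more directly, and arrange() also accepts desc(score) for a decreasing sort.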
The three functions most commonly used with plyr are transform (from base R), summarise, and mutate (both provided by plyr). They are passed as the function argument of the plyr command.
Transform modifies an existing dataframe. This can be useful if you want to compute values for an additional column in the dataframe based on information already available.
In this example we will add the average monthly temperature to the Weather dataframe as an additional variable. To do this, transform is passed as the function, followed by the name of the new column and the calculation that generates it. Note that the new column will be called mean.month.
Weather.1<-ddply(Weather,.(Month), transform,
mean.month=mean(Temperature))
head(Weather.1)
## Day Month Season Temperature Percipitation mean.month
## 1 1 April Spring 37.09563 2.741170 50.66127
## 2 2 April Spring 41.94007 2.980994 50.66127
## 3 3 April Spring 51.60673 3.459541 50.66127
## 4 4 April Spring 56.20946 3.687399 50.66127
## 5 5 April Spring 56.94827 3.723974 50.66127
## 6 6 April Spring 49.52478 3.356474 50.66127
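Because transform keeps every row, the new column can hold any per-group calculation, not only a repeated group mean. The sketch below, on a hypothetical toy dataframe, adds the group mean and each row's deviation from it; the column names are made up for illustration.

```r
library(plyr)

# Hypothetical toy data, not the Weather dataframe used in this chapter
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# Every input row is kept; both new columns are computed within each group
with.stats <- ddply(toy, .(group), transform,
                    group.mean = mean(value),
                    deviation  = value - mean(value))

with.stats
```

The result has the same number of rows as the input, with the group mean repeated on each row of its group.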
Summarise creates a new, condensed dataframe that summarises information about the existing dataframe. This is similar to pivot tables in Excel, except that it allows you to summarise a wider range of variables. Please note that summarise is spelt with an "s".
In this example we will summarise the average monthly temperature, the average monthly precipitation, and the standard deviation of both into new columns. The format is very similar to transform: summarise is passed as the function, followed by the names of the new columns and the formulas needed to generate them.
Weather.summary<-ddply(Weather, .(Month), summarise,
avg.temp=mean(Temperature),
sd.temp=sd(Temperature),
avg.perc=mean(Percipitation),
sd.perc=sd(Percipitation)
)
head(Weather.summary)
## Month avg.temp sd.temp avg.perc sd.perc
## 1 April 50.66127 9.717198 3.412736 0.4810494
## 2 August 71.65274 10.372217 4.823772 2.2768282
## 3 December 27.09597 6.941381 2.252275 0.8878510
## 4 February 24.20208 8.082064 1.762166 0.2819325
## 5 January 25.85950 8.122852 1.872927 0.2389074
## 6 July 72.94086 11.973996 3.695850 0.8402804
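The months in the summary above appear in alphabetical order because Month is split as text. One way to get calendar order, sketched below on a hypothetical toy dataframe, is to convert the column to a factor whose levels follow base R's month.name constant before calling ddply. This is an optional extra step, not part of the chapter's own code.

```r
library(plyr)

# Hypothetical toy data with the same column names as the Weather dataframe
toy <- data.frame(
  Month = c("March", "January", "February", "January"),
  Temperature = c(40, 20, 25, 22)
)

# Month as text: groups come back in alphabetical order
alphabetical <- ddply(toy, .(Month), summarise, avg.temp = mean(Temperature))

# Month as a factor with calendar-ordered levels: groups follow the levels
toy$Month <- factor(toy$Month, levels = month.name)
calendar <- ddply(toy, .(Month), summarise, avg.temp = mean(Temperature))

alphabetical
calendar
```

Months that do not occur in the data are dropped from the result by default, so the unused factor levels do not add empty rows.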
The next example summarises the average temperature and precipitation per season. The code is very similar to the first summarise example, except that the variable the rows are summarised by is changed to Season.
Weather.summary2<-ddply(Weather, .(Season), summarise,
avg.temp=mean(Temperature),
avg.perc=mean(Percipitation)
)
head(Weather.summary2)
## Season avg.temp avg.perc
## 1 Fall 44.35078 2.912997
## 2 Spring 55.08098 3.285695
## 3 Summer 69.66207 3.858623
## 4 Winter 28.25241 2.050919
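The splitting variable does not have to be a single column. As a sketch, the call below groups a hypothetical toy dataframe by two columns at once, which yields one row per combination that occurs in the data. The column names (season, month, temp) are made up for illustration.

```r
library(plyr)

# Hypothetical toy data, not the Weather dataframe used in this chapter
toy <- data.frame(
  season = c("Winter", "Winter", "Winter", "Spring", "Spring"),
  month  = c("January", "January", "March", "March", "April"),
  temp   = c(20, 30, 35, 45, 50)
)

# One output row per season/month combination present in the data
by.both <- ddply(toy, .(season, month), summarise,
                 n.days   = length(temp),
                 avg.temp = mean(temp))

by.both
```

March appears twice in the result, once under each season, which mirrors months in the Weather dataframe that straddle two seasons.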
Mutate is a very powerful tool that works like transform, but it allows variables computed within the command to be used immediately to compute other variables, all of which are incorporated back into the dataframe.
This example creates columns holding the average monthly temperature and precipitation and the standard deviation of each. The newly calculated standard deviations are then used in the same call to generate columns holding the standard errors.
Weather<-ddply(Weather, .(Month), mutate,
avg.temp=mean(Temperature),
sd.temp=sd(Temperature),
se.temp=sd.temp/sqrt(length(Month)),
avg.perc=mean(Percipitation),
sd.perc=sd(Percipitation),
se.perc=sd.perc/sqrt(length(Month))
)
head(Weather)
## Day Month Season Temperature Percipitation avg.temp sd.temp se.temp
## 1 1 April Spring 37.09563 2.741170 50.66127 9.717198 1.77411
## 2 2 April Spring 41.94007 2.980994 50.66127 9.717198 1.77411
## 3 3 April Spring 51.60673 3.459541 50.66127 9.717198 1.77411
## 4 4 April Spring 56.20946 3.687399 50.66127 9.717198 1.77411
## 5 5 April Spring 56.94827 3.723974 50.66127 9.717198 1.77411
## 6 6 April Spring 49.52478 3.356474 50.66127 9.717198 1.77411
## avg.perc sd.perc se.perc
## 1 3.412736 0.4810494 0.08782721
## 2 3.412736 0.4810494 0.08782721
## 3 3.412736 0.4810494 0.08782721
## 4 3.412736 0.4810494 0.08782721
## 5 3.412736 0.4810494 0.08782721
## 6 3.412736 0.4810494 0.08782721
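The difference between mutate and transform can be seen in a small sketch. The toy dataframe and column names are hypothetical, and the sketch assumes that no object called doubled exists elsewhere in the session.

```r
library(plyr)

# Hypothetical toy data, not the Weather dataframe used in this chapter
toy <- data.frame(x = c(2, 4, 6))

# mutate evaluates its arguments one at a time, so `doubled` already
# exists when `plus.one` is computed
mutated <- mutate(toy, doubled = x * 2, plus.one = doubled + 1)

# transform evaluates every argument against the original columns, so the
# same call fails here because `doubled` is not a column of toy
transform.result <- tryCatch(
  transform(toy, doubled = x * 2, plus.one = doubled + 1),
  error = function(e) "error"
)

mutated
transform.result
```

With transform, the same result would need two separate calls, one to create doubled and a second to use it.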