This presentation will review the basics in how to perform a between-subjects ANOVA in R using the aov function and the afex package. I will go through this using a generated dataset.
But before running this code, you will need to load the following necessary package libraries. If you don’t have the packages installed, you will need to install them before continuing. To install them, delete the “#” in the following lines of code before running it.
#install.packages(plyr)
#install.packages(ggplot2)
#install.packages(afex)
#install.packages(emmeans)
library(plyr)
library(ggplot2)
library(afex)
library(emmeans)
Generating a dataset
The created data set for a hypothetical 2x3 between-subjects experimental design where participants are asked to read a passage on different platforms and answer ten comprehension questions about the passage. The purpose of this hypothetical experiment is to examine the effect that reading platform has on text comprehension as a function of age.
- IV 1: Reading Platform (Paper, Kindle, Computer)
- IV 2: Age Group (for simplicity, the levels are just Old and Young. If you would like to examine age as a continuous variable, you can run a regression analysis. See Chapter 12 for more information on the basics of performing a regression analyses in R)
- DV: Comprehension Score (1-10)
Here, I am creating a dataframe called “Comprehension.Data”, inserting 60 participants into the dataframe, adding reading platform as a between-subjects variable with 20 participants per level, adding age group as another between-subjects variable with 10 participants in each level. Then, I am filling the dataframe with 10 participants per condition with 6 conditions. Specifying the means and standard deviations for each condition. See Chapter 7 for more information on the rnorm
function.
set.seed(42) #this is make sure you get the same values randomly each time
N.2=10 #sample size of age
N.3=20 #sample size of reading platform
Comprehension.Data <- data.frame(
Subject = seq(1,N.3*3),
Platform = c(rep("Paper", N.3), rep("Kindle", N.3), rep("Computer", N.3)),
Age = rep(c(rep("Old",N.2), rep("Young", N.2)),3),
Comprehension = c(rnorm(N.2,10,.5),rnorm(N.2,8,.5),
rnorm(N.2,6,.5),rnorm(N.2,7,.5),
rnorm(N.2,2,.5),rnorm(N.2,6,.5)))
Note: this dataset is in long-format. If your dataset is in wide-format, you have to “reshape” the data (See Chapter 8 for more information).
Here we can view the structure of the dataframe.
str(Comprehension.Data)
## 'data.frame': 60 obs. of 4 variables:
## $ Subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Platform : Factor w/ 3 levels "Computer","Kindle",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Age : Factor w/ 2 levels "Old","Young": 1 1 1 1 1 1 1 1 1 1 ...
## $ Comprehension: num 10.69 9.72 10.18 10.32 10.2 ...
Here we can view the first few lines of the data. You can also look at the entire dataframe by removing head
, and just calling for Comprehension.Data
.
head(Comprehension.Data)
## Subject Platform Age Comprehension
## 1 1 Paper Old 10.685479
## 2 2 Paper Old 9.717651
## 3 3 Paper Old 10.181564
## 4 4 Paper Old 10.316431
## 5 5 Paper Old 10.202134
## 6 6 Paper Old 9.946938
Let’s look at our means for each of the six conditions. We can use the tapply
function to do this. (See Chapter 3 for more information on the different apply functions).
tapply(Comprehension.Data$Comprehension,
list(Comprehension.Data$Platform,Comprehension.Data$Age), mean)
## Old Young
## Computer 1.989892 6.009197
## Kindle 5.910960 6.818048
## Paper 10.273648 7.918272
We can also look at the means collapsed across each independent variable. Here are the means for the two Age levels collapsed across Platform.
tapply(Comprehension.Data$Comprehension,
list(Comprehension.Data$Age), mean)
## Old Young
## 6.058167 6.915172
And here are the means for the three Platform levels collapsed across Age.
tapply(Comprehension.Data$Comprehension,
list(Comprehension.Data$Platform), mean)
## Computer Kindle Paper
## 3.999545 6.364504 9.095960
What is the aov function?
This is a built-in R function that allows you to run an Analysis of Variance (ANOVA). This function defaults to running a Type I Sum of Squares.
You can use the help section to see a description of the aov function where it will display the arguments that go into this function. Putting a “?” in front of a function is a way to easily find more information on the function.
?aov
Running an ANOVA using the aov function
Here, I am just inserting the names of my variables into the function, and specifying the dataset I am working with.
Model.aov.1<-aov(Comprehension ~ Platform*Age + Error(Subject), data=Comprehension.Data)
summary(Model.aov.1)
##
## Error: Subject
## Df Sum Sq Mean Sq
## Platform 1 207.4 207.4
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## Platform 2 58.47 29.23 89.25 < 2e-16 ***
## Age 1 6.29 6.29 19.20 5.59e-05 ***
## Platform:Age 2 101.61 50.81 155.10 < 2e-16 ***
## Residuals 53 17.36 0.33
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If you would like to specify Type II or III Sum of Squares using the aov function, you can do that by using the afex
package below:
- Langsrud, Ø. (2003). ANOVA for Unbalanced Data: Use Type II Instead of Type III Sums of Squares. Statistics and Computing, 13, 163-167.
What is the afex package?
The afex (“Analysis of Factorial Experiments”) package is an alternative to using the aov function to run an ANOVA in R.
There are three different functions in the afex package related to calculating an ANOVA:
- aov_car (This is the main function we will focus on for this tutorial)
- aov_ez (same as aov_car, but has a different layout, easier version if you don’t understand how to specify your error term)
- aov_4 (wrapper function for lmer function in the lme4 package for mixed-effects models)
You can use the help section to get a more detailed description of the afex package functions.
?aov_car
Using the aov_car function
The aov_car function defaults to Type III Sum of Squares, which is the default in SPSS. It also gives you the generalized eta-squared, not the partial eta-squared. Additionally, this function autocorrects for violations of sphericity in RM and Mixed designs (see Chapter 21), and has a better output format than the built-in aov function as it provides effect sizes. In this case our “Age” variable was observed and not manipulated so we have passed that into the function as it will impact our effect size estimates.
Model.aov.2<-aov_car(Comprehension ~ Platform*Age + Error(Subject), data=Comprehension.Data, observed = "Age")
summary(Model.aov.2) #Defaults to type III Sum of Squares
## Anova Table (Type 3 tests)
##
## Response: Comprehension
## num Df den Df MSE F ges Pr(>F)
## Platform 2 54 0.33949 383.199 0.66519 < 2.2e-16 ***
## Age 1 54 0.33949 32.452 0.08412 5.202e-07 ***
## Platform:Age 2 54 0.33949 149.652 0.77589 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can have it show us the partial eta-squared if we want. This is what SPSS outputs for us. For more information on the differences between generalized eta-squared and partial eta-squared, refer to the following paper:
- Levine, T.R. & Hullett, C.R. (2002). Eta Squared, Partial Eta Squared and the Misreporting of Effect Size in Communication Research. Human Communication Research, 28, 612-625.
Note: PES and GES would be the same in fully-between subjects designs when all the IVs are manipulated.
Model.aov.3<-aov_car(Comprehension ~ Platform*Age + Error(Subject), data=Comprehension.Data,
anova_table = list(es = "pes"))
summary(Model.aov.3)
## Anova Table (Type 3 tests)
##
## Response: Comprehension
## num Df den Df MSE F pes Pr(>F)
## Platform 2 54 0.33949 383.199 0.93418 < 2.2e-16 ***
## Age 1 54 0.33949 32.452 0.37537 5.202e-07 ***
## Platform:Age 2 54 0.33949 149.652 0.84716 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can also have it show us an output table like what you would get in SPSS by adding return="univariate"
, but it will not show you Partial Eta Squared.
Model.aov.4<-aov_car(Comprehension ~ Platform*Age + Error(Subject), data=Comprehension.Data,
anova_table = list(es = "pes"),
return="univariate")
Model.aov.4
## Anova Table (Type III tests)
##
## Response: dv
## Sum Sq Df F value Pr(>F)
## (Intercept) 2524.61 1 7436.541 < 2.2e-16 ***
## Platform 260.18 2 383.199 < 2.2e-16 ***
## Age 11.02 1 32.452 5.202e-07 ***
## Platform:Age 101.61 2 149.652 < 2.2e-16 ***
## Residuals 18.33 54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Follow-up tests
Main Effects
In the hypothetical dataset, we have a significant main effect of Age (we can see from our Age mean table above that the Young group had significantly higher comprehension than the Old group), a significant main effect of Platform, and a significant Age by Platform interaction. We are going to first follow-up the significant main effect of Platform because this variable contains more than two levels.
We will follow-up this main effect by performing a pairwise analysis on fitted means to see where the significant differences between these levels lie within the Platform factor. Note below that it automatically applied the Tukey HSD correction for multiple corrections.
FittedMeans.Platform<-emmeans(Model.aov.2, ~Platform)
FittedMeans.Platform
## Platform emmean SE df lower.CL upper.CL
## Computer 3.999545 0.1302858 54 3.738337 4.260752
## Kindle 6.364504 0.1302858 54 6.103297 6.625711
## Paper 9.095960 0.1302858 54 8.834753 9.357167
##
## Results are averaged over the levels of: Age
## Confidence level used: 0.95
pairs(FittedMeans.Platform)
## contrast estimate SE df t.ratio p.value
## Computer - Kindle -2.364959 0.1842519 54 -12.835 <.0001
## Computer - Paper -5.096415 0.1842519 54 -27.660 <.0001
## Kindle - Paper -2.731456 0.1842519 54 -14.825 <.0001
##
## Results are averaged over the levels of: Age
## P value adjustment: tukey method for comparing a family of 3 estimates
We can also do this using the Bonferroni correction.
pairs(FittedMeans.Platform, adjust="bon")
## contrast estimate SE df t.ratio p.value
## Computer - Kindle -2.364959 0.1842519 54 -12.835 <.0001
## Computer - Paper -5.096415 0.1842519 54 -27.660 <.0001
## Kindle - Paper -2.731456 0.1842519 54 -14.825 <.0001
##
## Results are averaged over the levels of: Age
## P value adjustment: bonferroni method for 3 tests
These pairwise comparisons reveal that comprehension score significantly differed between all Platform groups (Paper, Kindle, and Computer).
Interaction
We also need to follow-up the significant Age by Platform interaction by performing pairwise effects for Age on Platform. We will do this using no pvalue correction as there are only 2 levels within each groups (and emmeans reads this as 1 test per family).
FittedMeans.AgebyPlatform <- emmeans(Model.aov.2, ~ Age|Platform)
FittedMeans.AgebyPlatform
## Platform = Computer:
## Age emmean SE df lower.CL upper.CL
## Old 1.989892 0.1842519 54 1.620490 2.359295
## Young 6.009197 0.1842519 54 5.639794 6.378600
##
## Platform = Kindle:
## Age emmean SE df lower.CL upper.CL
## Old 5.910960 0.1842519 54 5.541557 6.280363
## Young 6.818048 0.1842519 54 6.448645 7.187451
##
## Platform = Paper:
## Age emmean SE df lower.CL upper.CL
## Old 10.273648 0.1842519 54 9.904246 10.643051
## Young 7.918272 0.1842519 54 7.548869 8.287674
##
## Confidence level used: 0.95
pairs(FittedMeans.AgebyPlatform)
## Platform = Computer:
## contrast estimate SE df t.ratio p.value
## Old - Young -4.019305 0.2605715 54 -15.425 <.0001
##
## Platform = Kindle:
## contrast estimate SE df t.ratio p.value
## Old - Young -0.907088 0.2605715 54 -3.481 0.0010
##
## Platform = Paper:
## contrast estimate SE df t.ratio p.value
## Old - Young 2.355377 0.2605715 54 9.039 <.0001
For additional information on follow-ups for between-subjects designs see: http://www.alexanderdemos.org/ANOVA9.html
Plotting the results
Here, I will plot the results on a bar graph using the ggplot2 package. This is a very basic graph, but if you want more information on how to do more fancy plots, refer to Chapters 10 & 11 or http://www.alexanderdemos.org/ANOVA8.html#analysis_in_r.
Plot.ComprehensionData <-ggplot(data = Comprehension.Data, aes(x = Platform, y =Comprehension, fill=Age))+
coord_cartesian(ylim = c(1,12))+ #specifying the y-axis scale
geom_bar(stat='identity', position=position_dodge(width=1))+
xlab("Platform Type")+ylab("Comprehension Score")+ #labelling the x-axis and y-axis
theme_bw()+
theme(panel.grid.major=element_blank(),
panel.grid.minor=element_blank())+
scale_fill_grey()
Plot.ComprehensionData #displaying the graph

---
title: "Chapter 20: Between-Subjects ANOVA in R"
author: "Andriana Christofalos"
output:
  html_document:
    theme: cerulean
    highlight: textmate
    fontsize: 8pt
    toc: true
    number_sections: true
    code_download: true
    toc_float:
      collapsed: false
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

This presentation will review the basics in how to perform a between-subjects ANOVA in R using the aov function and the afex package. I will go through this using a generated dataset. 

But before running this code, you will need to load the following necessary package libraries. If you don't have the packages installed, you will need to install them before continuing. To install them, delete the "#" in the following lines of code before running it.

```{r, message=FALSE}
#install.packages(plyr)
#install.packages(ggplot2)
#install.packages(afex)
#install.packages(emmeans)

library(plyr)   
library(ggplot2)  
library(afex)  
library(emmeans)
```

# Generating a dataset
The created data set for a hypothetical 2x3 between-subjects experimental design where participants are asked to read a passage on different platforms and answer ten comprehension questions about the passage. The purpose of this hypothetical experiment is to examine the effect that reading platform has on text comprehension as a function of age.

- IV 1: Reading Platform (Paper, Kindle, Computer)
- IV 2: Age Group (for simplicity, the levels are just Old and Young. If you would like to examine age as a continuous variable, you can run a regression analysis. See Chapter 12 for more information on the basics of performing a regression analyses in R)
- DV: Comprehension Score (1-10)

Here, I am creating a dataframe called "Comprehension.Data", inserting 60 participants into the dataframe, adding reading platform as a between-subjects variable with 20 participants per level, adding age group as another between-subjects variable with 10 participants in each level. Then, I am filling the dataframe with 10 participants per condition with 6 conditions. Specifying the means and standard deviations for each condition. See Chapter 7 for more information on the ```rnorm``` function. 

```{r, echo=TRUE}
set.seed(42) #this is make sure you get the same values randomly each time

N.2=10 #sample size of age 
N.3=20 #sample size of reading platform

Comprehension.Data <- data.frame( 
 Subject = seq(1,N.3*3), 
 Platform = c(rep("Paper", N.3), rep("Kindle", N.3), rep("Computer", N.3)), 
 Age = rep(c(rep("Old",N.2), rep("Young", N.2)),3), 
 Comprehension = c(rnorm(N.2,10,.5),rnorm(N.2,8,.5), 
                    rnorm(N.2,6,.5),rnorm(N.2,7,.5),
                    rnorm(N.2,2,.5),rnorm(N.2,6,.5))) 
```

Note: this dataset is in long-format. If your dataset is in wide-format, you have to "reshape" the data (See Chapter 8 for more information).

Here we can view the structure of the dataframe.
```{r, echo=TRUE, message=FALSE}
str(Comprehension.Data) 
```

Here we can view the first few lines of the data. You can also look at the entire dataframe by removing ```head```, and just calling for ```Comprehension.Data```.

```{r, echo=TRUE, message=FALSE}
head(Comprehension.Data)
```

Let's look at our means for each of the six conditions. We can use the ```tapply``` function to do this. (See Chapter 3 for more information on the different apply functions).

```{r, echo=TRUE, message=FALSE}
tapply(Comprehension.Data$Comprehension,
        list(Comprehension.Data$Platform,Comprehension.Data$Age), mean)
```

We can also look at the means collapsed across each independent variable. Here are the means for the two Age levels collapsed across Platform.

```{r, echo=TRUE, message=FALSE}
tapply(Comprehension.Data$Comprehension,
        list(Comprehension.Data$Age), mean)
```

And here are the means for the three Platform levels collapsed across Age.

```{r, echo=TRUE, message=FALSE}
tapply(Comprehension.Data$Comprehension,
        list(Comprehension.Data$Platform), mean)
```

# What is the aov function?
This is a built-in R function that allows you to run an Analysis of Variance (ANOVA). This function defaults to running a Type I Sum of Squares.

You can use the help section to see a description of the aov function where it will display the arguments that go into this function. Putting a "?" in front of a function is a way to easily find more information on the function. 
```{r, echo=TRUE, message=FALSE}
?aov
```

## Running an ANOVA using the aov function
Here, I am just inserting the names of my variables into the function, and specifying the dataset I am working with. 

```{r, echo=TRUE, message=FALSE}
Model.aov.1<-aov(Comprehension ~ Platform*Age + Error(Subject), data=Comprehension.Data)

summary(Model.aov.1)            
```

If you would like to specify Type II or III Sum of Squares using the aov function, you can do that by using the `afex` package below:

- Langsrud, Ø. (2003). ANOVA for Unbalanced Data: Use Type II Instead of Type III Sums of Squares. *Statistics and Computing*, 13, 163-167. 

# What is the afex package? 
The afex ("Analysis of Factorial Experiments") package is an alternative to using the aov function to run an ANOVA in R.  

There are three different functions in the afex package related to calculating an ANOVA:

- aov_car (This is the main function we will focus on for this tutorial)
- aov_ez (same as aov_car, but has a different layout, easier version if you don't understand how to specify your error term)
- aov_4 (wrapper function for lmer function in the lme4 package for mixed-effects models)

You can use the help section to get a more detailed description of the afex package functions. 

```{r, echo=TRUE, message=FALSE}
?aov_car
```

## Using the aov_car function
The aov_car function defaults to Type III Sum of Squares, which is the default in SPSS. It also gives you the generalized eta-squared, not the partial eta-squared. Additionally, this function autocorrects for violations of sphericity in RM and Mixed designs (see Chapter 21), and has a better output format than the built-in aov function as it provides effect sizes. In this case our "Age" variable was observed and not manipulated so we have passed that into the function as it will impact our effect size estimates.  

```{r, echo=TRUE, message=FALSE}
Model.aov.2<-aov_car(Comprehension ~ Platform*Age + Error(Subject), data=Comprehension.Data, observed = "Age")

summary(Model.aov.2) #Defaults to type III Sum of Squares
```

We can have it show us the partial eta-squared if we want. This is what SPSS outputs for us. For more information on the differences between generalized eta-squared and partial eta-squared, refer to the following paper:

- Levine, T.R. & Hullett, C.R. (2002). Eta Squared, Partial Eta Squared and the Misreporting of Effect Size in Communication Research. *Human Communication Research*, 28, 612-625. 

*Note: PES and GES would be the same in fully-between subjects designs when all the IVs are manipulated.*

```{r, echo=TRUE, message=FALSE}
Model.aov.3<-aov_car(Comprehension ~ Platform*Age + Error(Subject), data=Comprehension.Data,
                     anova_table = list(es = "pes"))

summary(Model.aov.3)
```

We can also have it show us an output table like what you would get in SPSS by adding ```return="univariate"```, but it will not show you Partial Eta Squared. 

```{r, echo=TRUE, message=FALSE}
Model.aov.4<-aov_car(Comprehension ~ Platform*Age + Error(Subject), data=Comprehension.Data,
                     anova_table = list(es = "pes"),
                     return="univariate")
Model.aov.4
```

# Follow-up tests
## Main Effects

In the hypothetical dataset, we have a significant main effect of Age (we can see from our Age mean table above that the Young group had significantly higher comprehension than the Old group), a significant main effect of Platform, and a significant Age by Platform interaction. We are going to first follow-up the significant main effect of Platform because this variable contains more than two levels. 

We will follow-up this main effect by performing a pairwise analysis on fitted means to see where the significant differences between these levels lie within the Platform factor. Note below that it automatically applied the Tukey HSD correction for multiple corrections. 

```{r, echo=TRUE, message=FALSE}
FittedMeans.Platform<-emmeans(Model.aov.2, ~Platform)
FittedMeans.Platform

pairs(FittedMeans.Platform) 
```

We can also do this using the Bonferroni correction. 

```{r, echo=TRUE, message=FALSE}
pairs(FittedMeans.Platform, adjust="bon") 
```

These pairwise comparisons reveal that comprehension score significantly differed between all Platform groups (Paper, Kindle, and Computer). 

## Interaction

We also need to follow-up the significant Age by Platform interaction by performing pairwise effects for Age on Platform. We will do this using no pvalue correction as there are only 2 levels within each groups (and emmeans reads this as 1 test per family).

```{r, echo=TRUE, message=FALSE}
FittedMeans.AgebyPlatform <- emmeans(Model.aov.2, ~ Age|Platform) 

FittedMeans.AgebyPlatform

pairs(FittedMeans.AgebyPlatform)
```

*For additional information on follow-ups for between-subjects designs see: http://www.alexanderdemos.org/ANOVA9.html*

# Plotting the results

Here, I will plot the results on a bar graph using the ggplot2 package. This is a very basic graph, but if you want more information on how to do more fancy plots, refer to Chapters 10 & 11 or http://www.alexanderdemos.org/ANOVA8.html#analysis_in_r.

```{r, echo=TRUE, message=FALSE}
Plot.ComprehensionData <-ggplot(data = Comprehension.Data, aes(x = Platform, y =Comprehension, fill=Age))+
  coord_cartesian(ylim = c(1,12))+ #specifying the y-axis scale
  geom_bar(stat='identity', position=position_dodge(width=1))+
  xlab("Platform Type")+ylab("Comprehension Score")+ #labelling the x-axis and y-axis
  theme_bw()+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank())+
  scale_fill_grey()

Plot.ComprehensionData #displaying the graph 
```



<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-98878793-1', 'auto');
  ga('send', 'pageview');

</script>