1 Purpose of this chapter

• In this chapter, we are going to cover the strengths, weaknesses, and when or when not to use three common types of correlations (Pearson, Spearman, and Kendall). It’s part statistics refresher, part R tutorial.

2 A BRIEF overview of Correlations

The three correlations we will be using are some of the most common (though Kendall is less so).

2.1 Pearson Correlation:

2.1.1 Assumption 1:

• Your data is interval or ratio. These types of continous data are important for how the correlation assumes values in variables will be related, and thus ordinal or categorical variable coding won’t work.

2.1.2 Semi-Assumption 2:

• As stated above, Pearson only works with linear data. That means that your two correlated factors have to approximate a line, and not a curved or parabolic shape. It’s not that you can’t use pearson to see if there is a linear relationship in data, it’s just that there are other tests suited to analyzing those different data structures.

2.1.4 Assumption 4:

• The data you are analyzing needs to be normally distributed. This can be done in a couple of ways (Skewness, Kurtosis) but it can also be done in a quick and dirty manner through histograms.

3 Setting up the dataset

• Now let’s simluate a dataset to take a look at how the results of these different kinds of correlatiosn may be affected by different parameters of data. First, we need to install some packages.
#install.packages("MASS")
library(MASS)
#install.packages("ggplot2")
library(ggplot2)
#install.packages("rococo")
library(rococo)
#install.packages("psych")
library(psych)
#install.packages("lpSolve")
library(lpSolve)
#install.packages("irr")
library(irr)
#install.packages("mvtnorm")
library(mvtnorm)
#Step 1 - set the parameters of our dataset and data

# Desired Correlation
d.cor <- 0.7
# Desired mean of X
d.mx <- 80
# Desired range of X
d.rangex <- 20
# Desired mean of Y
d.my <- 90
# Desired range of Y
d.rangey <- 20

##Step2
# Calculations to create multiplication and addition factors for mean and range of X and Y
mx.factor <- d.rangex/6
my.factor <- d.rangey/6

# Generate data - for this example, let's think of this as 60 students (rows).  Let's say they all took a test at the beginning
#of the semester, and then again at the end of the semester.  That will give us 2 columns of data, which is 2 scores per student,
#with a pearson correlation of .80.  Note that you can adjust the parameters as you like with the code in Steps 1 and 2.  For now,
#we will be making each test score roughly normally distributed.

out <- as.data.frame(mvrnorm(400, mu = c(0,0),
Sigma = matrix(c(1,d.cor,d.cor,1), ncol = 2),
empirical = TRUE))

# Adjust so that values are positive and include factors to match desired means and ranges
#(we don't want negative vales on a test score)
#and also rename them to Test.1, and Test. 2  We will leave "V1" $"V2" in the datsset in case we #want to alter the range of the data and the correlation later. out$"Test.1" <- (out$V1 - min(out$V1))*mx.factor + addx.factor
out$"Test.2" <- (out$V2 - min(out$V2))*my.factor + addy.factor ##It may also be helpful to give each student an ID number in case we want to look at specific student data later on #To do this, we need to create a variable, n, that will always adapt to the number of subjects you have to give them a subject number #in case you want to alter the number of subjects in your simulated data set n<-length(out$"Test.1")

#now we need to create a subject id column
Sub.Id<-c(1:n)

##and then put it as a new column in our data frame using the "cbind" function
Class.Data<-cbind(ID=Sub.Id,out)

#and then check our work
View(Class.Data)

#We can also look at our histograms to make sure the data within our individual tests is normally distributed

hist(out$"Test.1") hist(out$"Test.2")

-Now we can look out our data in a scatterplot, and also fit a linear trend line, to make sure it looks correlated, and also that the linear trend line looks good.

# Create liniear model to calculate intercept and slope
fit <- lm(out$Test.2 ~ out$Test.1, data=out)
coef(fit)
## (Intercept)  out$Test.1 ## 34.68142 0.70000 # Plot scatterplot along with regression line ggplot(out, aes(x=Test.1, y=Test.2)) + geom_point() + coord_fixed() + geom_smooth(method='lm') # Produce summary table summary(out) ## V1 V2 Test.1 Test.2 ## Min. :-2.97985 Min. :-3.19032 Min. :70.00 Min. :80.00 ## 1st Qu.:-0.62519 1st Qu.:-0.68310 1st Qu.:77.85 1st Qu.:88.36 ## Median : 0.01322 Median : 0.03225 Median :79.98 Median :90.74 ## Mean : 0.00000 Mean : 0.00000 Mean :79.93 Mean :90.63 ## 3rd Qu.: 0.69287 3rd Qu.: 0.68248 3rd Qu.:82.24 3rd Qu.:92.91 ## Max. : 3.37708 Max. : 2.67112 Max. :91.19 Max. :99.54 4 Comparing Correlations Now we want to check our three different pairwise comparisons and compare their values. cor(Class.Data$Test.1,out$Test.2, method = c("pearson")) ## [1] 0.7 cor(Class.Data$Test.1,out$Test.2, method = c("spearman")) ## [1] 0.6822871 cor(Class.Data$Test.1,out$Test.2, method = c("kendall")) ## [1] 0.493584 • we can see pearson and spearman are roughly the same, but kendall is very much different. That’s because Kendall is a test of strength of dependece (i.e. one could be written as a linear function of the other), whereas Pearson and Spearman are nearly equivalent in the way they correlate normally distributed data. All of these correlations are correct in their result, it’s just that Pearson/Spearman are looking at the data in one way, and Kendall in another. • A better situation for spearman or kendall (but not for pearson) when the data is ORDINAL, in that it is ranked. So let’s transform the test 1 scores into rank scores of how well each classmate did relative to one another. ##Create a rank for test one. See more about the "rank" function below: #?rank Test.1.Rank <-rank(Class.Data$Test.1, na.last=NA,ties.method="first")

## And make a new data set with the rank test based on test 1 score
Class.Rank.1<-cbind(Test.1.Rank=Test.1.Rank,Class.Data)

#and now check our work
View(Class.Rank.1)

-And now let’s check the correlations again with the test 1 ranked data and the test 2 raw data:

cor(Class.Rank.1$Test.1.Rank,Class.Rank.1$Test.2,
method=c("pearson"))
## [1] 0.6871984
cor(Class.Rank.1$Test.1.Rank,Class.Rank.1$Test.2,
method=c("spearman"))
## [1] 0.6822871
cor(Class.Rank.1$Test.1.Rank,Class.Rank.1$Test.2,
method=c("kendall"))
## [1] 0.493584
• Here again we can see that pearson and spearman are very similar, though pearson has changed slightly. This is likely due to the granularity of one of the sources of data changing to whole integers instead of the numerous decimal places they had previously. However, we see that spearman and Kendall are exactly the same, as they is not as dependent upon the granularity of the integers.

• Let’s transform the second score into a rank as well, just to see how it looks:

Test.2.Rank <-rank(Class.Rank.1$Test.2, na.last=NA,ties.method="first") Class.Rank.2<-cbind(Test.2.Rank=Test.2.Rank,Class.Rank.1) cor(Class.Rank.2$Test.1.Rank,Class.Rank.2$Test.2.Rank, method=c("pearson")) ## [1] 0.6822871 cor(Class.Rank.2$Test.1.Rank,Class.Rank.2$Test.2.Rank, method=c("spearman")) ## [1] 0.6822871 cor(Class.Rank.2$Test.1.Rank,Class.Rank.2$Test.2.Rank, method=c("kendall")) ## [1] 0.493584 • Now we can see that Pearson exactly matches spearman, as would be expected since the integers are now whole across the board. • While these data are technically ordinal, what we’ve really done is a transformation from raw scores to rank integers. We should expect these to correlate nearly the same (or exactly the same) as the raw scores since they inherently linked. A different way to better expose the differences between these correlations may be to create a non-normal distribution, which can create problems for the Pearson correlation. • Let’s make a uniform distribution of (hypothetically, as this would likely be normally distributed in real life) the children’s average math scores throughout the year. ##Create the new varible with a normal distribution. #Look more at the runif function here #?runif #so lets make a new test 1 that has a uniform distribution with a range from 50-100 using the "runif" function Math.Avg<-runif(400,min=50,max=100) ##Let's check the shape of the distribution and notice it's not normal hist(Math.Avg)  #and now put the new uniforn test into the data set Class.Uni<-cbind(Math.Avg=Math.Avg,Class.Rank.2) #and check our work View(Class.Uni) And now let’s look at our correlations, with our original Test 2. These correlations are going to be vastly different than our previous correlations. ##now let's do some correlations between the new uniform test scores and the original test 2 scores cor(Class.Uni$Math.Avg,Class.Uni$Test.2, method=c("spearman")) ## [1] -0.01241258 cor(Class.Uni$Math.Avg,Class.Uni$Test.2, method=c("pearson")) ## [1] -0.0342776 cor(Class.Uni$Math.Avg,Class.Uni$Test.2, method=c("kendall")) ## [1] -0.008621554 While in reality it may not be the case that math ability and english (or language, generally) ability are this uncorrelated, in our hypothetical world they are very unrelated. Though pearson and spearman may be close to one another, spearman is reliable in this case because the data is not normally distributed. Again, you can still do a pearson correlation on non-normal data, but it’s not going to be as relaible as a non-parametric test which does not assume normality. On the other hand, we can also see that these data are not linearly dependent upon one another, as the kendall correlation is very low also. Now lets rank order test 1, turning it into ordinal data, and see what happens ##rank order tests based on test 1 score Math.Uni.Rank <-rank(Class.Uni$Math.Avg, na.last=NA,ties.method="first")

##add the rank order tests to the data frame
Class.Uni.Rank<-cbind(Math.Rank=Math.Uni.Rank,Class.Uni)

#and check our work
View(Class.Uni.Rank)

#now lets correlate those ranks with test 2
cor(Class.Uni.Rank$Math.Rank,Class.Uni.Rank$Test.2,
method=c("spearman"))
## [1] -0.01241258
cor(Class.Uni.Rank$Math.Rank,Class.Uni.Rank$Test.2,
method=c("pearson"))
## [1] -0.02603802
cor(Class.Uni.Rank$Math.Rank,Class.Uni.Rank$Test.2,
method=c("kendall"))
## [1] -0.008621554
• Now we can see that the correlations have remained basically the same, similar to as when we did this with the normally distributed data. Again, the spearman (for relationship) and kendal (for dependence) are going to be more reliable here than pearson.

• Note that this data (since it went to so many decimal points) did not have any ties. When you have data that is originaly in whole integers, the rank function is much more important to be aware of in how it handles ties.

Let’s quickly look at how things might change if those uniform math scores were rounded prior to ranking

#create the rounded values from the original math scores
Math.Avg.R <-round(Class.Uni$Math.Avg,digits = 0) #bind those to a new dataset Class.Math.Round<-cbind(Math.Round=Math.Avg.R,Class.Uni) #this viewing is optional #View(Class.Math.Round) #now rank the roudned values Math.Rank.Round <-rank(Math.Avg.R, na.last=NA,ties.method="first") #and bind them to a new dataset Class.Math.RoundRank<-cbind(Math.RoundRank=Math.Rank.Round,Class.Math.Round) #and now view View(Class.Math.RoundRank) We can already see that the ranked math scores that depend upon the whole ingeters might change these correlation values, but let’s check. cor(Class.Math.RoundRank$Math.RoundRank,Class.Uni.Rank$Test.2, method=c("spearman")) ## [1] -0.01240358 cor(Class.Math.RoundRank$Math.RoundRank,Class.Uni.Rank$Test.2, method=c("pearson")) ## [1] -0.02592213 cor(Class.Math.RoundRank$Math.RoundRank,Class.Uni.Rank$Test.2, method=c("kendall")) ## [1] -0.008746867 These changes aren’t dramatic, but in the rank package, there are 6 different ways to handle tie values. If we change it to, say, “average” from “first” #create the rounded values from the original math scores Math.Avg.RA <-round(Class.Uni$Math.Avg,digits = 0)
#bind those to a new dataset
Class.Math.RoundA<-cbind(Math.Round=Math.Avg.RA,Class.Uni)
#this viewing is optional
#View(Class.Math.Round)

#now rank the roudned values
Math.Rank.RoundA <-rank(Math.Avg.RA, na.last=NA,ties.method="average")
#and bind them to a new dataset
Class.Math.RoundRankA<-cbind(Math.RoundRankA=Math.Rank.RoundA,Class.Math.RoundA)

#and now view
View(Class.Math.RoundRankA)

cor(Class.Math.RoundRankA$Math.RoundRankA,Class.Uni.Rank$Test.2,
method=c("spearman"))
## [1] -0.0124663
cor(Class.Math.RoundRankA$Math.RoundRankA,Class.Uni.Rank$Test.2,
method=c("pearson"))
## [1] -0.02586804
cor(Class.Math.RoundRankA$Math.RoundRankA,Class.Uni.Rank$Test.2,
method=c("kendall"))
## [1] -0.008545671

Again, we see that these changes aren’t dramatic, but it shows that even small decisions in how your data is handled can affect your results, even when the basis of your data is the same, and the correlation you use is the same. In other studies, this may greatly impact your interpretations of your data.

5 Conclusion

Given what we see above, there are a number of things to be aware of before going with the commonly used pearson correlations. Beyond the assumptions, it’s important to know if you are looking for relationship or dependence between variables. It’s also important to be aware what may happen to your correlations if you transform your data into ranked scores (though that was not a huge factor here), or how two different distributions of data from different (in this case subject areas) can impact what statsitic your use. There are a number of different threads across forums discussing the differences between these statsitics (e.g. https://stats.stackexchange.com/questions/3943/kendall-tau-or-spearmans-rho) if you have more specific questions regarding how to use these statistics with your data.

It takes dilligence to use the right correlation!