# 1 REGRESSION BASICS

Load the libraries we are going to need. If you don’t have these libraries, you can use the install.packages() command to install them.

stargazer makes pretty regression tables, with multiple models side-by-side.

You can visualize your interactions using a couple different libraries: effects visualizes using lattice plots, whereas sjPlot visualizes using ggplot.

MASS is used for stepwise regression, as well as a range of other linear regression tasks.

relaimpo and ggplot2 are modern tools used to determine factor importance.

library(stargazer)
library(foreign)
library(car)
library(gvlma)
library(effects)
library(sjPlot)
library(MASS)
library(relaimpo)
library(ggplot2) 

## 1.1 Get data file in place

First, let’s make sure that we set our working directory. See Chapter 1 if you need a refresher on how and why we do that.

#set working directory
#setwd("path of working directory goes here!")

Below, I’ve created several examples for us to play around with while we learn regression basics & diagnostics.

### 1.1.1Example 1: A Thanksgiving to Not Remember

To make our example festive, imagine I’m interested in which factors predict enjoyment of Thanksgiving dinner. I randomly sampled 20 people and asked them about their Thanksgiving feast. I am specifically interested in the impact of political identification (Republican vs. Democrat) on the amount of wine consumed during Thanksgiving dinner (measured in glasses). I hypothesized that Democrats would drink their sorrows away during Thanksgiving (the perfect time to drink without judgment!) to a greater extent than Republicans. Note that we have a categorical predictor predicting a continuous criterion.

Let’s make the data frame. See Chapter 1 for dataframe basics, and Chapter 7 for more information on the rnorm function.

#set the seed. This ensures that rnorm will sample the same datapoints for you as it did for me.
set.seed(57)

NRepublicans = 10
NDemocrats = 10

#ID number
ID<-factor(c(seq(1:NRepublicans),seq(1:NDemocrats)))

#Vector of labels
Group<-c(rep("Republicans",NRepublicans),rep("Democrats",NDemocrats))

#vector of datapoints for wine consumed
Wine<-c(rnorm(NRepublicans,mean=5,sd=1),rnorm(NDemocrats,mean=10,sd=1.5))

### Let's put it all together
dataset.gobble1<-data.frame(
subjectID = ID,
PolID = Group,
WineO = Wine)

str(dataset.gobble1)
## 'data.frame':    20 obs. of  3 variables:
##  $subjectID: Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ... ##$ PolID    : Factor w/ 2 levels "Democrats","Republicans": 2 2 2 2 2 2 2 2 2 2 ...
##  $WineO : num 4.31 3.23 5.62 7.02 5.14 ... levels(dataset.gobble1$PolID)
## [1] "Democrats"   "Republicans"
print(dataset.gobble1)
##    subjectID       PolID     WineO
## 1          1 Republicans  4.306234
## 2          2 Republicans  3.233171
## 3          3 Republicans  5.622536
## 4          4 Republicans  7.016676
## 5          5 Republicans  5.140933
## 6          6 Republicans  6.625664
## 7          7 Republicans  6.390078
## 8          8 Republicans  4.122013
## 9          9 Republicans  3.971433
## 10        10 Republicans  5.951447
## 11         1   Democrats 12.681963
## 12         2   Democrats 13.570811
## 13         3   Democrats  8.571809
## 14         4   Democrats  9.715497
## 15         5   Democrats  7.543088
## 16         6   Democrats  8.794743
## 17         7   Democrats  6.065492
## 18         8   Democrats  7.182969
## 19         9   Democrats 10.466077
## 20        10   Democrats 12.641400

## 1.2 Run a basic linear regression model

We will regress political identification onto our criterion variable, wine consumption.

To run a basic regression model, use the lm() function.

?lm

It generally takes the layout of “Name your model”<-lm(Criterion~Predictor, data=“name of your dataset”). Note that in your output, the “estimate” value denotes the unstandardized beta for each predictor. Also note that I didn’t dummy code the categorical predictor. R is very smart and can handle this (it already has a contrast table built in for every factor variable). You can look at the current contrast table for our factor variable (political ID) by using the contrasts() command.

Refer to the following resource for additional info on linear model notation in R http://faculty.chicagobooth.edu/richard.hahn/teaching/formulanotation.pdf

contrasts(dataset.gobble1$PolID) ## Republicans ## Democrats 0 ## Republicans 1 You can see that it is currently dummy coded with Democrats as the reference group. What R outputs by default will be drink consumption for one level of the variable in relation to the reference group level. In our case, because Democrats is the reference group, we should see the amount of drinks consumed for Republicans, relative to Democrats. Let’s confirm that’s what we get. #use lm() function here to run your regression model Gobble.model.1<-lm(WineO~PolID, data=dataset.gobble1) summary(Gobble.model.1) ## ## Call: ## lm(formula = WineO ~ PolID, data = dataset.gobble1) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.6579 -1.1803 -0.0525 1.2110 3.8474 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 9.7234 0.6421 15.143 1.1e-11 *** ## PolIDRepublicans -4.4854 0.9081 -4.939 0.000106 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.031 on 18 degrees of freedom ## Multiple R-squared: 0.5754, Adjusted R-squared: 0.5518 ## F-statistic: 24.4 on 1 and 18 DF, p-value: 0.0001059 #Stargazer package gave us a pretty summary table, right?! #Note that it outputs a measure of effect size for the model (i.e., adjusted R-squared) stargazer(Gobble.model.1,type="text") ## ## =============================================== ## Dependent variable: ## --------------------------- ## WineO ## ----------------------------------------------- ## PolIDRepublicans -4.485*** ## (0.908) ## ## Constant 9.723*** ## (0.642) ## ## ----------------------------------------------- ## Observations 20 ## R2 0.575 ## Adjusted R2 0.552 ## Residual Std. Error 2.031 (df = 18) ## F Statistic 24.396*** (df = 1; 18) ## =============================================== ## Note: *p<0.1; **p<0.05; ***p<0.01 #Want to save this table as a txt file? #Set your working directory, then run code below #The table will save to your working directory, and then you can open this file with any word processor #stargazer(Gobble.model.1,type="text", out="gobble1.txt") What did we find? Political ID (specifically being Republican) predicted less drink consumption on Thanksgiving. AKA: Republicans (relative to Democrats) were less likely to drink away their sorrows, as predicted. Cool! Want to change your dummy coding? If you’d like to change how R dummy codes your categorical predictor, you can use the relevel() command. For example, I can change the reference group in our example to be Republicans. dataset.gobble1$PolID2 = relevel(dataset.gobble1$PolID, ref="Republicans") contrasts(dataset.gobble1$PolID2)
##             Democrats
## Republicans         0
## Democrats           1

Now Republicans are the reference group.

### 1.2.1Example 2: A Thanksgiving Not to Remember, Part 2

Recall that I’m interested in all things Thanksgiving. I randomly sampled 250 people and asked them about their Thanksgiving. Now, I am specifically interested in the impact of turkey consumption (measured in grams) on the amount of time that people need to nap after Thanksgiving dinner (measured in minutes). I hypothesized that greater turkey consumption would predict more time napping after dinner.

Let’s create the data frame. Note that unlike the example above, we are now working with a continuous predictor and a continuous criterion.

#set the seed. This ensures that rnorm will sample the same datapoints for you as it did for me.
set.seed(59)

N <- 250
TurkeyTime <- runif(N, 0, 50)
NapTime <- (2*TurkeyTime + rnorm(N, sd=2)) + 1.5
#I made nap time have a positive linear relationship with turkey so this example would work out nicely. Hence, the 2*TurkeyTime above.
#Why did I add rnorm(N, sd=2)? To add normally distributed errors to the Naptime on TurkeyTime linear relationship.
#Why did I add a 1.5 to the equation? To ensure that I don't get an intercept for NapTime that is negative (it wouldn't make sense to nap negative minutes!)

#ID number
ID<-factor(c(seq(1:N)))

### Let's put it all together
dataset.gobble2<-data.frame(
subjectID = ID,
Turkeyyy = TurkeyTime,
Sleepyyy = NapTime)

str(dataset.gobble2)
## 'data.frame':    250 obs. of  3 variables:
##  $subjectID: Factor w/ 250 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ... ##$ Turkeyyy : num  1.56 27.61 43.64 35.97 31.23 ...
##  $Sleepyyy : num 4.28 58.56 88.03 75.58 65.4 ... head(dataset.gobble2) ## subjectID Turkeyyy Sleepyyy ## 1 1 1.564606 4.280986 ## 2 2 27.606268 58.560724 ## 3 3 43.640831 88.027656 ## 4 4 35.967133 75.581889 ## 5 5 31.230798 65.395897 ## 6 6 39.810387 83.226087 Now let’s run our regression model, with turkey consumption predicting nap time. #use lm() function here to run your regression model Gobble.model.2<-lm(Sleepyyy~Turkeyyy, data=dataset.gobble2) summary(Gobble.model.2) ## ## Call: ## lm(formula = Sleepyyy ~ Turkeyyy, data = dataset.gobble2) ## ## Residuals: ## Min 1Q Median 3Q Max ## -6.2833 -1.4101 0.2003 1.3977 5.1238 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.289839 0.265924 4.85 2.18e-06 *** ## Turkeyyy 2.005301 0.009012 222.50 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.056 on 248 degrees of freedom ## Multiple R-squared: 0.995, Adjusted R-squared: 0.995 ## F-statistic: 4.951e+04 on 1 and 248 DF, p-value: < 2.2e-16 #Stargazer package gave us a pretty summary table, right?! #Note that it outputs a measure of effect size for the model (i.e., adjusted R-squared) stargazer(Gobble.model.2,type="text") ## ## =============================================== ## Dependent variable: ## --------------------------- ## Sleepyyy ## ----------------------------------------------- ## Turkeyyy 2.005*** ## (0.009) ## ## Constant 1.290*** ## (0.266) ## ## ----------------------------------------------- ## Observations 250 ## R2 0.995 ## Adjusted R2 0.995 ## Residual Std. Error 2.056 (df = 248) ## F Statistic 49,507.520*** (df = 1; 248) ## =============================================== ## Note: *p<0.1; **p<0.05; ***p<0.01 #Want to save this table as a txt file? #Set your working directory, then run code below #The table will save to your working directory, and then you can open this file with any word processor #stargazer(Gobble.model.2,type="text", out="gobble2.txt") What did we find? Greater turkey consumption significantly predicted more time napping after Thanksgiving dinner, as predicted. # 2 REGRESSION ASSUMPTIONS Before we submit our findings to the Journal of Thanksgiving Science, we need to verifiy that we didn’t violate any regression assumptions. Let’s review what our basic linear regression assumptions are conceptually, and then we’ll turn to diagnosing these assumptions in the next section below. Linearity Linear regression is based on the assumption that your model is linear (shocking, I know). Violation of this assumption is very serious–it means that your linear model probably does a bad job at predicting your actual (non-linear) data. Perhaps the relationship between your predictor(s) and criterion is actually curvilinear or cubic. If that is the case, a linear model does a bad job at modeling that relationship, and it is inappropriate to use such a model. There’s no point in worrying about significance tests or confidence intervals if a linear model doesn’t reflect your non-linear data. Normally distributed stuff & things The assumption of normality in regression manifests in three ways: 1. For confidence intervals around a parameter to be accurate, the paramater must come from a normal distribution. 2. For significance tests of models to be accurate, the sampling distribution of the thing you’re testing must be normal. 3. To get the best estimates of parameters (i.e., betas in a regression equation), the residuals in the population must be normally distributed. This assumption is most important when you have a small sample size (because central limit theorem isn’t working in your favor), and when you’re interested in constructing confidence intervals/doing significance testing. Homoscedasticity/homogeneity of variance Homogeneity of variance occurs when the spread of scores for your criterion is the same at each level of the predictor. When this assumption is satisfied, your parameter estimates will be optimal. When there are unequal variances of the criterion at different levels of the predictor (i.e., when this assumption is violated), you’ll have inconsistency in your standard error & parameter estimates in your model. Subsequently, your confidence intervals and significance tests will be biased. Independence Last but not least, independence means that the errors in your model are not related to each other. Computation of standard error relies on the assumption of independence, so if you don’t have standard error, say goodbye to confidence intervals and significance tests. No influential outliers This isn’t technically an assumption of regression, but it’s best practice to avoid influential outliers. Why is it problematic to have outliers in your data? Outliers can bias parameter estimates (e.g., mean), and they also affect your sums of squares. Sums of squares are used to estimate the standard error, so if your sums of squares are biased, your standard error likely is too. It’s really bad to have a biased standard error because it is used to calculate confidence intervals around our parameter estimate. In other words, outliers could lead to biased confidence intervals. Not good! We need to rid ourselves of those pesky outliers! # 3 REGRESSION DIAGNOSTICS So now that your memory is refreshed on all regression assumptions, how do we know whether we’ve violated any of those assumptions? We can run diagnostics in R to assess whether our assumptions are satisfied or violated. But first, it always helps to visualize the relationship between our variables to get an intuitive grasp of the data. plot(TurkeyTime, NapTime, main="Scatterplot of Thanksgiving", xlab="Turkey Consumption in Grams ", ylab="Sleep Time in Minutes ", pch=19) The relationship between turkey consumption and sleep time looks pretty darn linear! Neato! But let’s see what our other diagnostic plots say. ## 3.1 Getting a Broad Snapshot: plot() Let’s run the plot() function to run four different diagnostic plots. We can visually inspect these plots to get a sense of how we did on several of the assumptions outlined above. Note that I am inspecting how my second model did on satisfying assumptions (i.e., when turkey consumption predicted nap time). #This lets you view each diagnistic plot one at a time. #I will walk you through 4 plots, and interpret each in turn. plot(Gobble.model.2,which=1) Plot 1: The first plot depicts residuals versus fitted values. The plot of residuals versus predicted values is useful for checking the assumption of linearity and homoscedasticity. If the model does NOT meet the linear model assumption, we would see our residuals take on a defined shape or a distinctive pattern. For example, if your plot looks like a parabola, that’s bad. Your scatterplot of residuals should look like the night sky–no distinctive patterns. The red line through your scatterplot should also be straight and horizontal, not curved, if the linearity assumption is satisfied. To assess if the homoscedasticity assumption is met we look to make sure that the residuals are equally spread around the y = 0 line. How did we do? R automatically flagged 3 data points that have large residuals (observations 116, 187, and 202). Besides that, our residuals appear to be pretty darn linear, as is evidenced by the straight red line plotted through our residuals (if they were quadratic, for example, that line would resemble the shape of a parabola). It also looks like our data are homoscedastic, given that they appear evenly spread around the y = 0 line. Hooray! #Let's look at our second plot now. plot(Gobble.model.2,which=2) Plot 2: The normality assumption is evaluated based on the residuals and can be evaluated using a QQ-plot by comparing the residuals to “ideal” normal observations along the 45-degree line. How did we do? R automatically flagged those same 3 data points that have large residuals (observations 116, 187, and 202). However, aside from those 3 data points, observations lie well along the 45-degree line in the QQ-plot, so we may assume that normality holds here. #Let's look at our third plot now. plot(Gobble.model.2,which=3) Plot 3: The third plot is a scale-location plot (square rooted standardized residual vs. predicted value). This is useful for checking the assumption of homoscedasticity. In this particular plot we are checking to see if there is a pattern in the residuals. If the red line you see on your plot is flat and horizontal with equally and randomly spread data points (like the night sky), you’re good. If your red line has a positive slope to it, or if your data points are not randomly spread out, you’ve violated this assumption. How did we do? R flagged observations 116, 187, and 202. Besides that, we do see a horizontal line with randomly scattered data points around it, suggesting that the homoscedasticity assumption is satisfied here. #Let's look at our final plot now. #Note that this specific plot ID is considered '5' in R. That's why I called for which=5. plot(Gobble.model.2,which=5) Plot 4: The fourth plot helps us find influential cases, if any are present in the data. Note: Outliers may or may not be influential points. Influential outliers are of the greatest concern. They could alter the results, depending on whether they are included or excluded from the analysis. If you’re good and don’t have influential cases, you will hardly see a dash red curve, if at all (that red dashed curved line represents Cook’s distance). If you don’t see a red Cook’s distance curved line, or one is just barely peaking out of the corner or your plot but none of your data points are within it, you’re good. If you notice some of your data points cross that distance line, you’re not so good/ you have influential data points. How did we do? Observations 152, 187, and 202 stick out. However, they don’t cross Cook’s distance line (which isn’t even present in this plot), so we’re okay! #Alternatively, if you'd like to see all 4 plots side by side, try this: layout(matrix(c(1,2,3,4),2,2)) plot(Gobble.model.2) Overall evaluation: Observations 187 and 202 are flagged in each of the 4 plots above. You should consider looking at these observations more closely. Is there something weird about these subjects, or were there errors in data entry? Besides these potentially weird observations, however, our data look great! ## 3.2 Testing for Outliers: outlierTest() We can also use the outlierTest() function to detect outliers in the data. This function runs a Bonferonni adjusted outlier test. The null for this test is that the observation is NOT an outlier. # Assessing Outliers #Are there outliers? outlierTest(Gobble.model.2) # Bonferonni p-value for most extreme obs ## ## No Studentized residuals with Bonferonni p < 0.05 ## Largest |rstudent|: ## rstudent unadjusted p-value Bonferonni p ## 202 -3.130544 0.0019548 0.4887 How did we do? Observation 202 is the most extreme observation in our data, but it isn’t an outlier, as the null cannot be rejected (adjusted p > 0.05). ## 3.3 Testing for Influential Observations: Cook’s Distance The code below plots Cook’s distance, which is a measure of the influence of each observation on the regression coefficients. The Cook’s distance statistic is a measure, for each observation in turn, of the extent of change in model estimates when that particular observation is omitted. Any observation for which the Cook’s distance is close to 1 or more, or that is substantially larger than other Cook’s distances (highly influential data points), requires investigation. # Influential Observations # Cook's D plot # identify D values > 4/(n-k-1) cutoff <- 4/((nrow(dataset.gobble2)-length(Gobble.model.2$coefficients)-2))
plot(Gobble.model.2, which=4, cook.levels=cutoff)

How did we do? Observations 152, 187, and 202 appear to be the most influential observations in our data, confirming what we saw above when we used the plot() function to get a sense of whether any data points are influential (plot 4, specifically). However, note that plot 4 from the plot() function suggested that these observations were not truly influential on our regression results.

## 3.4 Testing the Normality Assumption: studres()

The studres() function does just what it sounds like: plots the distribution of residuals. We want it to look normally distributed, in accordance with the normality assumption.

# distribution of studentized residuals
sresid <- studres(Gobble.model.2)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)

How did we do? Looks nice n’ normal to me!

## 3.5 Testing the Homoscedasticity Assumption: ncvTest()

The ncvTest() provides us with another test of the homoscedasticity assumption. Specifically, this is a statistic that tests whether there is non-constant variance (i.e., heteroscedasticity). The null states that there is constant variance.

# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(Gobble.model.2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.0574495    Df = 1     p = 0.8105735

How did we do? p > .05, suggesting that our data is homoscedastic. Yay!

## 3.6 Testing the Linearity Assumption: crPlots()

The crPlots() function is great for visually seeing whether your predictors have a linear relationship with your dependent variable. You should see a green line that models the residuals of your predictor against your dependent variable (i.e., the loess line). The red line represents the line of best fit. If your green line seems to be similarly linear as your red line, you’re good. If the green line appears curved relative to the red line, you likely have a linearity problem.

# Evaluate Nonlinearity
# component + residual plot aka partial-residual plots
crPlots(Gobble.model.2)

How did we do? Great! Both red & green lines are nice and linear, as they should be.

## 3.7 Testing the Independence Assumption: durbinWatsonTest()

The Durbin Watson examines whether the errors are autocorrelated with themselves. The null states that they are not autocorrelated (what we want). This test could be especially useful when you conduct a multiple (times series) regression. For example, this test could tell you whether the residuals at time point 1 are correlated with the residuals at time point 2 (they shouldn’t be). In other words, this test is useful to verify that we haven’t violated the independence assumption.

durbinWatsonTest(Gobble.model.2)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.03984474      2.079617    0.54
##  Alternative hypothesis: rho != 0

How did we do? p > 0.05, so the errors are not autocorrelated. We have not violated the independence assumption. Hooray!

## 3.8 Global Test of Model Assumptions: gvlma()

Finally, if you are feeling lazy, you could always do a global test of model assumptions using the gvlma package. Check out (almost) everything using just one package! Just as the other plots suggested, all of our assumptions seem to be acceptable. Special note: There is not much input from the R user community about each of these decisions, and what exactly each decision means. User beware!

gvmodel.gobble <- gvlma(Gobble.model.2)
summary(gvmodel.gobble)
##
## Call:
## lm(formula = Sleepyyy ~ Turkeyyy, data = dataset.gobble2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.2833 -1.4101  0.2003  1.3977  5.1238
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.289839   0.265924    4.85 2.18e-06 ***
## Turkeyyy    2.005301   0.009012  222.50  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.056 on 248 degrees of freedom
## Multiple R-squared:  0.995,  Adjusted R-squared:  0.995
## F-statistic: 4.951e+04 on 1 and 248 DF,  p-value: < 2.2e-16
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05
##
## Call:
##  gvlma(x = Gobble.model.2)
##
##                      Value p-value                Decision
## Global Stat        3.05652  0.5484 Assumptions acceptable.
## Skewness           0.62165  0.4304 Assumptions acceptable.
## Kurtosis           0.03853  0.8444 Assumptions acceptable.
## Link Function      0.64649  0.4214 Assumptions acceptable.
## Heteroscedasticity 1.74985  0.1859 Assumptions acceptable.

## 3.9 Overall Evaluation

You can see that there are a LOT of diagnostics that you can run to check whether you’ve satisfied or violated any regression assumptions. Luckily, they are all pretty consistent with each other. In my perfect regression model above, we didn’t have any big issues with assumptions.

# 4 What about a more problematic example?

Let’s intentionally make the relationship between turkey consumption and nap time positively quadratic.

#set the seed. This ensures that rnorm will sample the same datapoints for you as it did for me.
set.seed(66)

N <- 250
TurkeyTime2 <- runif(N, 0, 50)
NapTime2 <- (TurkeyTime2^2 + rnorm(N, sd=16)) + 500
#I made nap time have a positive quadratic relationship with turkey so this example would be icky.
#I also added 500 to the equation to ensure we don't get a negative intercept for nap time.

#Let's make a data frame
dataset.gobble3<-data.frame(
Turkeyyy2 = TurkeyTime2,
Sleepyyy2 = NapTime2)

#Let's make things really interesting by adding some bad outliers to the data!
outliers<-data.frame(Turkeyyy2=c(45, 50), Sleepyyy2=c(3000,4000))
dataset.gobble3<-rbind(outliers, dataset.gobble3)
#binds 2 new outliers to the top of our dataset

str(dataset.gobble3)
## 'data.frame':    252 obs. of  2 variables:
##  $Turkeyyy2: num 45 50 49.5 41.5 29.3 ... ##$ Sleepyyy2: num  3000 4000 2933 2213 1363 ...
#note that our 2 outliers are at the top of the dataset now
head(dataset.gobble3)
##   Turkeyyy2 Sleepyyy2
## 1  45.00000 3000.0000
## 2  50.00000 4000.0000
## 3  49.49683 2933.0027
## 4  41.48463 2213.3863
## 5  29.29435 1362.8240
## 6  20.85231  941.4564

Let’s test our new ridiculous model (notice how long people are sleeping now…)

#use lm() function here to run your regression model
Gobble.model.3<-lm(Sleepyyy2~Turkeyyy2, data=dataset.gobble3)
summary(Gobble.model.3)
##
## Call:
## lm(formula = Sleepyyy2 ~ Turkeyyy2, data = dataset.gobble3)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -231.21 -162.12  -57.55  119.92 1368.81
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  14.6311    26.3289   0.556    0.579
## Turkeyyy2    52.3313     0.9236  56.658   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 202.1 on 250 degrees of freedom
## Multiple R-squared:  0.9277, Adjusted R-squared:  0.9275
## F-statistic:  3210 on 1 and 250 DF,  p-value: < 2.2e-16
#Stargazer package gave us a pretty summary table, right?!
#Note that it outputs a measure of effect size for the model (i.e., adjusted R-squared)
stargazer(Gobble.model.3,type="text")
##
## ===============================================
##                         Dependent variable:
##                     ---------------------------
##                              Sleepyyy2
## -----------------------------------------------
## Turkeyyy2                    52.331***
##                               (0.924)
##
## Constant                      14.631
##                              (26.329)
##
## -----------------------------------------------
## Observations                    252
## R2                             0.928
## Residual Std. Error     202.078 (df = 250)
## F Statistic         3,210.114*** (df = 1; 250)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
#Want to save this table as a txt file?
#Set your working directory, then run code below
#The table will save to your working directory, and then you can open this file with any word processor
#stargazer(Gobble.model.3,type="text", out="gobble3.txt")

With our new model, it seems like greater turkey consumption still predicts greater nap time. But are our assumptions satisfied?

Let’s first visualize this relationship to get an intuitive grasp of the data.

plot(dataset.gobble3, main="Scatterplot of Thanksgiving",
xlab="Turkey Consumption in Grams ", ylab="Sleep Time in Minutes ", pch=19)

It sure doesn’t look linear, and we can see our outliers. But what about the other assumptions?

## 4.1 Getting a Broad Snapshot: plot()

Let’s run the plot() function to get a snapshot of how many of our assumptions are doing.

#Let's go through each plot one by one.
plot(Gobble.model.3,which=1)

Plot one: Are the data linear? No; the residuals shouldn’t be clustered together in the shape of a smile. They should be randomly scattered across the plot, and the red line should be horizontal/not curved. Are the data homoscedastic? They don’t appear to be, because the residuals aren’t equally spread around the y = 0 line. So far, it seems like we’ve violated the linearity and homoscedasticity assumptions.

#Let's look at our second plot.
plot(Gobble.model.3,which=2)

Plot two: How are we doing on the normality assumption? Not good; the QQplot is curved at the ends rather than straight along the dotted 45-degree angle line. This suggests we’ve violated the normality assumption.

#Let's look at our third plot.
plot(Gobble.model.3,which=3)

Plot three: Are our data homoscedastic? This one is tricky; the red line is pretty horizontal, suggesting that the data are in fact homoscedastic. However, note how our residuals are clustered in the shape of a W rather than looking randomly dispursed like the night sky. This isn’t good– it suggests that there is a lot of variation in residual dispersion around the line of best fit. In other words, the homoscedasticity assumption has been violated.

#Let's go through each plot one by one.
plot(Gobble.model.3,which=5)

Plot four: Do we have influential cases? Observation 2 is very close to crossing Cook’s distance dotted line (i.e., Cook’s distance for observation 2 is almost > 0.5), but it doesn’t actually cross. Strictly speaking, this plot suggests that our outliers aren’t influential.

#Or, if you'd like to see all 4 plots side by side, try this:
layout(matrix(c(1,2,3,4),2,2))
plot(Gobble.model.3)

Overall Evaluation: It’s no surprise that the linearity assumption is violated in this example (visible in plot 1). Moreover, plot 2 suggests that the normality assumption is violated, and plots 1 and 3 suggest that the homoscedasticity assumption is violated. Observations 1, 2, and 235 are outliers (but are not influential, as revealed in plot 4): They are flagged as outliers in each of the plots above. Check to see whether there is something weird about these subjects, or whether there were data entry errors. In sum, our model is a disaster!

## 4.2 Testing for Outliers: outlierTest()

We can also use the outlierTest() function to detect outliers in the data. This function runs a Bonferonni adjusted outlier test. The null for this test is that the observation is not an outlier.

# Assessing Outliers
#Are there outliers? Yes,
outlierTest(Gobble.model.3) # Bonferonni p-value for most extreme obs
##   rstudent unadjusted p-value Bonferonni p
## 2 7.560856         7.6852e-13   1.9367e-10

How did we do? It’s no surprise that observation 2 is an outlier (but do note that observations 1 and 235 were not flagged as outliers in this test). We reject the null that there are no outliers (p < .05).

## 4.3 Testing for Influential Observations: Cook’s Distance

Let’s check out Cook’s distance– it should flag the same observations as in plot 4 in the plot() function.

# Influential Observations
# Cook's D plot
# identify D values > 4/(n-k-1)
cutoff <- 4/((nrow(dataset.gobble3)-length(Gobble.model.3$coefficients)-2)) plot(Gobble.model.3, which=4, cook.levels=cutoff) How did we do? Observations 1, 2, and 235 appear to be the most influential observations in our data–the same ones flagged in the residuals vs. leverage plot above. Do note, however, that the Residuals vs. Leverage plot above (plot 4) suggests that these observations are not truly influential to our regression results. ## 4.4 Testing the Normality Assumption: studres() Are our residuals normally distributed? # distribution of studentized residuals sresid <- studres(Gobble.model.3) hist(sresid, freq=FALSE, main="Distribution of Studentized Residuals") xfit<-seq(min(sresid),max(sresid),length=40) yfit<-dnorm(xfit) lines(xfit, yfit) Nope! Not surprised; our model is intentionally horrible! ## 4.5 Testing the Homoscedasticity Assumption: ncvTest() Are we homoscedastic? Remember: The null here states that there is constant variance. # Evaluate homoscedasticity # non-constant error variance test ncvTest(Gobble.model.3) ## Non-constant Variance Score Test ## Variance formula: ~ fitted.values ## Chisquare = 9.740168 Df = 1 p = 0.001802845 We’ve violated the homoscedasticity assumption (p < .05), and this conclusion is bolstered by what we saw in plots 1 and 3 from above. ## 4.6 Testing the Linearity Assumption: crPlots() Does our predictor have a linear relationship with the dependent variable? # Evaluate Nonlinearity # component + residual plot aka partial-residual plots crPlots(Gobble.model.3) No, our model doesn’t look linear. It looks like a quadratic relationship. ## 4.7 Testing the Independence Assumption: durbinWatsonTest() Independence of errors? The null states that they are not autocorrelated (what we want). durbinWatsonTest(Gobble.model.3) ## lag Autocorrelation D-W Statistic p-value ## 1 0.116833 1.723796 0.038 ## Alternative hypothesis: rho != 0 We’ve violated the independence assumption (p < .05). ## 4.8 Global Test of Model Assumptions: gvlma() Finally, let’s see what the gvlma package says! gvmodel.gobble3 <- gvlma(Gobble.model.3) summary(gvmodel.gobble3) ## ## Call: ## lm(formula = Sleepyyy2 ~ Turkeyyy2, data = dataset.gobble3) ## ## Residuals: ## Min 1Q Median 3Q Max ## -231.21 -162.12 -57.55 119.92 1368.81 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 14.6311 26.3289 0.556 0.579 ## Turkeyyy2 52.3313 0.9236 56.658 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 202.1 on 250 degrees of freedom ## Multiple R-squared: 0.9277, Adjusted R-squared: 0.9275 ## F-statistic: 3210 on 1 and 250 DF, p-value: < 2.2e-16 ## ## ## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS ## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM: ## Level of Significance = 0.05 ## ## Call: ## gvlma(x = Gobble.model.3) ## ## Value p-value Decision ## Global Stat 954.53 0.000e+00 Assumptions NOT satisfied! ## Skewness 141.28 0.000e+00 Assumptions NOT satisfied! ## Kurtosis 574.88 0.000e+00 Assumptions NOT satisfied! ## Link Function 222.03 0.000e+00 Assumptions NOT satisfied! ## Heteroscedasticity 16.35 5.265e-05 Assumptions NOT satisfied! This confirms what we saw above. We have major problems with all assumptions! Special note: There is not much input from the R user community about each of these decisions, and what exactly each decision means. User beware! # 5 What can we do to remedy this? We can do a bca bias correction using the boot() function in the car package. It corrects for assumptions violations. Note that the confidence interval isn’t symmetrical because our data isn’t perfectly normal. Thus, when you report them make sure to note that they are 95% bootstrap bca confidence intervals. Also note that the quantiles reported in the output are linear quantiles. gobble1 <- Boot(Gobble.model.2, R=1000) summary(gobble1) ## R original bootBias bootSE bootMed ## (Intercept) 1000 1.2898 -0.00734381 0.2739776 1.2825 ## Turkeyyy 1000 2.0053 0.00010075 0.0095508 2.0054 confint(gobble1) ## Bootstrap quantiles, type = bca ## ## 2.5 % 97.5 % ## (Intercept) 0.7506286 1.841724 ## Turkeyyy 1.9871179 2.024434 Note that once we correct for assumption violations, the 95% bootstrap bca confidence intervals suggest that the effect of turkey consumption on sleep time is significant. # 6 Introduction to Multiple Regression Let’s go over a very basic example of to introduce you to multiple regression in R. Let’s create a data frame that combines our turkey consumption, wine consumption, and sleep time variables. #set the seed. This ensures that rnorm will sample the same datapoints for you as it did for me. set.seed(99) N <- 250 TurkeyYum <- runif(N, 0, 50) WineYum <- runif(N, 0, 50) NapYum <- 2*TurkeyYum + rnorm(N, sd=0.5) #ID number IDMultiple<-factor(c(seq(1:N))) ### Let's put it all together dataset.gobble.multiple<-data.frame( subjectIDmultiple = IDMultiple, TurkeyYummy = TurkeyYum, WineYummy = WineYum, SleepyYummy = NapYum) str(dataset.gobble.multiple) ## 'data.frame': 250 obs. of 4 variables: ##$ subjectIDmultiple: Factor w/ 250 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $TurkeyYummy : num 29.24 5.69 34.21 49.63 26.75 ... ##$ WineYummy        : num  39.32 2.76 26.25 33.23 46.04 ...
##  \$ SleepyYummy      : num  58.3 11.3 68 99.4 54.1 ...
head(dataset.gobble.multiple)
##   subjectIDmultiple TurkeyYummy WineYummy SleepyYummy
## 1                 1   29.235593 39.319754    58.33437
## 2                 2    5.689084  2.757302    11.32146
## 3                 3   34.213237 26.246075    67.97108
## 4                 4   49.625439 33.232598    99.43534
## 5                 5   26.749679 46.043982    54.06836
## 6                 6   48.330703  7.676541    97.07028

Now let’s run a multiple regression model (defined as a regression model with 2 or more predictors) that considers whether turkey consumption AND wine consumption predict sleep time. What you’ll notice in the model below is that it generally takes the same form as that of simple linear regression: “Name your model” <- lm(Criterion~Predictor1 + Predictor2, data=“name of your dataset”). Now, however, I have 2 predictors and there is a + sign in between them. What this means is that the predictors are entered simultaneously into the model, and the output reflects the effect of each on the criterion, controlling for the other.

Gobble.model.multiple<-lm(SleepyYummy~TurkeyYummy+WineYummy, data=dataset.gobble.multiple)
summary(Gobble.model.multiple)
##
## Call:
## lm(formula = SleepyYummy ~ TurkeyYummy + WineYummy, data = dataset.gobble.multiple)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.5216 -0.3478 -0.0511  0.3764  1.2531
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0028384  0.0856371  -0.033    0.974
## TurkeyYummy  2.0003258  0.0023184 862.815   <2e-16 ***
## WineYummy   -0.0004085  0.0021686  -0.188    0.851
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5164 on 247 degrees of freedom
## Multiple R-squared:  0.9997, Adjusted R-squared:  0.9997
## F-statistic: 3.723e+05 on 2 and 247 DF,  p-value: < 2.2e-16

What does this mean? It means that turkey is a significant predictor of sleep time, controlling for wine consumption: The more turkey one consumes, the longer one naps. However, wine is not a significant predictor of sleep time after controlling for turkey consumption; they are unrelated.

# 7 REFERENCES

The following websites were very helpful to me while writing this chapter.

http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R5_Correlation-Regression/R5_Correlation-Regression7.html (author: Boston University School of Public Health)

http://www.vikparuchuri.com/blog/r-regression-diagnostics-part-1/ (author: Vik Paruchuri)

http://data.library.virginia.edu/diagnostic-plots/ (author: Bommae Kim, University of Virginia)

Brush up on your regression diagnositics options: https://cran.r-project.org/web/packages/car/vignettes/embedding.pdf