1 Introduction

This chapter is a step-by-step guide for novice R users to making boxplots and bar graphs, primarily using the ggplot2 package. R is capable of a lot more graphically, but this is a very good place to start.

1.0.1 What are boxplots good for?

Boxplots are an easy visual way to depict quartiles in your data. They are sometimes referred to as box-and-whisker plots because they often have lines (whiskers) extending from the box, which denote additional variability outside of these quartiles. Boxplots give us a simple way to compare groups and view dispersion and spread in data. They also help highlight outliers.
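To make the idea of quartiles concrete, here is a minimal sketch in base R. The ratings vector is invented for illustration and is not part of the chapter's dataset. It shows the five numbers a boxplot is built from.

```r
# Hypothetical funniness ratings, invented for illustration
ratings <- c(22, 35, 43, 55, 58, 79, 79, 98)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(ratings)
five

# boxplot.stats() returns the values a base R boxplot draws, plus any outliers
stats <- boxplot.stats(ratings)
stats$stats  # whisker ends, box edges, and median
stats$out    # points beyond the whiskers (none in this small example)
```

The box spans the lower and upper hinges, the line inside the box is the median, and the whiskers reach the most extreme points that are not flagged as outliers.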

1.0.2 What are bar graphs good for?

Bar graphs (aka: bar charts or bar plots) present grouped data in a rectangular format. They allow us to see and compare groups or categories based on the scores (often means) of another (often continuous) variable.

1.0.3 The example in this chapter

In this chapter I will be using a simple fake dataset that I made for this purpose. When researching how to make these different plots and graphs, most of the information I found online used examples without real data. Personally, I find these types of examples difficult to follow and I wanted to create something different. This way, if you get confused about the data, you can actually look at it. This dataset contains data from 50 fake participants assessing their self-rated funniness. Essentially, these fake individuals were asked to answer the question “How funny do you think you are?” on a scale from 1 to 100, with 100 indicating the highest degree of funniness and 1 indicating the lowest degree of funniness. Additionally, their age, gender (Male or Female), and level of education (College Grad or Not College Grad) were recorded.

2 Getting Started - Review of important Basics

See Chapter 1 for more details on topics discussed in this section.

2.0.1 Set your working directory

The setwd() function sets your working directory. This tells R where the files you are using are located. Make sure your working directory path is in quotation marks and inside the parentheses (as shown below).

setwd("/Users/Documents/Chapt9Tutorial")

2.0.2 Install and load necessary packages

Use the install.packages() function to install necessary packages. Make sure that the name of the package is once again in quotation marks and inside the parentheses (as shown below). To make boxplots and bar graphs, you will need the plyr and ggplot2 packages.

#install.packages("plyr") # This package is useful as a data manipulator 
#install.packages("ggplot2") #This package is useful for visualizing data

To load these packages from your library of installed packages, use the library() function (as shown below). No quotation marks are needed in the parentheses here.

library(plyr)    
library(ggplot2)  

2.0.3 Read in your data file

To read your data file into R, use the read.table() function. This function may need additional arguments that describe how the data are set up. In this case the values are separated by commas (sep=",") and the file contains a header row (header=TRUE). The name of the data file used in this example is Humordata_50.csv, and I am calling it HumorData to make working with it a bit simpler.

HumorData <- read.table("Humordata_50.csv", sep=",", header=TRUE)
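If you do not have Humordata_50.csv on hand, the sketch below builds a stand-in file with the same column names and reads it back in. The values are simulated and will not match the output shown in this chapter. FakeHumor, HumorCheck, and csv_path are names I made up for this illustration.

```r
# Hypothetical stand-in for Humordata_50.csv; the values are simulated,
# not the chapter's actual data
set.seed(1)
FakeHumor <- data.frame(
  ID        = 1:50,
  Gender    = sample(1:2, 50, replace = TRUE),    # 1 = Male, 2 = Female
  College   = sample(1:2, 50, replace = TRUE),    # 1 = College Grad, 2 = Not College Grad
  Age       = sample(19:89, 50, replace = TRUE),
  Funniness = sample(1:100, 50, replace = TRUE)
)

# Write to a temporary file so nothing in the working directory is touched
csv_path <- tempfile(fileext = ".csv")
write.csv(FakeHumor, csv_path, row.names = FALSE)

# read.csv() is read.table() with sep = "," and header = TRUE as defaults
HumorCheck <- read.csv(csv_path)
dim(HumorCheck)
```

For comma-separated files, read.csv() saves you from spelling out the separator and header arguments each time.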

3 Viewing and Manipulating the data

3.0.1 Looking at your data

Even though this chapter focuses on boxplots and bar graphs, I would never underestimate the utility of looking closely at your data before creating any plots. You can look at the structure of the data (or any object) using the str() function. For a data frame, it gives the number of cases (observations) and variables, the name and type (integer, factor, numeric, etc.) of each variable, and the first several values of each.

str(HumorData)
## 'data.frame':    50 obs. of  5 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender   : int  1 2 2 1 1 2 1 2 1 1 ...
##  $ College  : int  1 1 1 2 1 1 2 2 1 2 ...
##  $ Age      : int  87 52 72 53 30 34 62 35 76 86 ...
##  $ Funniness: int  79 79 58 22 98 55 26 43 34 32 ...

If you are only interested in viewing particular aspects of the data, you can use several other functions. The dimensions of your data can be found using the dim() function. The number of cases (rows) and variables (columns) specifically can be found with nrow() and ncol().

dim(HumorData) # number of participants, number of variables
## [1] 50  5
nrow(HumorData) #number of participants
## [1] 50
ncol(HumorData) #number of variables
## [1] 5

If you want to look at a particular variable in your dataset, like in this case, the age of participants, type in the name of your data then a $ and the name of the variable of interest.

HumorData$Age
##  [1] 87 52 72 53 30 34 62 35 76 86 -1 63 87 68 42 89 65 69 80 46 38 86 66
## [24] 75 26 55 76 86 21 40 35 25 63 58 31 22 68 26 63 70 48 29 86 19 29 21
## [47] 28 25 26 62

3.0.2 Attending to missing data

To flag missing data points you can use the ifelse() function. In the example below, I am dealing with the missing data in my Age variable. For participants who did not provide an age, -1 was recorded to indicate missing (you can see one such value in the Age output above). The code below says: for the Age variable in my HumorData dataset (HumorData$Age), replace any value equal to -1 with NA, and otherwise keep the original value.

HumorData$Age<-ifelse(HumorData$Age==-1,NA,HumorData$Age)
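Here is a small sketch of the same recoding on an invented vector, next to an equivalent approach that uses logical indexing. It also shows why the recoding matters: once the value is NA, functions such as mean() need na.rm = TRUE to skip it.

```r
ages <- c(87, 52, -1, 53, 30)   # hypothetical values; -1 marks a missing age

# Approach used in this chapter
recoded_ifelse <- ifelse(ages == -1, NA, ages)

# Equivalent approach with logical indexing
recoded_index <- ages
recoded_index[recoded_index == -1] <- NA

mean(recoded_ifelse)                 # NA, because one value is missing
mean(recoded_ifelse, na.rm = TRUE)   # 55.5, the mean of the four known ages
```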

3.0.3 Make variables into factors

It is important to make sure that R knows that any categorical variables you are going to use in your plots are factors and not some other type of data. If you are unsure whether a variable is already a factor, double check the structure of your data (see above). The categorical variables in my data are Gender and College, yet they are currently stored as integers rather than factors. The as.factor() function converts a variable into a factor. Once again, use the $ symbol to tell R which variable in your dataset you want to change. See below.

HumorData$Gender<-as.factor(HumorData$Gender)
HumorData$College<-as.factor(HumorData$College)

Now that my categorical variables are factors, I need to properly label them. In my gender variable, 1=Male and 2=Female, and in my college variable 1=College Grad and 2=Not College Grad. I am using the factor() function. Within that function, I am telling R which variable to change in my HumorData, what the levels currently are, and how I want to label those levels.

HumorData$Gender<-factor(HumorData$Gender,
                          levels=c(1,2),
                          labels=c("Male", "Female"))

HumorData$College<-factor(HumorData$College,
                         levels=c(1,2),
                         labels=c("College Grad", "Not College Grad"))
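As a side note, factor() can convert and label in one step, so the separate as.factor() call is optional. A minimal sketch on an invented vector of codes:

```r
gender_codes <- c(1, 2, 2, 1, 1)   # hypothetical codes: 1 = Male, 2 = Female

# Convert the numeric codes to a labeled factor in a single call
gender <- factor(gender_codes,
                 levels = c(1, 2),
                 labels = c("Male", "Female"))

levels(gender)
table(gender)
```

The order of labels must match the order of levels, otherwise the groups will be mislabeled without any warning.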

To double check that this all worked properly, take another look at the structure of your data, and you should be able to see a change in the variable type and labeling for the Gender and College variables.

str(HumorData)
## 'data.frame':    50 obs. of  5 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender   : Factor w/ 2 levels "Male","Female": 1 2 2 1 1 2 1 2 1 1 ...
##  $ College  : Factor w/ 2 levels "College Grad",..: 1 1 1 2 1 1 2 2 1 2 ...
##  $ Age      : int  87 52 72 53 30 34 62 35 76 86 ...
##  $ Funniness: int  79 79 58 22 98 55 26 43 34 32 ...

3.0.4 Quick descriptive information

When preparing to make boxplots and bar graphs it can be useful to look at frequencies and/or descriptive summaries of the variables you intend to plot. For example, I may want to see the gender or college graduation breakdown for my sample, or a quick distribution of how funny people think they are. Combining the with() function with the summary() function, as shown below, can offer some quick descriptive information about both categorical and continuous variables.

with(HumorData, summary(Gender))
##   Male Female 
##     26     24
with(HumorData, summary(College))
##     College Grad Not College Grad 
##               29               21
with(HumorData, summary(Age))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   19.00   30.00   55.00   53.04   70.00   89.00       1
with(HumorData, summary(Funniness))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   42.25   60.00   60.54   79.00   98.00

You can also further break apart variables to see how they vary as a function of another variable. For example, below I investigate how mean self-rated funniness differs across the levels of Gender (Male vs. Female) and College (College Grad vs. Not College Grad). Here I use the with() function again in combination with the by() function. I have also chosen to save this information as objects (GenderMeans and CollegeMeans) in case I want to use them later.

GenderMeans = with(HumorData, by(Funniness, Gender, mean))
GenderMeans 
## Gender: Male
## [1] 64.53846
## -------------------------------------------------------- 
## Gender: Female
## [1] 56.20833
CollegeMeans = with(HumorData, by(Funniness, College, mean))
CollegeMeans 
## College: College Grad
## [1] 69.24138
## -------------------------------------------------------- 
## College: Not College Grad
## [1] 48.52381

In addition, you can use the tapply() function to look at the means for each combination of groups. See Chapter 3 for more details on apply functions.

tapply(HumorData$Funniness,
        list(HumorData$Gender,HumorData$College), mean)
##        College Grad Not College Grad
## Male       78.50000         42.20000
## Female     57.84615         54.27273
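If you would rather have these group means as a data frame (one row per group) than as a table, aggregate() is one option. The sketch below uses a tiny invented data frame called toy, not the chapter's data.

```r
# Hypothetical six-person dataset, invented for illustration
toy <- data.frame(
  Gender    = factor(c("Male", "Male", "Female", "Female", "Male", "Female"),
                     levels = c("Male", "Female")),
  College   = factor(c("College Grad", "College Grad", "College Grad",
                       "Not College Grad", "Not College Grad", "Not College Grad")),
  Funniness = c(80, 70, 60, 50, 40, 30)
)

# Mean funniness for every Gender x College combination
cell_means <- aggregate(Funniness ~ Gender + College, data = toy, FUN = mean)
cell_means
```

This long format is also the shape that ggplot() expects if you ever want to plot precomputed means.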

This information helps me see that self-rated funniness does appear to differ based on gender and college education. It confirms my interest in making plots to display these data.

4 Boxplots

4.0.1 Basic Boxplots

It is very simple to make a basic boxplot. Below I have made two basic box-plots looking at how self-rated funniness differs based on gender and college education. Use the ggplot() function and within that you need to describe the aesthetics or aes. I am indicating which variables I want on which axes below - I want gender (or college) as the variables on the x axis and self-described funniness as the variable on the y axis. I am also telling it what plot to make with geom_boxplot().

GenderPlot1 = ggplot(HumorData, aes(x = Gender, y = Funniness)) + geom_boxplot() 
GenderPlot1

CollegePlot1 = ggplot(HumorData, aes(x = College, y = Funniness)) + geom_boxplot()
CollegePlot1

Just a quick reminder, now that you have seen this first boxplot: boxplots (or box-and-whisker plots) are a visual depiction of quartiles, dispersion, and spread in your data, and they have lines extending from the box to show the range of additional variability outside of these quartiles. Moreover, they offer a way to compare variable levels or groups. They can also help highlight outliers.
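If you are curious about the exact numbers behind a boxplot, ggplot2's layer_data() function returns the values it computed for a layer. The sketch below uses an invented two-group dataset. The column names ymin, lower, middle, upper, and ymax are what I understand geom_boxplot() to produce, so treat this as an illustration rather than a guarantee across versions.

```r
library(ggplot2)

# Hypothetical scores for two groups, invented for illustration
toy <- data.frame(
  Group = rep(c("A", "B"), each = 5),
  Score = c(10, 20, 30, 40, 50,
            15, 25, 35, 45, 100)
)

p <- ggplot(toy, aes(x = Group, y = Score)) + geom_boxplot()

# One row per box: whisker ends (ymin, ymax), box edges (lower, upper), median (middle)
box_stats <- layer_data(p)
box_stats[, c("ymin", "lower", "middle", "upper", "ymax")]
```

In group B the value 100 lies more than 1.5 times the interquartile range above the box, so the upper whisker stops at 45 and 100 is drawn as an outlier.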

If you would like to flip these box-plots from a vertical orientation to a horizontal orientation, the code is almost exactly the same. The only addition is adding + coord_flip() to the end of the phrase.

GenderPlot1_FLIP = ggplot(HumorData, aes(x = Gender, y = Funniness)) + geom_boxplot() + coord_flip()
GenderPlot1_FLIP 

CollegePlot1_FLIP = ggplot(HumorData, aes(x = College, y = Funniness)) + geom_boxplot() + coord_flip()
CollegePlot1_FLIP

Moreover, you can make a boxplot of a single variable by creating a fake grouping variable, here factor(0). Simply add xlab("") and scale_x_discrete(breaks = NULL) to the end of the code to hide the meaningless x-axis label and tick.

FunnyPlot = ggplot(HumorData, aes(x = factor(0), y = Funniness)) + geom_boxplot() + xlab("") +
  scale_x_discrete(breaks = NULL)
FunnyPlot

Once again the previous plot can be flipped by adding coord_flip() to the end of the phrase.

FunnyPlot_FLIP = ggplot(HumorData, aes(x = factor(0), y = Funniness)) + geom_boxplot() + xlab("") +
  scale_x_discrete(breaks = NULL) + coord_flip()
FunnyPlot_FLIP 

4.0.2 More Attractive Boxplots

The following box plots are slightly more advanced (and attractive) as they are further customized to include more detailed elements.

4.0.2.1 Labeled Boxplots

The following boxplot shows the same information as the basic CollegePlot1 above. However, it uses a different function within the ggplot2 package. Here we use the qplot() function and fill it with my grouping variable (College), my continuous variable (Funniness), the data (HumorData), and we specify the type of plot with geom=("boxplot"). Then I have used the main argument to add the title "Humor Chart" to the chart and the xlab and ylab arguments to label the x and y axes.

CollegePlot2 = qplot(College, Funniness, data=HumorData, geom=("boxplot"), 
                   main="Humor Chart",xlab="Education Level", ylab="Self-Rated Funniness")
CollegePlot2 

Note: As you may have noticed, I used the qplot() function instead of the ggplot() function here (and in the plot below). They are similar functions that accomplish many of the same things. Overall, qplot() is quicker, easier to navigate, but less flexible than ggplot(), which is more customizable but also more advanced and complex.
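As far as I am aware, qplot() has been deprecated in recent ggplot2 releases (3.4.0 onward), so it is worth knowing how to get the same labeled boxplot with ggplot(). The sketch below is one way to do it, using labs() for the title and axis labels. The toy data frame is invented for illustration.

```r
library(ggplot2)

# Hypothetical data, invented for illustration
toy <- data.frame(
  College   = factor(rep(c("College Grad", "Not College Grad"), each = 4)),
  Funniness = c(79, 58, 98, 55, 22, 26, 43, 32)
)

# ggplot() version of a labeled boxplot
CollegePlotGG <- ggplot(toy, aes(x = College, y = Funniness)) +
  geom_boxplot() +
  labs(title = "Humor Chart",
       x = "Education Level",
       y = "Self-Rated Funniness")
print(CollegePlotGG)
```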

4.0.2.2 Jitter Boxplots

We can expand upon the code from above to further customize our boxplots. The plot below is a jitter boxplot, meaning that the actual data points of self-rated funniness are overlaid on the plot, with a little random horizontal scatter so they do not sit on top of each other. Following the same instructions as above (this time using Gender instead of College) but using geom=c("boxplot", "jitter"), we can create a jitter boxplot.

JitterPlot = qplot(Gender, Funniness, data=HumorData, geom=c("boxplot", "jitter"), 
                   main="Humor Chart",xlab="Gender", ylab="Self-Rated Funniness")
JitterPlot 
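A jitter boxplot can also be built with ggplot() by stacking two layers. In this sketch (simulated data, hypothetical object names) I hide the boxplot's own outlier dots with outlier.shape = NA, because the jitter layer already draws every point and the outliers would otherwise appear twice.

```r
library(ggplot2)

set.seed(42)  # makes the simulated data and the jitter reproducible

# Hypothetical data, simulated for illustration
toy <- data.frame(
  Gender    = factor(rep(c("Male", "Female"), each = 10),
                     levels = c("Male", "Female")),
  Funniness = sample(1:100, 20, replace = TRUE)
)

JitterPlotGG <- ggplot(toy, aes(x = Gender, y = Funniness)) +
  geom_boxplot(outlier.shape = NA) +                     # boxes without outlier dots
  geom_jitter(width = 0.15, height = 0, alpha = 0.6) +   # raw points, scattered sideways only
  labs(title = "Humor Chart", x = "Gender", y = "Self-Rated Funniness")
print(JitterPlotGG)
```

Setting height = 0 keeps the points at their true funniness values, so only their horizontal position is randomized.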

4.0.2.3 Faceted Boxplots

Faceted plots are useful if you want to look at two different boxplots at the same time, divided by the levels of one of your categorical variables. There are many times when you may want a boxplot that looks at the potential interaction of two categorical variables. Here I am looking at how self-perceived funniness may differ as a function of both gender and education, using the ggplot() function. The following code is very similar to the simple boxplot code from our original GenderPlot1. I have simply added facet_grid(~College) to the end, indicating that College is the variable that I want R to use to divide up the boxplots.

FacetPlot1 = ggplot(HumorData, aes(x=Gender, y=Funniness)) + geom_boxplot() + facet_grid(~College) 
FacetPlot1

4.0.2.4 Customizing Outliers

Boxplots can be useful in identifying outliers in your data. Outliers are any data points that fall outside of the whiskers on the plot, and they are depicted with dots. You may have noticed two dots in the boxplot above. They denote two male participants among the college graduates whose self-rated funniness is low for this group. To change the look of those dots, you can add info to the geom_boxplot() command. Here I have specified the size, shape, and color of my outliers, and consequently made the corresponding dots larger than the default, hollow diamond shaped, and purple.

FacetPlot2 = ggplot(HumorData, aes(x=Gender, y=Funniness, label=ID)) + geom_boxplot(outlier.size=3,outlier.shape=5,outlier.colour="purple") + facet_grid(~College) 

FacetPlot2
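Once a plot shows you that outliers exist, you may want to know which participants they are. The sketch below flags them group by group with base R's boxplot.stats(), using an invented dataset. ggplot2 computes the box edges with quantile() rather than hinges, so borderline points could in principle be classified slightly differently.

```r
# Hypothetical data, invented for illustration
toy <- data.frame(
  ID        = 1:12,
  Group     = rep(c("A", "B"), each = 6),
  Funniness = c(70, 75, 78, 80, 82, 20,
                60, 62, 65, 66, 68, 70)
)

# For each group, keep the rows whose value falls beyond the whiskers
outliers_by_group <- lapply(split(toy, toy$Group), function(g) {
  g[g$Funniness %in% boxplot.stats(g$Funniness)$out, ]
})
flagged <- do.call(rbind, outliers_by_group)
flagged
```

Here only participant 6 is flagged: a rating of 20 is far below the rest of group A.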

4.0.2.5 Adding Color to Your Plots

Expanding on the code from FacetPlot1, I want to tell R to color in the boxplots instead of leaving them the same boring white. By adding fill=Gender to the aes() call, I am telling R to fill each box with a color based on Gender, using the default colors. The scale_fill_brewer() command tells R which color palette to use if I would like something different from the default.

FacetPlot3 = ggplot(HumorData, aes(x=Gender, y=Funniness, fill=Gender)) + geom_boxplot() + facet_grid(~College) + scale_fill_brewer(palette = "Set1")
FacetPlot3

Color use in R could be its own chapter. As I will not be able to do this aspect of plots and charts justice, please see other resources on the internet for help, including this color chart - http://research.stowers-institute.org/efg/R/Color/Chart/ or the following R color cheat sheet which describes different color packages you can install - https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf

4.0.2.6 Boxplots Using Multiple Categorical Variables Without Facets

I can also remove the facets but keep both categorical variables (Gender and College) by dropping the facet_grid() command, putting College on the x axis, and keeping fill=Gender. Also, as you can see, I have removed the scale_fill_brewer() command and we still have colors (R is now showing us the default colors).

CombinedPlot=ggplot(HumorData, aes(x=College, y=Funniness, fill=Gender)) + geom_boxplot() 
CombinedPlot

4.0.2.7 Notched Boxplots

Notched box-plots may be useful in some instances. Not only are they fun to look at, but if two boxes’ notches do not overlap this is ‘strong evidence’ their medians differ (Chambers et al., 1983, p. 62).

To make a plot notched, simply add notch=TRUE to the geom_boxplot command.

CombinedPlotNOTCH=ggplot(HumorData, aes(x=Gender, y=Funniness, fill=Gender)) + geom_boxplot(notch=TRUE) + facet_grid(~College)
CombinedPlotNOTCH

The notches, which start where the sides begin to slope inward, are similar in function to a confidence interval around the median. Among the college graduates, it appears that no part of the notches overlaps for male and female participants. However, the notches do appear to overlap among the people who did not graduate from college. This suggests that there may be a real difference in self-rated funniness between men and women who graduated from college, but that this pattern is not present among those who did not. Nevertheless, I would not solely rely on this visual. If this question is important to you, I would also conduct statistical analyses to be sure (for example, you could use an ANOVA as in Chapters 20 & 21).
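For the curious, the notch is commonly computed as the median plus or minus 1.58 times the interquartile range divided by the square root of the sample size. The sketch below checks that formula against base R's boxplot.stats() on an invented vector. ggplot2 documents the same formula but computes the interquartile range with quantile(), so its notches may differ slightly from these.

```r
ratings <- c(22, 35, 43, 55, 58, 79, 79, 98)   # hypothetical ratings

s <- boxplot.stats(ratings)

# Box height (upper hinge minus lower hinge) and sample size
box_height <- diff(s$stats[c(2, 4)])
n <- length(ratings)

# Notch limits computed by hand
manual_notch <- median(ratings) + c(-1.58, 1.58) * box_height / sqrt(n)

manual_notch
s$conf   # the notch limits reported by boxplot.stats()
```

Because the width shrinks with the square root of n, small groups produce wide notches, which is one more reason not to lean on the visual alone.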

5 Bar Graphs

5.0.1 Preparing the data

Bar graphs are different from boxplots in that the data you use sometimes needs to be in vector or matrix format.

For example you can just make a quick vector using the c() function.

Vector1 <- c(13,29,23,35,16,20)

We can also use the means from the humor example to demonstrate how to do this with the chapter's data. The values below are the four group means from the tapply() output above, rounded to one decimal place. One quick note here: you may notice that I added a backslash followed by n (\n) to the middle of the names in the HumorGroups vector below. This notation tells R to break the name onto a new line at that spot. It can be a handy trick to make graphs look nicer if you have long category names.

HumorMeans <- c(78.5,57.8,42.2,54.3)
HumorGroups <- c("College\nGrad Men","College\nGrad Women","Not College\nGrad Men","Not College\nGrad Women")
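Typing means in by hand invites copy errors, so here is a sketch of building both vectors directly from a tapply() table. The toy data frame and the names ToyMeans and ToyGroups are invented for this illustration.

```r
# Hypothetical six-person dataset, invented for illustration
toy <- data.frame(
  Gender    = factor(c("Men", "Men", "Women", "Women", "Men", "Women"),
                     levels = c("Men", "Women")),
  College   = factor(c("College Grad", "College Grad", "College Grad",
                       "Not College Grad", "Not College Grad", "Not College Grad")),
  Funniness = c(80, 70, 60, 50, 40, 30)
)

# Rows are Gender, columns are College
means_table <- with(toy, tapply(Funniness, list(Gender, College), mean))

# as.vector() reads the table column by column
ToyMeans <- as.vector(means_table)

# Build matching two-line labels in the same column-by-column order
ToyGroups <- as.vector(outer(rownames(means_table), colnames(means_table),
                             paste, sep = "\n"))

ToyMeans
ToyGroups
```

Because both vectors come from the same table, the labels cannot drift out of step with the values.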

5.0.2 Making Simple Bar Charts

You can use the barplot() function to make a simple bar graph with the original vector we made. I realize that the primary purpose of this chapter is to help people use the ggplot2 package to make boxplots and bar graphs. However, I think it is really useful to also know how to make simple graphs in base R. Depending on where your data come from, one package or function may be easier to use than another.

Here is the super simple bar graph using Vector1.

SimpleBar = barplot(Vector1)

However, there is not much use for such a simple bar graph. With the addition of a few more arguments, you can use the same function to make a more useful graph with the humor vectors we made earlier. Inside the parentheses, first place the vector with the values you want in the graph, then supply the names that identify the groups these means belong to using the other vector (names.arg = HumorGroups). Then you can label the graph and change its color. I have colored the bars with col = "green". In this context, typing in the name of most colors will work (e.g. "black", "red", "yellow"). You can also change the color of the border of the bars with the border argument. I have also added a title for the whole chart using main = "Humor Chart", and labels for the x and y axes using the xlab and ylab arguments.

SimpleBar2 = barplot(HumorMeans,names.arg = HumorGroups,xlab = "Gender and Education Groupings",ylab = "Self-Reported Funniness",col = "green",main = "Humor Chart",border = "black")

5.0.3 Creating Stacked Bar Graphs

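As bar graphs get more complicated, ggplot() is a more useful function. You will not need to create vectors or matrices to delineate the data points to use. Instead, you can tell the function to compute the means for you using geom_bar(stat="summary", fun="mean"). In ggplot2 versions before 3.3.0 this argument was called fun.y, so use fun.y="mean" if you are on an older installation. Because the bars are stacked, the total height of each bar is the sum of the two group means, so read each colored segment rather than the full bar. Once again, I have changed the color palette here and added a title using ggtitle("Humor Chart").

StackedBar = ggplot(HumorData, aes(Gender, Funniness, fill = College)) + 
  geom_bar(stat="summary", fun="mean") +   # use fun.y="mean" on ggplot2 versions before 3.3.0
  scale_fill_brewer(palette = "Set2") + ggtitle("Humor Chart")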
StackedBar

5.0.4 Creating Grouped Bar Graphs

Creating grouped bar graphs is pretty simple once you have a sense of how other bar graphs work. geom_bar() stacks bars by default, so you have to add position = "dodge" to the geom_bar() call to place them side by side.

GroupedBar = ggplot(HumorData, aes(Gender, Funniness, fill = College)) + 
  geom_bar(stat="summary", fun="mean", position = "dodge") +   # use fun.y="mean" on ggplot2 versions before 3.3.0
  scale_fill_brewer(palette = "Set2") + ggtitle("Humor Chart")
GroupedBar
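An alternative that some people find easier to reason about is to compute the means first and then plot them with geom_col(), which draws bars at exactly the heights you give it. This is a sketch with an invented dataset, not the chapter's data.

```r
library(ggplot2)

# Hypothetical six-person dataset, invented for illustration
toy <- data.frame(
  Gender    = factor(c("Male", "Male", "Female", "Female", "Male", "Female"),
                     levels = c("Male", "Female")),
  College   = factor(c("College Grad", "College Grad", "College Grad",
                       "Not College Grad", "Not College Grad", "Not College Grad")),
  Funniness = c(80, 70, 60, 50, 40, 30)
)

# Step 1: one row per group, holding that group's mean
cell_means <- aggregate(Funniness ~ Gender + College, data = toy, FUN = mean)

# Step 2: plot the precomputed means side by side
GroupedBarCol <- ggplot(cell_means, aes(x = Gender, y = Funniness, fill = College)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Set2") +
  ggtitle("Humor Chart")
print(GroupedBarCol)
```

Having the means in their own data frame also makes it easy to check the numbers before you trust the picture.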

5.0.5 Saving Your Work

After going through the trouble of creating boxplots and bar graphs, you may want to save them for outside use. First, you need to decide what type of graphics file you want. For a PDF use the pdf() function, for PNG use png(), and for JPEG use jpeg(). Within the parentheses of this function, you indicate what you would like to name the file. On the next line, identify the plot you want to save with the plot() function (print() works the same way for ggplot objects). Lastly, complete the saving process by calling dev.off(). Then you should see your saved file in your working directory! See below for an example using the last bar graph we made above:

png(filename="GroupedBarGraph.png")
plot(GroupedBar)
dev.off()
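For ggplot objects specifically, ggsave() wraps these steps into one call and picks the file type from the file extension. The sketch below saves an invented plot to a temporary folder. The plot, the file name, and the 6 by 4 inch size are all assumptions for the sake of the example.

```r
library(ggplot2)

# Hypothetical plot, invented for illustration
toy <- data.frame(Group = c("A", "B"), Mean = c(3, 5))
p <- ggplot(toy, aes(x = Group, y = Mean)) + geom_col()

# Save to R's temporary folder; swap in your own path and extension as needed
out_file <- file.path(tempdir(), "ExampleBarGraph.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)   # width and height in inches by default

file.exists(out_file)
```

Passing the plot explicitly avoids accidentally saving whichever plot happened to be displayed last.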

6 More Resources

This is only a brief introduction to using the ggplot2 package. For additional types of plots and plot customization tips see Chapter 8. Also, check out the following links for additional web resources for making boxplots and bar graphs:

