Before we begin, ensure that the ggplot2 package is installed. It is required to create the scatterplots and density plots outlined below and to run the commands in this section. Note that the package is named ggplot2 (version 2 of ggplot), although its main plotting command is called ggplot.
Once the package is installed, load it into R using the library command:
library(ggplot2)
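If library(ggplot2) stops with a "there is no package called" error, the package is probably not installed yet. The following is a minimal sketch of an install-if-missing check, assuming you have an internet connection and a CRAN mirror available:

```r
# Install ggplot2 only if it is not already available, then load it
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)

# Report which version was loaded
packageVersion("ggplot2")
```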
Prior to generating scatterplots and density plots, we must load the data we are interested in graphing into R. In this example we will be working with entirely made-up data that lists the number of birds and bird species by a given location’s seasonal temperature. This dataset also records whether each location is classified as a Temperate or Tropical climate. Note that the number of birds, number of bird species, and temperature are all continuous variables, making a scatterplot an appropriate graph for investigating relationships among them. While the variable indicating the climate of each location is a nominal variable, we can use this information later to view scatterplots side-by-side depending on this classification.
As a first step, ensure that you have set your working directory to where your dataset is stored. Note that this is the same directory where you’ll be saving any graphs that you create.
You can set your working directory using the setwd command:
# setwd('/insert data directory here')
As a second step, ensure that your dataset is saved as a .csv file.
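In order to load in our data, we will use the read.table command. This command requires us to assign our data to a named object so that we can refer to this data later. In the example below I am calling our dataset Avian.Data.
The read.table command asks for three pieces of information: the name of the file, the character that separates values in the file (sep=",", a comma for .csv files), and whether the first row of the file contains the variable names (header=TRUE).
Avian.Data = read.table("Chapter_10_data.csv", sep=",", header=TRUE)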
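As a hedged aside, read.csv() is a shortcut for read.table() that already assumes a comma separator and a header row. The sketch below writes a tiny made-up file to a temporary location so that it runs without Chapter_10_data.csv; the names toy.file and Toy.Data are hypothetical and not part of the chapter's dataset.

```r
# Write a tiny made-up CSV to a temporary file (hypothetical stand-in data)
toy.file = tempfile(fileext = ".csv")
writeLines(c("US.sites,Number.of.birds,Climate",
             "1,12,Temperate",
             "2,40,Tropical"), toy.file)

# read.csv() assumes sep = "," and header = TRUE by default
Toy.Data = read.csv(toy.file, stringsAsFactors = TRUE)

# Inspect the column types and the first rows
str(Toy.Data)
head(Toy.Data)
```

The stringsAsFactors = TRUE argument asks R to store text columns such as Climate as factors, which is what lets summary() print a count for each category. In R 4.0 and later this is no longer the default, so without it a text column is summarized only as a character vector.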
After running the above command, you should notice that an object called Avian.Data was created in RStudio’s Global Environment. You can open this dataset and display it as it would appear in Excel by double-clicking this object in the Global Environment pane located in RStudio.
You can also confirm that the data loaded into R correctly by using the summary command, which displays descriptive statistics for each variable in our dataset. In the summary command, be careful to refer to the dataset by the name we gave it in our read.table command, in this case Avian.Data.
summary(Avian.Data)
##     US.sites     Number.of.birds  Number.of.species   Temperature
##  Min.   : 1.00   Min.   : 1.00    Min.   : 1.00       Min.   :50.00
##  1st Qu.: 8.25   1st Qu.:11.25    1st Qu.: 9.00       1st Qu.:57.75
##  Median :15.50   Median :23.50    Median :24.50       Median :67.50
##  Mean   :15.50   Mean   :28.43    Mean   :22.80       Mean   :66.53
##  3rd Qu.:22.75   3rd Qu.:39.50    3rd Qu.:33.75       3rd Qu.:72.00
##  Max.   :30.00   Max.   :85.00    Max.   :45.00       Max.   :88.00
##       Climate
##  Temperate:19
##  Tropical :11
In the example below, we use the ggplot command to call for this dataset while creating the most basic elements of a scatterplot: the x-axis, y-axis, and data points. Note that if you have not properly installed and loaded the ggplot2 package, you will receive an error message.
In the general structure of the ggplot command, we will provide three pieces of information: the dataset (data =), the variable to plot on the x-axis (x =), and the variable to plot on the y-axis (y =).
In this command, ggplot will also plot our data points using the geom_point() portion of the command. Later on, we will revisit this portion of the command in order to customize the colors of these data points.
Note that the x-axis and y-axis names in the command below must match the way these variables are labeled in the dataset itself. In a moment, we will cover how to change how these variables are displayed in our scatterplot.
Also note that if we have not properly loaded our data, as specified above, a scatterplot will not generate and we will receive an error message.
Scatter.1 = ggplot(data = Avian.Data, aes(x = Number.of.birds, y = Temperature))+
geom_point()
Notice that while we have run the above command to create this scatterplot, titled Scatter.1, we do not see a scatterplot generated anywhere. This is because we have not yet asked R to display this scatterplot. To do this, we must type Scatter.1 as a command in order to run this independently, as displayed below. Running this command will automatically generate the scatterplot.
Scatter.1
To create another scatterplot using different variables, simply change the variable names in the x- and y-axes.
Scatter.2 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point()
Remember to call for every new scatterplot in order to view it in R.
Scatter.2
A common mistake when generating scatterplots is the placement of the + symbol. Note that in the above examples the + is located at the end of the first line of the command and does not start the next line. If you place the + at the beginning of the second line, R treats the first line as a complete command, your scatterplot will not generate, and you will receive an error message, as demonstrated below:
#Scatter.2 = ggplot(data = Avian.Data, aes(x = Number.of.birds, y = Temperature))
# +geom_point()
#Error in +geom_point(): invalid argument to unary operator calls
To save this, or any scatterplot, as a JPEG image, we can insert the jpeg() command before the command used to create the scatterplot, putting a name for the image file within the parentheses. We must then add dev.off(), with nothing inside the parentheses, after the command that creates the scatterplot so that the image file is written completely. In the below example I am creating the scatterplot as we just have done and saving it as Scatter1.jpg in my working directory. Note that we must also remove the name of our scatterplot and the = sign from the start of the command, as depicted below, so that the plot is drawn rather than stored.
jpeg('Scatter1.jpg')
ggplot(data = Avian.Data, aes(x = Number.of.birds, y = Temperature))+
geom_point()
dev.off()
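If you would rather keep the scatterplot stored under a name, one option is to print() the stored plot between jpeg() and dev.off(). The sketch below is illustrative only: Toy.Data and Toy.Scatter are hypothetical stand-ins with made-up values, and the file is written to a temporary folder instead of the working directory.

```r
library(ggplot2)

# Hypothetical stand-in for Avian.Data (made-up values)
Toy.Data = data.frame(Number.of.birds = c(5, 12, 20, 35, 50, 80),
                      Temperature     = c(52, 58, 63, 70, 77, 86))

Toy.Scatter = ggplot(data = Toy.Data, aes(x = Number.of.birds, y = Temperature))+
  geom_point()

# Saved to a temporary folder here; use a name like 'Scatter1.jpg' in practice
out.file = file.path(tempdir(), "ToyScatter.jpg")
jpeg(out.file, width = 800, height = 600)  # image size in pixels
print(Toy.Scatter)                         # draw the stored plot on the open device
dev.off()                                  # close the device to finish writing the file
```

The width and height arguments are optional; leaving them out gives the default image size.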
An alternative way to save any scatterplot you generate is to use the command ggsave(). ggsave() is a quick, easy-to-use way to save any graph generated with ggplot, but note that by default it saves only the most recent graph displayed. You can customize the file type of the graph by changing the extension of your desired file name. For instance, in the below example I am specifying that I want to save Scatter1 as a PNG file. The ggsave() command recognizes the extensions eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf (Windows only).
ggsave(filename="Scatter1.png")
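To save a specific stored scatterplot instead of the most recent one, ggsave() also accepts a plot argument, along with optional size and resolution settings. This is a minimal sketch using hypothetical made-up data (Toy.Data, Toy.Scatter) and a temporary folder, so the names and sizes are assumptions to adapt:

```r
library(ggplot2)

# Hypothetical stand-in for Avian.Data (made-up values)
Toy.Data = data.frame(Number.of.birds = c(5, 12, 20, 35, 50, 80),
                      Temperature     = c(52, 58, 63, 70, 77, 86))

Toy.Scatter = ggplot(data = Toy.Data, aes(x = Number.of.birds, y = Temperature))+
  geom_point()

# Name the plot explicitly and set the size (in inches) and resolution
out.file = file.path(tempdir(), "ToyScatter.png")
ggsave(filename = out.file, plot = Toy.Scatter,
       width = 6, height = 4, units = "in", dpi = 300)
```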
To confirm that your scatterplot was saved as an image file, navigate to your working directory to find it there.
When customizing a scatterplot, we can add pieces of code to the most basic command listed above. By doing so we are re-writing the scatterplot in each iteration. Therefore, if you would like to keep different versions of a scatterplot with different formatting, you must give the scatterplot a different name each time.
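Because a scatterplot is stored as an object, new pieces can also be added to a stored plot with + and saved under a new name, which avoids retyping the whole command. A brief sketch, using hypothetical made-up data and names (Toy.Data, Toy.Base, Toy.Clean, Toy.Labeled):

```r
library(ggplot2)

# Hypothetical stand-in for Avian.Data (made-up values)
Toy.Data = data.frame(Number.of.species = c(3, 8, 15, 22, 30, 41),
                      Temperature       = c(52, 58, 63, 70, 77, 86))

# The most basic version of the plot
Toy.Base = ggplot(data = Toy.Data, aes(x = Number.of.species, y = Temperature))+
  geom_point()

# Each new name keeps a separate version; Toy.Base itself is unchanged
Toy.Clean   = Toy.Base + theme_bw()
Toy.Labeled = Toy.Clean + xlab("Number of Bird Species")

Toy.Labeled
```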
Sometimes it is helpful to remove the gray background that is automatically generated in R. To do this, add +theme_bw() to any scatterplot. Notice that from here on out we have immediately called for each scatterplot by entering its name as a command following the generation of the plot.
Scatter.3 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point()+
theme_bw()
Scatter.3
You can also change the shape of the data points by adding shape = inside the geom_point() command. Numeric values represent different types of shapes; in the example below, a numeric value of 1 indicates we want hollow circles as our data points. For a full listing of available shapes, see: http://www.sthda.com/english/wiki/ggplot2-point-shapes
Scatter.4 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point(shape = 1)+
theme_bw()
Scatter.4
To remove the gridlines that are automatically generated in R, add +theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) to any scatterplot.
Scatter.5 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point()+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
Scatter.5
As we saw earlier, when asking R to graph the relationship between variables we use the variable names as they are listed in the dataset. But we can change the way these variables are labeled on the scatterplot by adding +xlab("Number of Bird Species")+ylab("Temperature") to any scatterplot. Note that this command renames the x-axis to Number of Bird Species and the y-axis to Temperature. Ensure that you are renaming the axes appropriately and not mixing them up.
Scatter.6 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point()+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")
Scatter.6
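As an alternative sketch, the labs() command can set both axis labels, and optionally a title, in one step. The data and the title text below are hypothetical and only for illustration:

```r
library(ggplot2)

# Hypothetical stand-in for Avian.Data (made-up values)
Toy.Data = data.frame(Number.of.species = c(3, 8, 15, 22, 30, 41),
                      Temperature       = c(52, 58, 63, 70, 77, 86))

Toy.Labs = ggplot(data = Toy.Data, aes(x = Number.of.species, y = Temperature))+
  geom_point()+
  theme_bw()+
  # x and y name the axes; title is optional
  labs(x = "Number of Bird Species", y = "Temperature",
       title = "Bird Species by Temperature")

Toy.Labs
```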
To change the minimum, maximum, or intermediate points of the scale displayed on the x- or y-axis, add +scale_x_continuous(breaks = …) or +scale_y_continuous(breaks = …), respectively. This command requires that you enter three numbers in the seq() portion of the command: the first tick-mark, the last tick-mark, and the size of the step between tick-marks.
Scatter.7 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point()+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 90, 5))
Scatter.7
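To see exactly which tick-marks a seq() call asks for, it can be run on its own in the console. A short sketch (the names x.breaks and y.breaks are just for illustration):

```r
# seq(first, last, step) lists the tick positions that will be requested
x.breaks = seq(0, 50, 5)
x.breaks
# 0 5 10 15 20 25 30 35 40 45 50

# The same arguments can be written out by name
y.breaks = seq(from = 50, to = 90, by = 5)
y.breaks
# 50 55 60 65 70 75 80 85 90

length(x.breaks)  # 11 tick positions
```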
You’ll notice that although I changed the maximum value of our y-axis to 90 in the above example, the tick-mark for 90 is missing. This happens because the largest temperature in our data is 88, so the plotted axis stops just short of 90. To fix this we can force the y-axis to end at 91 by inserting the command coord_cartesian(ylim = c(50, 91)), which should display the tick-mark for 90 appropriately.
Scatter.8 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point()+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 95, 5))+
coord_cartesian(ylim = c(50, 91))
Scatter.8
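It may help to know why coord_cartesian() is used here instead of setting limits inside scale_y_continuous(): the first only zooms the view, while the second drops any data points outside the limits before plotting. A small sketch with hypothetical made-up data:

```r
library(ggplot2)

# Hypothetical stand-in for Avian.Data (made-up values)
Toy.Data = data.frame(Number.of.species = c(3, 8, 15, 22, 30, 41),
                      Temperature       = c(52, 58, 63, 70, 77, 86))

# Zoom: every point is kept; only the visible window changes
Toy.Zoom = ggplot(data = Toy.Data, aes(x = Number.of.species, y = Temperature))+
  geom_point()+
  coord_cartesian(ylim = c(50, 91))

# Clip: the point at 86 falls outside the limits and is removed (with a warning)
Toy.Clip = ggplot(data = Toy.Data, aes(x = Number.of.species, y = Temperature))+
  geom_point()+
  scale_y_continuous(limits = c(50, 80))

Toy.Zoom
```

The difference matters most once a fitted line is added, because a line fitted after points are dropped can differ from one fitted to all of the data.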
To change the color of all data points to one color, we can add color= inside the parentheses of the geom_point() command. R will recognize basic color names like "red", "blue", and "green", but you can also input any HTML (hexadecimal) color code and R will recognize it. Find HTML color codes here: http://htmlcolorcodes.com/
In the below example we are applying the color red by inserting #EA4931 after the = in color=. Note that this color code is listed within single quotations and includes the # symbol.
Scatter.9 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point(color='#EA4931')+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 90, 5))
Scatter.9
We can change the color of our data points to blue using color code #314DEA.
Scatter.10 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point(color='#314DEA')+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 90, 5))
Scatter.10
We can also give our data points different colors based on another variable. In this dataset we can plot the relationship between Number of Species and Temperature while coloring the data points by Climate. To do this we add aes() inside our geom_point() command and specify that we’d like the color to depend on the Climate variable.
Scatter.11 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point(aes(color=Climate))+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 90, 5))
Scatter.11
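R picks the group colors automatically in the plot above. If you would like to choose them yourself, one option is scale_color_manual(). The sketch below reuses the two color codes from the earlier examples; the data are hypothetical and made up for illustration:

```r
library(ggplot2)

# Hypothetical stand-in for Avian.Data (made-up values)
Toy.Data = data.frame(
  Number.of.species = c(3, 8, 15, 22, 30, 41),
  Temperature       = c(52, 58, 63, 70, 77, 86),
  Climate           = factor(c("Temperate", "Temperate", "Temperate",
                               "Tropical", "Tropical", "Tropical"))
)

Toy.Colors = ggplot(data = Toy.Data, aes(x = Number.of.species, y = Temperature))+
  geom_point(aes(color = Climate))+
  # One color per level of Climate, matched by name
  scale_color_manual(values = c(Temperate = '#314DEA', Tropical = '#EA4931'))+
  theme_bw()

Toy.Colors
```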
We can also add color to our data points based on another continuous variable by adding color = within our first line of code and adding +scale_color_gradientn(colors = rainbow(5)). The first change specifies that we’d like to add color based on a third variable, in this case the Number.of.birds variable. The second change specifies the color gradient we’d like to use, in this case a rainbow consisting of 5 colors. We can customize how many colors this rainbow spectrum uses by changing the numeric value.
Scatter.12 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature, color = Number.of.birds))+
geom_point()+
scale_color_gradientn(colors = rainbow(5))+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 90, 5))
Scatter.12
Rather than display data points colored by another continuous variable, we can size these data points by that variable to visually investigate the same relationship. To do so we add the specification size = rather than color = like we did above. Again, we will look at the relationship between Number of Species and Temperature, with data points sized by the Number of Birds.
Scatter.13 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature, size = Number.of.birds))+
geom_point()+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 90, 5))
Scatter.13
To visualize the relationship among these three continuous variables using both size and color, we simply add back our color = specification alongside size =, together with the scale_color_gradientn() line:
Scatter.14 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature, size = Number.of.birds, color = Number.of.birds))+
geom_point()+
scale_color_gradientn(colors = rainbow(5))+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 90, 5))
Scatter.14
If we want to view the scatterplot organized by a nominal variable, we can add what R calls facets by adding +facet_grid(.~ ) to our scatterplot command. After the .~ symbol we need to specify the name of the variable that we want to split our data by, as it appears in our dataset. In this example we are splitting the data by our variable Climate.
Scatter.15 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point()+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 90, 5))+
facet_grid(.~Climate)
Scatter.15
We can change the orientation of our facets by switching the placement of the . and our variable around the ~ inside our facet_grid() command. Placing .~ before our variable Climate, as above, arranges the facets side by side in columns. Placing ~. after our variable Climate, as below, stacks the facets on top of each other in rows. Note that the ~ symbol is always placed closest to our variable and switches sides in the command when the orientation changes.
Scatter.16 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point()+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 90, 5))+
facet_grid(Climate~.)
Scatter.16
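A related command, facet_wrap(), lets you state the layout directly through the number of rows or columns, which some find easier to remember than the position of the ~ symbol. A minimal sketch with hypothetical made-up data:

```r
library(ggplot2)

# Hypothetical stand-in for Avian.Data (made-up values)
Toy.Data = data.frame(
  Number.of.species = c(3, 8, 15, 22, 30, 41),
  Temperature       = c(52, 58, 63, 70, 77, 86),
  Climate           = factor(c("Temperate", "Temperate", "Temperate",
                               "Tropical", "Tropical", "Tropical"))
)

# ncol = 1 stacks the panels in a single column;
# ncol = 2 would place them side by side
Toy.Facets = ggplot(data = Toy.Data, aes(x = Number.of.species, y = Temperature))+
  geom_point()+
  theme_bw()+
  facet_wrap(~ Climate, ncol = 1)

Toy.Facets
```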
To add a loess line to any scatterplot to better visualize relationships, we can add +geom_smooth(color='black') to our scatterplot code. The loess line is a smoothed curve drawn through our set of data points in order to visualize the relationship among them.
Note that we can change the color of this line in the same way we changed the color of our data points. In this example I am plotting a black line. Within the code below I have listed the command for changing the color of the data points directly above the command for creating the loess line for simple viewing.
Scatter.17 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point(aes(color=Climate))+
geom_smooth(color='black')+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 90, 5))
Scatter.17
If the loess line suggests that a linear relationship exists among the data points, then we can add a straight line of best fit to any scatterplot by adding method = "lm" to this command, to indicate that we would like to fit a linear model ("lm"). By adding se = TRUE we specify that we would also like to see a shaded band around the line. By default this band is the 95% confidence interval, which is computed from the standard error ("se").
Scatter.18 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point(aes(color=Climate))+
geom_smooth(method = "lm", se = TRUE, color='black')+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 90, 5))
Scatter.18
If we would like to remove the shaded band around the line, we change this code to se = FALSE.
Scatter.19 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point(aes(color=Climate))+
geom_smooth(method = "lm", se = FALSE, color='black')+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 90, 5))
Scatter.19
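The straight line drawn by geom_smooth(method = "lm") comes from an ordinary linear model of the y variable on the x variable. If you would like to see the intercept and slope behind the line, the same model can be fitted directly with lm(). The sketch below uses hypothetical made-up data, so the numbers it prints say nothing about the chapter's dataset:

```r
# Hypothetical stand-in for Avian.Data (made-up values)
Toy.Data = data.frame(Number.of.species = c(3, 8, 15, 22, 30, 41),
                      Temperature       = c(52, 58, 63, 70, 77, 86))

# Same model that geom_smooth(method = "lm") fits: y as a function of x
Toy.Fit = lm(Temperature ~ Number.of.species, data = Toy.Data)

coef(Toy.Fit)                # intercept and slope of the line
summary(Toy.Fit)$r.squared   # share of variance explained by the line
```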
Note we can change the type of line we are fitting to the data mathematically, changing this into a quadratic or cubic term for example. More information on how to transform the best fitting line can be found here: http://stackoverflow.com/questions/14927004/add-fitted-quadratic-curve
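As one possible illustration of this idea, a quadratic curve can be requested by passing a formula to geom_smooth(). This is a sketch with hypothetical made-up data, not the only way to fit a curved line:

```r
library(ggplot2)

# Hypothetical stand-in for Avian.Data (made-up values)
Toy.Data = data.frame(Number.of.species = c(3, 8, 15, 22, 30, 41),
                      Temperature       = c(52, 58, 63, 70, 77, 86))

Toy.Quadratic = ggplot(data = Toy.Data, aes(x = Number.of.species, y = Temperature))+
  geom_point()+
  # poly(x, 2) fits a quadratic; change the 2 to a 3 for a cubic
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = 'black')+
  theme_bw()

Toy.Quadratic
```

Inside the formula, x and y refer to whichever variables were mapped to the x- and y-axes, not to column names in the dataset.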
We can also add lines for each sub-group of another variable, like Climate in this dataset. To do so we add back aes() and specify that color = should equal our other variable.
Scatter.20 = ggplot(data = Avian.Data, aes(x = Number.of.species, y = Temperature))+
geom_point(aes(color=Climate))+
geom_smooth(method = "lm", se = FALSE, aes(color=Climate))+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
xlab("Number of Bird Species")+ylab("Temperature")+
scale_x_continuous(breaks = seq(0, 50, 5))+
scale_y_continuous(breaks = seq(50, 90, 5))
Scatter.20
Density plots can be a useful tool to quickly visualize how the values of a variable are distributed. A density plot shows where the values of one variable are concentrated and, by plotting multiple density plots on top of each other, we can see whether the distributions for different groups overlap. If the density plots do not overlap much, this is an indicator that the variable we plotted differs across the levels of the grouping variable. Note that the creation of density plots using ggplot uses many of the same embedded commands that were customized above.
The most commonly customized feature of the density plot is the opacity of the fill color used to plot the data distribution, which is set with alpha = inside the geom_density command. In the below example we are plotting the distribution of the Number of Birds as a function of Climate.
Density.1 = ggplot(data = Avian.Data, aes(Number.of.birds, fill=Climate))+
geom_density(alpha = 1.0)+
xlab("Number of Birds")+
theme_bw()
Density.1
While the distributions of the number of birds overlap across the two climates, this is difficult to see because the color fill is fully opaque. Note that the opacity of the color fill (alpha) above was set to 1.0. If we change this to 0.3, it becomes easier to visualize this overlap.
Density.2 = ggplot(data = Avian.Data, aes(Number.of.birds, fill=Climate))+
geom_density(alpha = 0.3)+
xlab("Number of Birds")+
theme_bw()
Density.2
We can also plot the distribution of Number of Species by Climate. When doing this, be sure to change the variable we are plotting in the first line as well as the axis label in the third line of code.
Density.3 = ggplot(data = Avian.Data, aes(Number.of.species, fill=Climate))+
geom_density(alpha = 0.3)+
xlab("Number of Species")+
theme_bw()
Density.3
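A density plot can be complemented with a quick numeric check of each group's center. The following sketch uses tapply() to compute the median for each climate; the data are hypothetical and made up, so the values do not describe the chapter's dataset:

```r
# Hypothetical stand-in for Avian.Data (made-up values)
Toy.Data = data.frame(
  Number.of.species = c(3, 8, 15, 22, 30, 41),
  Climate           = factor(c("Temperate", "Temperate", "Temperate",
                               "Tropical", "Tropical", "Tropical"))
)

# Median number of species within each level of Climate
Species.Medians = tapply(Toy.Data$Number.of.species, Toy.Data$Climate, median)
Species.Medians
# Temperate: 8, Tropical: 30
```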
Based on these plots, climate appears to matter for the number of bird species in a given location, but not for the number of birds.
Resources on using R to create Scatterplots and Density Plots: