Welcome! If you’re here you probably want to learn how to put stuff into other stuff, or take stuff out of stuff and do stuff with that stuff. I will be your guide on that journey. In this tutorial, I would like to accomplish two basic goals:
1. Provide basics on how to pull information out of lists, vectors and data frames
2. Use that information to perform different functions
The only package you’ll need for this excursion is datasets, which ships with base R and is attached by default, so the call below is only there to make the dependency explicit.
library(datasets)
In Chapter 1, we learned that R is a very flexible tool that can deal with all kinds of data. More specifically, R stores data as
* logicals (TRUE and FALSE values, or Boolean)
* numeric (or double)
* character (factors look similar, but they are stored as integer codes whose labels are limited to a fixed set of levels)
I will mainly speak in terms of numeric, character, and logical data, but there are many further distinctions that can be made (e.g. between integer and double, or character and factor) that are beyond the scope of this chapter. These different types of data can be stored in different structures. If you only have one type of data (e.g. a collection of numbers), and your data are one dimensional, your values can be stored as an atomic vector. If you have multiple types of data that need to be stored in one dimension, you can use a list. A list is a more flexible version of an atomic vector.
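If that distinction feels abstract, here is a minimal sketch of it. The names mixed.vec and mixed.list are made up for illustration and aren’t used elsewhere in this chapter.

```r
# c() builds an atomic vector, so mixed input is coerced to a single type
mixed.vec <- c(1, "a", TRUE)
class(mixed.vec)            # "character"

# list() keeps every element's own type
mixed.list <- list(1, "a", TRUE)
sapply(mixed.list, class)   # "numeric" "character" "logical"
```

The vector quietly turned everything into text, while the list left each element alone.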
If our data are two dimensional, we’ll be using data frames (or matrices, but this chapter will focus on data frames as they are the more common of the two). Think of a data frame as a collection of equal-length atomic vectors, one per column. If you try to build a column from a vector with elements of different types, R coerces that vector to the most flexible type present (most flexible being character, least flexible being logical). In the example below the coercion happens even earlier, because cbind() first builds a matrix, and a matrix can hold only one type.
d<-data.frame(cbind(a=1:3, b=2:4))
class(d$a)
## [1] "integer"
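d2<-data.frame(cbind(a=c(1,"a",FALSE), b=c(2,4,FALSE)), stringsAsFactors=TRUE) # since R 4.0.0, strings only become factors if stringsAsFactors=TRUE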
class(d2$b)
## [1] "factor"
levels(d2$a)
## [1] "1" "a" "FALSE"
d$a
## [1] 1 2 3
I know what you’re thinking: this is all very basic, and we know most of this already. Well sure, but it’s all important to note because how you store your data will affect how you index it. Lists, for example, mostly use different indexing operators than vectors or data frames do. What are these indexing operators, you might ask? Well, the three that I will be covering are the [], [[]], and $ operators. I will start with the most basic, [] and $, and then move on to the somewhat rarer [[]].
new.vec<-c(2:7)
One way of pulling values out of this vector would be to ask R for values in certain positions. This is done with the [] operator, like so:
new.vec[1]
## [1] 2
new.vec[1:2]
## [1] 2 3
new.vec[-(1:2)]# adding a negative in front says to include all values except the 1st and 2nd
## [1] 4 5 6 7
new.vec[c(1,4,6)]
## [1] 2 5 7
new.vec[c(T,F,F,T,F,T)] #note we're feeding a vector of boolean, the first value in the vector is "true" so the 1st value in our new.vec will be retained, and so on
## [1] 2 5 7
And if your vector has names, you can pull values out based on those names:
new.vec<-setNames(new.vec, letters[1:6])
new.vec
## a b c d e f
## 2 3 4 5 6 7
new.vec["a"]
## a
## 2
new.vec[c("a","c","d")]
## a c d
## 2 4 5
Okay, okay, that’s all well and good. You may be thinking “My problems are much more complicated than that, I’ve got data sets and variables…etc”. You’re right! We’re getting there.
Data frames, the typical way of organizing several variables’ worth of data, will usually use the [] and $ operators to index and subset. Let’s look at a small set of data on happiness ratings of people in the treatment and control groups in some experiment, where rows are participants.
my.data<-data.frame(happiness=c(1,4,7,7,6,6), group=c("control","control","control","treatment","treatment","treatment"))
Why don’t we try indexing the first two entries like we did before!
my.data[1:2]
## happiness group
## 1 1 control
## 2 4 control
## 3 7 control
## 4 7 treatment
## 5 6 treatment
## 6 6 treatment
This gave us everything! See, now that we’re in two dimensions, we want to specify the rows and columns we want to extract from our data. The default response there gave us all rows of columns 1 and 2, because a single index on a data frame selects columns. Let’s try some of the kinds of extracting that we’ve seen before. Use a comma within the square brackets when indexing data frames: everything before the comma specifies rows, everything after specifies columns.
my.data[1:2,]
## happiness group
## 1 1 control
## 2 4 control
my.data[,1:2]
## happiness group
## 1 1 control
## 2 4 control
## 3 7 control
## 4 7 treatment
## 5 6 treatment
## 6 6 treatment
my.data[1:3,1:2]
## happiness group
## 1 1 control
## 2 4 control
## 3 7 control
my.data[c(T,F,F,T,F,T),c(1,2)]
## happiness group
## 1 1 control
## 4 7 treatment
## 6 6 treatment
my.data["happiness"] # works like our named vector did, gave us all the elements in the happiness variable
## happiness
## 1 1
## 2 4
## 3 7
## 4 7
## 5 6
## 6 6
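One subtlety worth a quick look: depending on how you ask for a column, you get back either a one-column data frame or a plain vector. A minimal sketch, using a copy of the data that I’m calling demo.data for illustration:

```r
demo.data <- data.frame(happiness = c(1, 4, 7, 7, 6, 6),
                        group = rep(c("control", "treatment"), each = 3))

class(demo.data["happiness"])                  # "data.frame" (one column, still a data frame)
class(demo.data[, "happiness"])                # "numeric"    (dropped down to a vector)
class(demo.data[["happiness"]])                # "numeric"
class(demo.data$happiness)                     # "numeric"
class(demo.data[, "happiness", drop = FALSE])  # "data.frame" (drop = FALSE keeps the structure)
```

Which one you want depends on what the next function in line expects.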
We can also use the $ operator to select specific variables in our data set and write expressions that must be met in order for data to be included.
my.data[my.data$happiness<6,] #returns all participants whose happiness ratings are less than 6
## happiness group
## 1 1 control
## 2 4 control
my.data[my.data$group=="control",]
## happiness group
## 1 1 control
## 2 4 control
## 3 7 control
my.data[!my.data$group=="control",] # the "!" is essentially saying "not" at the beginning of the expression, so this code returns rows where group is not the control group
## happiness group
## 4 7 treatment
## 5 6 treatment
## 6 6 treatment
my.data[my.data$group=="control" & my.data$happiness<6,] #can use lots of different expressions
## happiness group
## 1 1 control
## 2 4 control
my.data[my.data$group=="control"| my.data$happiness==7,] # the | denotes an or statement
## happiness group
## 1 1 control
## 2 4 control
## 3 7 control
## 4 7 treatment
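One thing the examples above don’t run into is missing data. As a sketch, assuming a made-up data frame demo.na in which one participant skipped the happiness rating, here is what a condition does with an NA and one way around it:

```r
demo.na <- data.frame(happiness = c(1, NA, 7),
                      group = c("control", "control", "treatment"))

# NA < 6 is NA, and an NA in the row index produces a row full of NAs
with.na <- demo.na[demo.na$happiness < 6, ]
nrow(with.na)   # 2

# which() returns only the positions that are TRUE, so the NA row is skipped
no.na <- demo.na[which(demo.na$happiness < 6), ]
nrow(no.na)     # 1
```

Whether dropping those rows is the right call depends on your analysis, so treat this as one option rather than the rule.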
Now we enter list land, where the [[]] operator enters the scene. If we try and use the [] operator when working with lists, we’re going to have a bad time. This is because, instead of returning the object inside the list that we want in its natural form, it returns that object as a single element list, which is usually uselist, er, I mean useless. Thus, we’ll want to use [[]] to give us the data type that has been stored in the list. This is all weird and kind of conceptual, so let’s use an example.
my.list<-list(stuff=1, junk=T, things="a") # list of 3
class(my.list[3]) # oh no! this should be a character not a list!
## [1] "list"
class(my.list[[3]])
## [1] "character"
class(my.list$things) #keep in mind you can also use the dollar sign to pull from lists
## [1] "character"
Why might this matter? Well, if you’re trying to pull elements out of a list to use in some function, and that function requires a character or numeric while [] always spits out a list, you won’t be able to run your function. Watch:
new.list<-list(stuff=1:5, junk=T, things="a")
mean(new.list[1]) # we get a warning and an NA instead of a mean! Just as the prophecy foretold!
## Warning in mean.default(new.list[1]): argument is not numeric or logical:
## returning NA
## [1] NA
mean(new.list[[1]]) # much better
## [1] 3
mean(new.list$stuff) # this also works
## [1] 3
This is a very simple mistake, but can be quite hard to detect. It’s important to know the types of arguments your functions take, and whether your data are in that format or not. This will save you loads of time in the long run.
You’ll notice in the example above, I have a list of 3 where the first entry is a numeric vector, the second entry is a single logical element, and the last entry is a single character element. What if I want to extract one or several of the numbers from that numeric vector?
new.list[[1]][1:3]
## [1] 1 2 3
new.list$stuff[1:3]
## [1] 1 2 3
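To be fair to [], it does have a job in list land: it is the tool for keeping several elements at once, as a smaller list. A short sketch, with demo.list standing in for the list above:

```r
demo.list <- list(stuff = 1:5, junk = TRUE, things = "a")

sub.list <- demo.list[c("stuff", "things")]  # still a list, now with 2 elements
class(sub.list)           # "list"
length(sub.list)          # 2

demo.list[["stuff"]][2]   # 2 ([[ ]] takes a name too, then [ ] indexes inside it)
```

So a rough rule of thumb: [] when you want a smaller list, [[]] or $ when you want the thing inside.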
Now that we know how to take the things we want out of the data structures we have, we need to know where to put these things. In some cases we want to initialize a vector or some other structure that we can store things in. This is especially helpful in loops, or with large amounts of data, where computational resources are limited. Growing an object one entry at a time becomes burdensome because R typically has to copy the whole object each time it expands. With pre-allocation, we can do work on large data much faster. Let’s run through an example of how to store things in pre-allocated data structures.
blank.vec<-vector(mode="character", length = 10) # creates a blank character vector of length 10
blank.vec
## [1] "" "" "" "" "" "" "" "" "" ""
blank.vec[1]<-"No longer blank"
blank.vec
## [1] "No longer blank" "" ""
## [4] "" "" ""
## [7] "" "" ""
## [10] ""
We can populate this vector with values we pull from other data as well! Take a look!
blank.vec[2:7]<-my.data$happiness
blank.vec
## [1] "No longer blank" "1" "4"
## [4] "7" "7" "6"
## [7] "6" "" ""
## [10] ""
blank.vec[7]<-my.data$happiness
## Warning in blank.vec[7] <- my.data$happiness: number of items to replace is
## not a multiple of replacement length
blank.vec
## [1] "No longer blank" "1" "4"
## [4] "7" "7" "6"
## [7] "1" "" ""
## [10] ""
Here we assigned the happiness scores on our old data frame to the second through seventh elements of our blank vector. Creating vectors and storing results in them like this is a great way to manage your data in R if you know how long your input is. Otherwise…
blank.vec[2:8]<-my.data$happiness
## Warning in blank.vec[2:8] <- my.data$happiness: number of items to replace
## is not a multiple of replacement length
blank.vec
## [1] "No longer blank" "1" "4"
## [4] "7" "7" "6"
## [7] "6" "1" ""
## [10] ""
blank.vec[7]<-my.data$happiness
## Warning in blank.vec[7] <- my.data$happiness: number of items to replace is
## not a multiple of replacement length
blank.vec
## [1] "No longer blank" "1" "4"
## [4] "7" "7" "6"
## [7] "1" "1" ""
## [10] ""
R spit out a warning here, but it still performed the function. If you look at the output of the first example there, you’ll notice that the 8th entry has been written over with the 1st entry from our my.data$happiness vector. That’s because we tried to assign a vector of length 6 to 7 elements in our vector, so R recycled the values from the start to fill the gap. Be careful doing this!
Similarly, when we try to store a bunch of values in just one entry in our vector, it spits out a warning and only uses the first value in the vector we tried to store.
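Since pre-allocation earns its keep inside loops, here is a minimal sketch of that pattern. The names n and squares are made up for the example:

```r
n <- 5
squares <- vector(mode = "numeric", length = n)  # pre-allocated: 0 0 0 0 0

for (i in seq_len(n)) {
  squares[i] <- i^2   # fill slot i; the vector never has to grow
}

squares  # 1 4 9 16 25
```

Because the vector already has n slots, each pass through the loop only overwrites one entry, and there is no length mismatch for R to warn about.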
We can also save our data into data frames we create ourselves.
df<-as.data.frame(matrix(ncol=2, nrow=10)) # this creates a data frame with 2 columns and 10 rows
df[,1]<-1:10#filled the first column with the numbers 1 through 10
df
## V1 V2
## 1 1 NA
## 2 2 NA
## 3 3 NA
## 4 4 NA
## 5 5 NA
## 6 6 NA
## 7 7 NA
## 8 8 NA
## 9 9 NA
## 10 10 NA
df[1,]<-1:2 #filled the first row with the numbers 1 and 2
df
## V1 V2
## 1 1 2
## 2 2 NA
## 3 3 NA
## 4 4 NA
## 5 5 NA
## 6 6 NA
## 7 7 NA
## 8 8 NA
## 9 9 NA
## 10 10 NA
names(df)<-c("stuff", "junk")
df[,"junk"]<-c("word", "people", "places", "things","ideas","I", "am", "done", "making","words") # filled the "junk" column with a character vector
df
## stuff junk
## 1 1 word
## 2 2 people
## 3 3 places
## 4 4 things
## 5 5 ideas
## 6 6 I
## 7 7 am
## 8 8 done
## 9 9 making
## 10 10 words
You can also tack data onto a data frame with the $ operator, like so:
df$alphabet<-letters[1:10] # assigns the 1st through 10th letters to a vector "alphabet" and stores that vector as a third column in our data frame
df$dv<-c(1,7,7,4,4,6,6,3,3,3)
df
## stuff junk alphabet dv
## 1 1 word a 1
## 2 2 people b 7
## 3 3 places c 7
## 4 4 things d 4
## 5 5 ideas e 4
## 6 6 I f 6
## 7 7 am g 6
## 8 8 done h 3
## 9 9 making i 3
## 10 10 words j 3
You could also initialize and store data in a list. To do this, you would need only change the mode = argument in the vector function to "list" and store data accordingly.
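A quick sketch of what that could look like. The name blank.list is my own, chosen to mirror blank.vec:

```r
blank.list <- vector(mode = "list", length = 3)  # three empty (NULL) slots

blank.list[[1]] <- 1:5    # use [[ ]] to put an object into a slot
blank.list[[2]] <- TRUE
blank.list[[3]] <- "a"

names(blank.list) <- c("stuff", "junk", "things")
str(blank.list)
```

Unlike the character vector, each slot keeps whatever type you put in it.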
I will be going through a quick example of how to use some of the indexing techniques we’ve learned on a real data set. The data I’m using are part of the datasets package in R. My story here is that we’re trying to develop some growth hormone that will make our plants yield tons of fruit. The data set includes two variables, the weight of the yields of the plant, labeled weight, and the treatment given to that plant, labeled group.
df.plant<-PlantGrowth
str(df.plant)
## 'data.frame': 30 obs. of 2 variables:
## $ weight: num 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
## $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
df.plant
## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 3 5.18 ctrl
## 4 6.11 ctrl
## 5 4.50 ctrl
## 6 4.61 ctrl
## 7 5.17 ctrl
## 8 4.53 ctrl
## 9 5.33 ctrl
## 10 5.14 ctrl
## 11 4.81 trt1
## 12 4.17 trt1
## 13 4.41 trt1
## 14 3.59 trt1
## 15 5.87 trt1
## 16 3.83 trt1
## 17 6.03 trt1
## 18 4.89 trt1
## 19 4.32 trt1
## 20 4.69 trt1
## 21 6.31 trt2
## 22 5.12 trt2
## 23 5.54 trt2
## 24 5.50 trt2
## 25 5.37 trt2
## 26 5.29 trt2
## 27 4.92 trt2
## 28 6.15 trt2
## 29 5.80 trt2
## 30 5.26 trt2
Oh no! It looks like our research assistants accidentally put some of our radiated Miracle Gro in some of our treatment 2 plants, as the first two entries here yielded huge amounts of fruit. I’m going to add some outliers to our data set that we’ll be able to see.
outliers<-data.frame(weight=c(50,51), group=c("trt2", "trt2")) # creates two plants in treatment 2 that yielded 50 and 51 lbs of fruit respectively
df.plant<-rbind(outliers, df.plant) # binds the data frame with our outliers to the front of our main data frame
head(df.plant)
## weight group
## 1 50.00 trt2
## 2 51.00 trt2
## 3 4.17 ctrl
## 4 5.58 ctrl
## 5 5.18 ctrl
## 6 6.11 ctrl
We probably don’t want to include these outliers in any of our analyses. I’m going to exclude any plants from our data set whose yield weight is more than 3 standard deviations above our mean.
outlier.value<-mean(df.plant$weight)+3*sd(df.plant$weight)
df.plant.trim<-df.plant[df.plant$weight<outlier.value,]
head(df.plant.trim)
## weight group
## 3 4.17 ctrl
## 4 5.58 ctrl
## 5 5.18 ctrl
## 6 6.11 ctrl
## 7 4.50 ctrl
## 8 4.61 ctrl
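If you find yourself trimming more than one variable this way, the cutoff can be wrapped in a small helper. This is only a sketch: upper.cutoff is a hypothetical function of my own, not something from base R, and weights is a plain vector rebuilt from PlantGrowth plus our two fake outliers.

```r
library(datasets)

# hypothetical helper: mean plus k standard deviations
upper.cutoff <- function(x, k = 3) {
  mean(x) + k * sd(x)
}

weights <- c(50, 51, PlantGrowth$weight)             # the two outliers plus the 30 real plants
trimmed <- weights[weights < upper.cutoff(weights)]  # keep values below the cutoff

length(weights)   # 32
length(trimmed)   # 30
max(trimmed)      # 6.31
```

Keep in mind that the outliers themselves pull the mean and standard deviation upward, so a rule like this is a convenience, not a guarantee that every odd value gets caught.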
Suppose we want to compare group means for each treatment before we do any formal analyses. We could extract the means and standard deviations for each group and throw them in a pre-allocated data frame!
group.data<-as.data.frame(matrix(ncol=3, nrow=3))
names(group.data)<-c("treatment.group", "means", "sds")
group.data[,"treatment.group"]<-c("ctrl","trt1","trt2")
group.data
## treatment.group means sds
## 1 ctrl NA NA
## 2 trt1 NA NA
## 3 trt2 NA NA
#Pulls out the weight and group for the control, treatment 1, and treatment 2 groups respectively
dat.ctrl<-df.plant.trim[df.plant.trim$group=="ctrl",]
dat.trt1<-df.plant.trim[df.plant.trim$group=="trt1",]
dat.trt2<-df.plant.trim[df.plant.trim$group=="trt2",]
#the function "subset" accomplishes the same goal if you feed it first your data and second the condition rows must meet to be kept; column names can be used directly inside it
df.plant.subset.ctrl<-subset(df.plant.trim, group=="ctrl")
#Stores the means and standard deviations of each group in a data frame
group.data[,"means"]<-c(mean(dat.ctrl$weight),mean(dat.trt1$weight), mean(dat.trt2$weight))
group.data[,"sds"]<-c(sd(dat.ctrl$weight),sd(dat.trt1$weight), sd(dat.trt2$weight))
group.data
## treatment.group means sds
## 1 ctrl 5.032 0.5830914
## 2 trt1 4.661 0.7936757
## 3 trt2 5.526 0.4425733
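Doing this by hand is good indexing practice, but base R can also split by group for you. Here is a sketch using tapply and aggregate on the untouched PlantGrowth data (which matches our trimmed data once the two fake outliers are gone). The names group.means, group.sds, and group.summary are mine:

```r
library(datasets)

# apply a function to weight within each level of group
group.means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
group.sds   <- tapply(PlantGrowth$weight, PlantGrowth$group, sd)
group.means   # ctrl 5.032, trt1 4.661, trt2 5.526

# the same means, returned as a data frame
group.summary <- aggregate(weight ~ group, data = PlantGrowth, FUN = mean)
group.summary
```

The numbers should agree with the table we just built, which is a handy sanity check on the manual indexing.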
Or, if it turns out our research assistants put club soda in as treatment 2, we may just want to subset our data to only include the control and treatment 1 groups, like so.
df.plant.subset.notrt2.1<-subset(df.plant.trim,!(df.plant.trim$group=="trt2"))
df.plant.subset.notrt2.2<-df.plant.trim[df.plant.trim$group=="ctrl"|df.plant.trim$group=="trt1",]
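A third way to write that filter, sketched here on the original PlantGrowth data (where group is a factor), uses %in% to list the groups to keep. The names keep.groups and no.trt2 are made up for the example:

```r
library(datasets)

keep.groups <- c("ctrl", "trt1")
no.trt2 <- PlantGrowth[PlantGrowth$group %in% keep.groups, ]
nrow(no.trt2)            # 20

# the factor still remembers the level we filtered out
levels(no.trt2$group)    # "ctrl" "trt1" "trt2"

# droplevels() forgets levels that no longer appear in the data
no.trt2$group <- droplevels(no.trt2$group)
levels(no.trt2$group)    # "ctrl" "trt1"
```

The leftover level is harmless for indexing, but it can show up as an empty group in tables and plots, so dropping it is often worth the extra line.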
Hopefully now you have a slightly better understanding of how to access, pull out, and store data in/from different data structures. Thank you, and good night.