# 1 Setup

NOTE ON PACKAGES: I use the following packages in this chapter. If you do not have these packages installed, you may use the script below to install them and load their libraries. I use the kable function of the “knitr” package to create a table in section 5.2 B, and I use the dcast function of the “reshape2” package to restructure that table, also in section 5.2 B.

#install.packages("reshape2")
#install.packages("knitr")
library(reshape2)
library(knitr)

Before beginning, you may set your working directory if you desire:

#setwd("C:/Users/Felix/Desktop/R Working Directory")

# 2 The rnorm Function and Basic Simulation

In this section, I will introduce the rnorm function and provide basic use cases in the context of simulating observations drawn from a normally distributed population.

The rnorm function is used to generate a specified number of “samples” from a normally distributed population with a user-specified mean and standard deviation. As such, the rnorm function is often used to simulate empirical data collected from a population with known mean and standard deviation. The rnorm function uses the following syntax and arguments:

rnorm(n, mean = 0, sd = 1)

Where n is the number of “observations” to be generated, “mean” is the user-specified population mean, and “sd” is the user-specified standard deviation of the population. To begin to show the usefulness of this function, I will introduce a few basic examples.

## 2.1 A. Using rnorm(): Basic Examples and the set.seed() Function.

Let’s use the rnorm function to simulate 10 observations from a population with a mean of 15 and a standard deviation of 2. I will store these 10 generated values in an object I call “rnorm.example” in order to be able to find the mean of these values (and potentially perform other manipulations and calculations on them later on).

rnorm.example<-rnorm(10, mean=15, sd=2)
rnorm.example
##   12.63554 14.68902 17.56492 12.15005 17.01833 15.01955 13.22323
##   15.45385 16.45164 12.63038
mean(rnorm.example)
##  14.68365

It is important to note that while these observations are drawn from a normally distributed population with a mean of 15 and a standard deviation of 2, the mean of this sample is NOT 15, as seen above. The rnorm function does not create a normally distributed sample with specified mean, sd, and number of observations - it draws observations from a theoretical population with specified parameters.

Note that performing the same rnorm operation will result in a different set of values being generated each time:

rnorm(10, mean = 15, sd=2)
##   12.89967 18.11219 12.27217 12.36293 14.21295 16.19919 14.48325
##   12.78139 14.30000 14.81097
rnorm(10, mean = 15, sd=2)
##   15.122720 15.346049 13.362262  9.892635 14.313268 17.725910 14.653110
##   17.224562 14.288682 13.297050

By contrast, running the named object “rnorm.example” repeatedly returns the same set of values each time:

rnorm.example
##   12.63554 14.68902 17.56492 12.15005 17.01833 15.01955 13.22323
##   15.45385 16.45164 12.63038
rnorm.example
##   12.63554 14.68902 17.56492 12.15005 17.01833 15.01955 13.22323
##   15.45385 16.45164 12.63038

However, re-running the original expression and then calling the object will result in the values of “rnorm.example” being changed each time the expression is re-run:

rnorm.example<-rnorm(10, mean=15, sd=2)
rnorm.example
##   15.127496 17.443617 16.709997 14.789930 13.768781  9.610574 12.011626
##   15.164009 17.854578 13.936753
rnorm.example<-rnorm(10, mean=15, sd=2)
rnorm.example
##   17.85742 16.76920 11.97688 19.16544 14.06496 16.72299 18.01710
##   13.43506 13.06632 12.10678

As such, if you want to obtain same set of numbers each time you run a chunk of code, you must use the set.seed() function.

If you want to generate the same numbers each time you run an rnorm expression, you can use the set.seed function, which uses the following syntax: set.seed(seed)

In which “seed” is an integer that sets the seed, such that specifying that same integer later will recall the particular set of values associated with that seed integer.

To see how setting a given seed will produce the same values every time that seed is specified, even if the expression is re-run, let’s reproduce the previous code chunk but specify a seed of “1” for “rnorm.example”:

set.seed(1)
rnorm.example<-rnorm(10, mean=15, sd=2)
rnorm.example
##   13.74709 15.36729 13.32874 18.19056 15.65902 13.35906 15.97486
##   16.47665 16.15156 14.38922
rnorm.example<-rnorm(10, mean=15, sd=2)
set.seed(1)
rnorm.example<-rnorm(10, mean=15, sd=2)
rnorm.example
##   13.74709 15.36729 13.32874 18.19056 15.65902 13.35906 15.97486
##   16.47665 16.15156 14.38922

Unlike in the previous chunk of code, this time re-running the expression does not change the numbers in “rnorm.example”, since we set a seed. What happens when we specify a seed other than 1?

set.seed(2)
rnorm.example<-rnorm(10, mean=15, sd=2)
rnorm.example
##   13.20617 15.36970 18.17569 12.73925 14.83950 15.26484 16.41591
##   14.52060 18.96895 14.72243
set.seed(2)
rnorm.example<-rnorm(10, mean=15, sd=2)
rnorm.example
##   13.20617 15.36970 18.17569 12.73925 14.83950 15.26484 16.41591
##   14.52060 18.96895 14.72243

Setting a seed of 2 will create a different set of 10 generated values, which can be reproduced each time the seed 2 is specified. Thus, different seed numbers can be used to store different sets of values for the same expression. The seed numbers I use throughout this chapter were chosen by me at random and hold no symbolic meaning whatsoever. In the case of this chapter, the set.seed() function is useful as it ensures that that, whenever I specify a seed, the same numbers will appear each time the full chapter code is run.

## 2.2 B. Using rnorm for Basic Simulation

Let’s use the rnorm function to simulate some data in a real-world context!

To set the stage for the scenario I want to simulate, I will first need to introduce a new construct:

“[REDACTED], it won’t work!”.

This construct describes the mixture of frustration, despair, and resignation that students in R courses experience when they encounter a perplexing problem in R, whether it be packages that refuse to install, functions that yield inscrutable error messages, or code that runs perfectly on a neighboring computer but stubbornly refuses to run on yours. We can use the rnorm function to simulate the number of instances of this experience over the course of a semester-long R class.

More specifically, we will use rnorm to simulate the data we might collect in surveying the number of instances of “[REDACTED], it won’t work!” that Mac and PC users might encounter over the course of the first and second half of a semester-long R course. Generally, when simulating data, we either [WHAT GOES HERE??? A WORD OR PHRASE IS MISSING] on known population parameters or on theoretically driven assumptions regarding the population parameters. The data I will generate are based on the following assumptions (derived from informal observation of my R class) regarding what observations would look like:

1. Instances of “[REDACTED], it won’t work!” encountered during R use may be caused either by user error/unfamiliarity with the R program or by R program bugs/compatibility issues.

2. As students become more familiar with the R program, user error decreases, leading to fewer instances of “[REDACTED], it won’t work!” stemming from user error for both Mac and PC users in 2nd half of semester compared to 1st half.

3. R studio is better-optimized for PC than for Mac, leading to more instances of “[REDACTED], it won’t work!” caused by program errors and compatibility issues for Mac users than for PC users.

Based on these assumptions, given instances of “[REDACTED], it won’t work!” as the dependent variable, I would expect to observe a main effect of Time (more instances in 1st half than 2nd half of semester) and of User Type (more instances for Mac users than PC users). Let’s use the rnorm function to simulate this pattern of results. Note that other factors that might influence the dependent variable, such as individual R proficiency and psychological effects of R course professors frequently disparaging Macs, are not modeled here.

set.seed(103)
MacUsers1stHalf <- rnorm(100, mean = 500, sd = 35)
PCUsers1stHalf <- rnorm(100, mean = 450, sd = 35)
MacUsers2ndHalf <-rnorm(100, mean = 150, sd = 35)
PCUsers2ndHalf <- rnorm (100, mean = 100, sd=35)

As you see in the script above, I set the averages to simulate the hypothesized pattern of results. As I had no predictions regarding the distribution of data, I have no strong theoretical rationale for setting a standard deviation of 35. However, keeping the default standard deviation value of 1 would have yielded unnatural-looking simulated data, as the distribution of data would have been too tightly clustered around the mean. I used the set.seed function to ensure that the same values will be generated each time this code is run. Again, why not just use the object like you did with rnorm.example? Why always use set.seed? If you give the explanation before this mention of it, I wouldn’t be questioning it down here. The next section discusses the rep function. Once we have learned the basics of the rep function, we will be able to do more with the observations we have simulated.

# 3 The rep Function and its Applications

In this section, I will introduce the rep function and provide a basic use case in the context of creating a data frame from the simulated data we have generated. The rep function replicates the values in a given object (a vector or a function) a specified number of times. The rep function uses the following syntax and arguments:

        rep(x, ...)


Where “x” is the value to be replicated or the vector containing values to be replicated, and “…” denotes additional arguments, such as “times=” (which specifies the number of times to replicate “x”), “each=” (which specifies the number of times to repeat each element of “x”), and “len=” (which specifies the desired length of an output vector). Below, I go over some very basic examples of each of these uses of the rep function.

## 3.1 A. Examples of simple replication of individual values and vectors

#### 3.1.0.1 The simplest use of the replicate function is to replicate a text or numeric value a set number of times. To begin, let’s replicate the number 5 ten times:

rep(5, 10)
##   5 5 5 5 5 5 5 5 5 5

As you see above, entering a value n such that the sytax reads “rep(x,n)” defaults to the same meaning as the syntax “rep(x, times=n)”. Aside from individual values, we can also replicate a text label in the same manner:

rep("Kitty Cat", 10)
##   "Kitty Cat" "Kitty Cat" "Kitty Cat" "Kitty Cat" "Kitty Cat"
##   "Kitty Cat" "Kitty Cat" "Kitty Cat" "Kitty Cat" "Kitty Cat"

One possible use for replicating individual values or text labels is populating columns for continuous or categorical data in a data frame. I explore this use in section 5.2 B below.

Aside from replicating individual text or numeric values, the rep function is commonly used to replicate (elements within) vectors. Below, I create a simple vector of seven numeric values and then replicate that vector, as a whole, 3 times.

ExampleRepVector<-c(1,2,5,7,7,8,10)
rep(ExampleRepVector, 3)
##    1  2  5  7  7  8 10  1  2  5  7  7  8 10  1  2  5  7  7  8 10

We can also replicate each element of this vector a set number of times. Below, I first replicate each element of the vector twice. Next, I replicate each element twice, while also replicating the vector as a whole (with each element present twice) three times.

rep(ExampleRepVector, each=2)
##    1  1  2  2  5  5  7  7  7  7  8  8 10 10
rep(ExampleRepVector, times=3, each=2)
##    1  1  2  2  5  5  7  7  7  7  8  8 10 10  1  1  2  2  5  5  7  7  7
##   7  8  8 10 10  1  1  2  2  5  5  7  7  7  7  8  8 10 10

Finally, let’s see what happens when we specify an output vector of a different length than the number of terms in the original vector (7):

20 terms:

rep(ExampleRepVector, len=20)
##    1  2  5  7  7  8 10  1  2  5  7  7  8 10  1  2  5  7  7  8

5 terms:

rep(ExampleRepVector, len=5)
##  1 2 5 7 7

When outputting to a vector longer than the original vector, the original vector is replicated until the output vector has been filled. In our case, filling the 20 term output vector requires the original vector to be replicated twice, with an incomplete 3rd replication of only the first six terms.

When outputting to a vector shorter than the original vector, the output vector is filled before all of the values of the original vector can be replicated. In our case, the original vector has 7 values and the output vector only has room for 5, so the last 2 values are not replicated.

## 3.2 B. Using the rep Function to Create Data Frames

#### 3.2.0.1 Now that we’ve gone over the basic uses of the rep function, let’s use the rep function to help us create a data frame for the simulated data we created in section 5.1 . Recall that we have one between-subjects variable, UserType (Mac user or PC user), and one within-subjects variable, Time (1st and 2nd Half of Semester). As we have 100 Mac Users and 100 PC users, our total number of subjects is 200.

In creating the data frame, I will use the rep function to replicate the labels of each level of each IV across the relevant range of subjects. Below, I set these arguments and create the data frame:

n.MacUsers=100
n.PCUsers=100
N=n.MacUsers + n.PCUsers

SimulatedData <- data.frame(
Subject = c(seq(1,N),seq(1,N)),
Time = c(rep("1st Half", N), rep("2nd Half", N)),
UserType = c(rep(c(rep("Mac User", n.MacUsers), rep("PC User", n.PCUsers)),2)),
Instances = c(MacUsers1stHalf,PCUsers1stHalf,
MacUsers2ndHalf,PCUsers2ndHalf

)
)

Using the rep function to denote which level of the within- and between-subjects IV each data point is associated with, as in the above data frame syntax, allows us to see which values correspond to Mac Users and PC Users, and to the 1st Half and 2nd Half of the semester. Let’s make a table of these data. As there are a total of 400 columns in the full table, I will be using the head() function to display only the first several columns. I will be using the kable function in the knitr package to produce this table; further documentation and help for this package can be found at the following URL: http://yihui.name/knitr/

knitr::kable(head(SimulatedData))
Subject Time UserType Instances
1 1st Half Mac User 472.4909
2 1st Half Mac User 501.9159
3 1st Half Mac User 458.9604
4 1st Half Mac User 494.1441
5 1st Half Mac User 434.7239
6 1st Half Mac User 495.7846

This table could be more easily interpreted if the two levels of Time, 1st Half and 2nd Half, were represented on separate columns. We can use the dcast() function (part of the “reshape2” package) to change the structure of the data frame and then create a new table. Below, I use the dcast function to restructure the data frame by splitting the variable time into two columns corresponding to the 2 levels of the variable: First Half and Second Half. Again, I use the head() function to display only the first several columns of a table that has 200 columsn in total. Further documentation and help with the reshape 2 package can be found at the following URL: http://seananderson.ca/2013/10/19/reshape.html

SimulatedData.Wide<-dcast(SimulatedData, Subject + UserType ~ Time, value.var="Instances")
knitr::kable(head(SimulatedData.Wide))
Subject UserType 1st Half 2nd Half
1 Mac User 472.4909 142.7769
2 Mac User 501.9159 154.4736
3 Mac User 458.9604 111.5123
4 Mac User 494.1441 189.0489
5 Mac User 434.7239 111.8786
6 Mac User 495.7846 138.1905

As you can see, the variable “Time” has now been split into two columns, “1st Half” and “2nd Half”. To see the full table, either with or without the split, view the object in the Global Environment.

# 4 The sample Function

In this section, I will introduce the sample function, provide examples of basic use, and discuss sampling with and without replacement and bootstrapping. The sample function is most commonly used to take a sample of the elements of a vector, which can be done either with or without replacement. The sample function uses the following syntax and arguments:

sample(x, size, replace = FALSE)

Where “x” is a vector of elements to choose from, “size” is the number of elements to be sampled, and “replace=” can be set to either “TRUE” or “FALSE” depending on whether you wish to sample with or without replacement. Specifying replace = “TRUE” will samle with replacement, whereas specifying replace = “FALSE” will sample without replacement. Below, I go over basic uses of the “sample” function to sample with and without replacement.

## 4.1 A. Basic Examples Using the sample Function

First, let’s create a vector of values which does not include any repeated elements. This will make it easier to see the difference between sampling with and without replacement.

ExampleSampleVector<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)

Let’s sample fifteen values from this vector WITHOUT replacement

sample(ExampleSampleVector,15,replace = FALSE)
##   18  4  6  8  3 13 14  5 10 12 15 16  9 20  7

As you can see, just as no value appeared in the original vector more than once, no value appears in our sample more than once. Sampling without replacement means that once a value has been sampled, it is “removed” from the original vector and cannot be randomly sampled again.

Now, let’s sample fifteen values from this vector WITH replacement

sample(ExampleSampleVector,15,replace = TRUE)
##   11 11 15  6 17  9  1 14 10 16  8 13 15 18 13

This time, it is likely that there is at least one duplicate value in our sample. Sampling with replacement means that sampled values are “replaced” in the original vector and thus may be randomly sampled again.

When drawing random samples of a few observations, a different set of observations will generally be sampled each time you run the sample function. If you want to keep the same set of sampled observations in order to perform further manipulations using that particular set, you can use the set.seed function here as well:

set.seed(2); sample(ExampleSampleVector, 15)
##    4 14 11  3 16 15  2 18  6  7 12 17 13 19  9
set.seed(2); sample(ExampleSampleVector, 15)
##    4 14 11  3 16 15  2 18  6  7 12 17 13 19  9

As you can see, the two sampled sets are identical, thanks to the set.seed function. I can use this seed to recall this particular set of fifteen sampled values if I want to use them in the future.

## 4.2 B. Using the sample Function and Bootstrapping on our Simulated Mac and PC User Data

Now, we will discuss sampling in the context of our simulated data, and introduce the concept of bootstrapping. Bootstrapping involves resampling from a population or set of values a specified number of times in order for the mean of your samples to more closely approximate the mean of the population or original set of values. While there are several ways to perform bootstrapped sampling in R, the simplest is to use the replicate function, which I will discuss only as it pertains to bootrapping. The replicate function repeats a given function a specified number of times, using the following syntax and arguments:

replicate(n, expr)

Where n is the number of desired replications and “expr” is the expression to be replicated, usually an object already entered in the global environment.

Boostrapping is often used when sampling from a normally distributed population. Here, I discuss sampling and bootstrapping samples from a larger set of observations. In the next section, 5.4, I discuss sampling from a population in the context of demonstrating the Law of Large Numbers.

Let’s sample ten values from our data corresponding to Mac Users in the first half of the semester, and compare the mean of that sample to the mean of the dataset being sampled from:

mean(MacUsers1stHalf)
##  500.3287
set.seed(14)
MacUsersSample.1
##   525.9583 451.4742 453.5782 509.6634 482.8383 465.0211 451.4292
##   509.0998 507.0654 499.6237
mean(MacUsersSample.1)
##  485.5752

A random sample of ten values is not always representative of the larger set from which it is drawn. In this case, we have randomly sampled ten values from the 100 values that represent the number of instances of “[REDACTED], it won’t work!” of 100 Mac Users over the first half of semester. We know that the average number of instances for this group is 500.33, but the average of our ten sampled values is roughly 486. We can improve the accuracy of the mean estimate obtained through sampling by using bootstrapping.

set.seed(993)
TenSampleBootstrapNR<-replicate(10, sample(MacUsers1stHalf, 10, replace = FALSE))
TenSampleBootstrapNR
##           [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
##  [1,] 482.8383 474.9895 458.9604 504.5545 553.3036 466.6709 532.2691
##  [2,] 501.3101 494.6208 467.3723 446.8776 446.8776 467.3774 553.3036
##  [3,] 510.7601 537.5814 482.2247 483.4794 528.7878 511.9421 479.1027
##  [4,] 504.0899 527.7317 479.1027 472.4909 501.3101 516.9374 494.1441
##  [5,] 494.1441 501.3101 494.1441 464.3050 498.2764 461.7992 484.4387
##  [6,] 483.4794 561.6610 516.2847 509.0998 522.1641 527.7317 457.2586
##  [7,] 522.1641 505.9178 490.5675 560.1585 486.8711 535.5993 458.9604
##  [8,] 434.7239 536.5534 522.9261 504.0382 458.3491 494.1441 494.6208
##  [9,] 466.6709 501.1626 458.3491 495.7846 511.9421 536.5534 561.6610
## [10,] 461.7992 491.9022 448.3091 529.8700 497.9852 465.0211 453.4769
##           [,8]     [,9]    [,10]
##  [1,] 467.3774 453.4769 464.0428
##  [2,] 522.9261 479.1027 541.6556
##  [3,] 569.8026 463.1552 482.7156
##  [4,] 502.8677 446.8776 510.7601
##  [5,] 458.9604 495.0046 509.0998
##  [6,] 546.8735 505.9178 569.8026
##  [7,] 560.8189 448.3091 515.2273
##  [8,] 501.3091 465.0211 516.2847
##  [9,] 466.7099 524.3736 465.0211
## [10,] 497.9852 451.4742 424.9227
MeanTenSampleBootstrapNR<-replicate(10, mean(sample(MacUsers1stHalf, 10, replace = FALSE)))
MeanTenSampleBootstrapNR
##   486.6228 507.5105 496.0812 505.0298 491.6476 506.2274 489.8495
##   496.4604 489.9878 491.1184
mean(MeanTenSampleBootstrapNR)
##  496.0535

Taking 10 bootrapped samples of ten values, without replacement each yields an average of roughly 496 - an improvement. Let’s try the same thing for sampling with replacement:

set.seed(82)
TenSampleBoostrapWR<-replicate(10, sample(MacUsers1stHalf,10,replace = TRUE))
TenSampleBoostrapWR
##           [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
##  [1,] 501.1626 515.2273 524.3736 504.0899 424.9227 499.6237 560.4686
##  [2,] 461.7992 501.9159 525.9583 537.5814 527.7317 516.0727 467.3774
##  [3,] 535.5993 465.0211 511.9421 477.8833 453.5782 501.1626 461.7992
##  [4,] 507.0654 479.1027 516.9374 479.1027 451.4742 537.5814 458.3491
##  [5,] 544.4890 504.0382 465.0211 463.1552 465.0211 509.6634 494.6208
##  [6,] 532.2691 490.5675 461.7992 532.2691 491.4081 560.1585 479.1027
##  [7,] 461.7992 525.9583 544.4890 505.9178 541.6556 510.7601 504.7141
##  [8,] 509.1655 504.7141 516.9374 511.0911 451.4549 482.7156 509.1655
##  [9,] 560.8189 472.8796 491.9022 497.9852 467.3774 451.4549 463.1552
## [10,] 491.4081 472.4909 468.9095 482.7156 434.7239 484.1997 522.1641
##           [,8]     [,9]    [,10]
##  [1,] 528.7878 511.0911 446.8776
##  [2,] 524.3736 486.8711 453.4769
##  [3,] 527.7317 446.8776 501.1626
##  [4,] 484.4387 477.8833 466.6709
##  [5,] 509.1655 525.9583 466.6709
##  [6,] 491.4081 528.7878 510.7601
##  [7,] 474.9895 467.3723 498.8165
##  [8,] 463.1552 453.5452 497.9852
##  [9,] 434.7239 536.5534 504.2742
## [10,] 483.4794 453.5782 528.7878
MeanTenSampleBootstrapWR<-replicate(10, mean(sample(MacUsers1stHalf,10,replace = TRUE)))
MeanTenSampleBootstrapWR
##   514.7425 498.8315 491.6189 494.2262 497.2734 497.1704 498.6680
##   502.5443 503.0764 492.2138
mean(MeanTenSampleBootstrapWR)
##  499.0365

Here, our mean is roughly 499. Again, this is an improvement over taking only one sample. However, we can guarantee ever more accurate estimates of the mean by taking even more bootstrapped samples, as seen in the following example.

Now, let’s try estimating the mean based on 100 bootstrapped samples of ten observations each, first without and then with replacement:

set.seed(90)
HundredSampleBootstrapNR<-replicate(100, sample(MacUsers1stHalf, 10, replace = FALSE))
MeanHundredSampleBootsrapNR<-replicate(100, mean(sample(MacUsers1stHalf, 10, replace = FALSE)))
MeanHundredSampleBootsrapNR
##    483.9443 482.1300 501.8791 497.3724 521.8733 495.3230 504.2639
##    502.6423 490.4462 499.9054 505.3837 496.8223 498.4532 504.4094
##   493.6311 484.4697 506.3115 507.2337 502.4622 506.0444 493.9876
##   479.9301 505.4607 485.3938 499.5878 498.2262 511.7374 501.2629
##   500.0467 487.8291 504.6782 514.9830 499.7610 485.8169 516.1472
##   491.6890 499.5895 505.9952 484.8721 505.3810 516.2393 501.0103
##   494.4457 503.0997 497.2621 515.9588 508.0401 501.7428 516.8377
##   500.4021 500.2011 500.6698 507.9374 516.0995 511.6492 497.5592
##   512.4951 499.4033 483.3637 496.6662 511.4488 499.6909 503.0627
##   511.9006 485.8086 489.7278 499.6084 493.6650 497.9322 494.2967
##   492.8505 500.1848 501.2082 513.1225 497.2403 484.4841 514.7898
##   495.1924 501.3294 498.0645 490.8601 496.2211 487.8071 497.7390
##   502.5861 498.2383 472.8172 508.5017 498.8662 500.0314 494.8469
##   482.0513 505.2282 500.5006 501.6754 513.7240 513.7879 491.4527
##   496.3572 497.7718
mean(MeanHundredSampleBootsrapNR)
##  499.7513
set.seed(92)
HundredSampleBootstrapWR<-replicate(100, sample(MacUsers1stHalf, 10, replace = TRUE))
MeanHundredSampleBootstrapWR<-replicate(100, mean(sample(MacUsers1stHalf, 10, replace = TRUE)))
MeanHundredSampleBootstrapWR
##    501.6860 497.1856 499.6808 495.2185 524.2279 494.7761 507.2499
##    490.6339 516.4108 478.7056 501.4649 500.1753 496.9795 498.5235
##   489.4741 506.9105 501.8296 483.6765 500.2106 492.0935 502.8805
##   501.4384 491.0846 487.5533 494.4862 491.7782 495.9216 499.3074
##   493.5227 494.7585 490.3979 502.3776 498.2886 483.9435 502.6134
##   499.6300 507.9872 501.9230 493.7481 500.8062 492.6423 490.5927
##   510.2734 481.4542 498.5302 515.0135 505.5613 493.5074 515.0532
##   500.5499 503.1455 498.9470 492.1895 499.8119 482.2462 514.1567
##   498.2753 508.2647 488.7334 508.0879 490.0728 493.2714 490.4271
##   498.7075 496.0585 507.2410 502.0665 497.6320 503.8993 509.9675
##   479.2922 512.2531 504.1145 494.1574 506.8482 507.0131 489.9587
##   503.6567 478.6168 504.7695 482.8386 500.7073 498.8247 512.1134
##   499.9993 483.4650 510.6256 489.4198 508.0756 508.6300 504.6548
##   513.8521 497.0976 497.8362 501.1025 504.4786 504.7135 520.8636
##   513.7015 492.8148
mean(MeanHundredSampleBootstrapWR)
##  499.2443

Taking 100 bootstrapped samples further improves our estimate of the mean of the group we are sampling from. Now, our mean is very close to 500, the true value. The larger the size of the population being sampled from, the less of a difference there is between sampling with and without replacement. In our case, as there were 100 values in the group being sampled from and we sampled ten values at a time, it didn’t make a large difference whether we sampled with or without replacement.

Seeing as the rnorm function generates a sample of a specified number of values from a hypothetical normal distribution with a user-specified mean and standard deviation, bootstrapping simulations are often performed using the rnorm function as well.

# 5 Applications to General Statistical Problems

## 5.1 A. Demonstrating the Law of Large Numbers

Acknowledgment: This section drew inspiration from a demonstration of the Law of Large numbers created by Pedro F. Quintana-Ascencio and James Angelo at UCF, which can be found at the following URL: http://pascencio.cos.ucf.edu/classes/Methods/R%20demos/R%20Demonstration%20Law%20Large%20numbers%202012.pdf

The Law of Large Numbers states that as the number of observations in a given random sample of the population increases, the mean of that sample approaches the true mean value of the population.

A simple corollary can be made to the act of flipping a coin. We know that the true rate of each outcome, heads or tails, is 0.5. However, if we flip a coin five times, we could easily obtain a result of four heads and 1 tails, yielding a rate of 0.8 for heads and 0.2 for tails. In this case, the mean estimated from a sample of five would be highly innaccurate. However, if we flip a coin one thousand times, it is virtually impossible to obtain such an inaccurate estimate of the true rate of each outcome. Flipping a coin one thousand times would yield a rate very close to 0.5 for both heads and tails.

Let’s demonstrate the Law of Large Numbers in R. To begin, we need to establish a population to sample from. I chose to use the distribution of Japanese mens’ heights as my population, as Japan was the most prominent country for which I could find both a mean and a standard deviation for height (see http://www.mext.go.jp/b_menu/toukei/001/022/2004/002.pdf). Below, I enter the mean of the population, 172 centimeters, as an object “mu”, and the standard deviation, 5.42 centimeters, as an object “sigma”. Mu is the Greek letter used to denote population mean and sigma is the Greek letter used to denote population standard deviation

mu = 172
sigma = 5.42

To demonstrate the Law of Large numbers, I will sample from this population 50 times, with the number of observations per sample increasing from 1 to 50 in increments of one. Once all of the samples have been collected, I will plot the average of each sample to observe whether the distribution of means changes as the number of samples increases.

First, I will create an empty vector of length 50, which will eventually be populated by the means of my 50 samples:

Vector_of_Means <- rep(0,50)

Checking the vector, we can see that it is currently populated by 50 zeroes:

Vector_of_Means
##   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Next, I need to start populating the vector. I will construct the vector such that the 1st value will be the mean of a sample of 1 observation, the 2nd value will be the mean of a sample of 2 obsrvations, etc. - all the way to the 50th value which is the mean of a sample of 50 observations. To begin, I need to take a sample of one, find the mean, and enter the mean into the vector of averages. I combine these steps into one function below:

Vector_of_Means<-mean(rnorm(1,mean=mu,sd=sigma))

Now, I have to repeat the process for samples of 2 to 50 values to populate the 2nd through the 50th slot in the vector. To do this, I could write 49 more expressions like the one above, but a more elegant solution is to use a “for loop to accomplish the same thing. For help with understanding different types of loops in R, see Chapter 3: Loops and Logicals. In the loop below, I am specifying that for a value of i from 2 to 50 (the length of the vector), the ith value of the vector should be the mean of a sample of i values, with the mean and sd establised previously:

for (i in 2:length(Vector_of_Means)) {
Vector_of_Means[i] <- mean(rnorm(i, mean=mu, sd=sigma))
}

Checking the vector, we can see that it is now populated by the 50 values we have generated:

Vector_of_Means
##   172.4067 174.4181 166.8237 173.8780 174.8917 169.7871 175.6439
##   171.2992 174.9572 171.1308 172.4583 172.1380 170.9901 174.8588
##  171.2705 173.6047 171.6975 172.3292 171.5491 172.9344 172.4017
##  171.0498 170.7853 172.5667 172.7151 173.8499 171.3387 171.9326
##  172.0962 171.6051 171.9281 172.2289 173.1625 172.1153 172.4185
##  172.1944 173.4946 173.1161 172.6899 171.9070 173.1682 170.0475
##  172.6212 173.1910 171.4739 172.0959 172.3849 172.6251 172.7516
##  171.6788

Finally, I create a scatterplot of the means, with Sample Size on the x axis and Centimeters on the Y axis. The blue line indicates the true mean of the population, 172 centimeters. For help with graphing and scatterplots, see Chapter 10: GGplot 1 (Scatterplots and Density).

plot(seq(1,50), Vector_of_Means, xlab = "Sample Size", ylab = "Centimeters")
abline(h=mu, col="blue") From the scatterplot, it is clear that the distribution of points (sample means) clusters closer and closer to the true population mean as sample size increases, demonstrating the Law of Large Numbers. Increasing sample size further would continue this trend.

## 5.2 B. Simulating a Poisson Distribution

A Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a given interval of time, with the requirement that these events occur with a known average rate and that the probability of each event is independent of the probability of the previous events. In a Poisson distribution the variable lambda, sometimes called the rate parameter, represents the average number of events in a given time interval.

The rpois() function can be used to simulate a Poisson distribution in R using the following syntax:

rpois(n, lambda)

Where n is the number of random Poisson numbers to generate and lambda is the aforementioned rate parameter of the distribution.

Let’s simulate a Poisson distribution for a real world scenario. We can use a Poisson distribution to simulate the number of emails that a typical UIC psychology graduate student receives each day (including UIC MASSMAIL) over a 100 day period. Because the probability of each email received is usually independent of previous emails received, a Poisson distribution is a good choice to model this data. Let’s simulate data for a typical student for 100 days, assuming an average rate of 30 emails per day (lambda=30):

PoissonEmails<-rpois(100, 30)

To visualize these data, let’s make a scatterplot:

plot(PoissonEmails,xlab="Day Number", ylab = "Number of Emails Received")
abline(h=30, col="blue") The plot above shows the simulated data representing the number of emails received each day over the 100 day period. The blue line represents the average rate of 30 which I set for this distribution. As expected, data points for specific days fall both above and below this specified average daily rate.