Lesson 99 – The Two-Sample Hypothesis Tests in R

Over the past seven lessons, we equipped ourselves with the necessary theory of the two-sample hypothesis tests.

Lessons 92 and 93 were about the hypothesis test on the difference in proportions. In Lesson 92, we learned Fisher’s Exact Test to verify the difference in proportions. In Lesson 93, we did the same using the normal distribution as a limiting distribution when the sample sizes are large.

Lessons 94, 95, and 96 were about the hypothesis test on the difference in means. In Lesson 94, we learned that under the proposition that the population variances of two random variables are equal, the test-statistic, which uses the pooled variance, follows a T-distribution with n_{1}+n_{2}-2 degrees of freedom. In Lesson 95, we learned that Welch’s t-Test could be used when we cannot make the equality of population variances assumption. In Lesson 96, we learned Wilcoxon’s Rank-sum Test, which uses a ranking approach to approximate the significance of the differences in means. This method is data-driven, needs no assumptions on the limiting distribution, and works well for small sample sizes.

Lesson 97 was about the hypothesis test on the equality of variances. Here we got a sneak peek into a new distribution called the F-distribution, which describes the ratio of two Chi-square random variables, each divided by its respective degrees of freedom. The test-statistic is the ratio of the sample variances, which we can verify against an F-distribution with n_{1}-1 numerator degrees of freedom and n_{2}-1 denominator degrees of freedom.

All these hypothesis tests can also be done using the bootstrap approach, which we learned in Lesson 98. It uses the data at hand to generate the null distribution of any desired statistic as long as it is computable from the data. It is very flexible. There is no need to make any assumptions on the data’s distributional nature or the limiting distribution for the test-statistic.

It is now time to put all of this learning to practice.


The City of New York Department of Sanitation (DSNY) archives data on the monthly tonnage of trash collected from NYC residences and institutions.

To help us with today’s workout, I dug up our gold mine, the “Open Data for All New Yorkers” page, and found an interesting dataset on the monthly tonnage of trash collected from NYC residences and institutions.

Here is a preview.

The data includes details on when (year and month) DSNY collected the trash, from where (borough and community district), and how much (tonnage of refuse, source-separated recyclable paper, and source-separated metal, glass, plastic, and beverage cartons). They also have other seasonal data such as the tonnage of leaves collected in November and December and tons of Christmas trees collected from curbside in January.

For our workout today, we can use this file named “DSNY_Monthly_Tonnage_Data.csv.” It is a slightly modified version of the file you will find from the open data page.

My interest is in Manhattan’s community district 9 — Morningside Heights, Manhattanville, and Hamilton Heights.

I want to compare the tonnage of trash collected in winter to that in summer. Let’s say we compare the data in Manhattan’s community district 9 for February and August to keep it simple.

Are you ready?

The Basic Steps

Step 1: Get the data

You can get the data from here. The file is named DSNY_Monthly_Tonnage_Data.csv. It is a comma-separated values file which we can conveniently read into the R workspace. 

Step 2: Create a new folder on your computer

And call this folder “lesson99.”
Make sure that the data file “DSNY_Monthly_Tonnage_Data.csv” is in this folder.

Step 3: Create a new code in R

Create a new code for this lesson: “File >> New >> R script”. Save the code in the same folder “lesson99” using the “save” button or by using “Ctrl+S.” Use .R as the extension — “lesson99_code.R.”

Step 4: Choose your working directory

“lesson99” is the folder where we stored the code and the input data file. Use setwd("path") to set the path to this folder. Execute the line by clicking the “Run” button on the top right. 

setwd("path to your folder")

Or, if you opened this “lesson99_code.R” file from within your “lesson99” folder, you can also use the following line to set your working directory.

setwd(getwd())

getwd() gets the current working directory. If you opened the file from within your directory, getwd() will return the working directory you want to set.

Step 5: Read the data into the R workspace

Execute the following command to read your input .csv file.

# Read the data file
nyc_trash_data = read.csv("DSNY_Monthly_Tonnage_Data.csv",header=T)

Since the input file has a header for each column, we should have header=T in the command when we read the file, so that the first row is read as column names and the remaining rows are correctly read as data.

Your RStudio interface will look like this when you execute this line and open the file from the Global Environment.

Step 6: Extract a subset of the data

As I said before, for today’s workout, let’s use the data from Manhattan’s community district 9 for February and August. Execute the following lines to extract this subset of the data.

# Extract February refuse data for Manhattan's community district 9 
feb_index = which((nyc_trash_data$BOROUGH=="Manhattan") & (nyc_trash_data$COMMUNITYDISTRICT==9) & (nyc_trash_data$MONTH==2))

feb_refuse_data = nyc_trash_data$REFUSETONSCOLLECTED[feb_index]

# Extract August refuse data for Manhattan's community district 9
aug_index = which((nyc_trash_data$BOROUGH=="Manhattan") & (nyc_trash_data$COMMUNITYDISTRICT==9) & (nyc_trash_data$MONTH==8))

aug_refuse_data = nyc_trash_data$REFUSETONSCOLLECTED[aug_index]

In the first line, we look up the row indices of the data corresponding to Borough = Manhattan, Community District = 9, and Month = 2.

In the second line, we extract the data on refuse tons collected for these rows.

The next two lines repeat this process for August data.

Sample 1, February data for Manhattan’s community district 9 looks like this:
2953.1, 2065.8, 2668.2, 2955.4, 2799.4, 2346.7, 2359.6, 2189.4, 2766.1, 3050.7, 2175.1, 2104.0, 2853.4, 3293.2, 2579.1, 1979.6, 2749.0, 2871.9, 2612.5, 455.9, 1951.8, 2559.7, 2753.8, 2691.0, 2953.1, 3204.7, 2211.6, 2857.9, 2872.2, 1956.4, 1991.3

Sample 2, August data for Manhattan’s community district 9 looks like this:
2419.1, 2343.3, 3032.5, 3056.0, 2800.7, 2699.9, 3322.3, 3674.0, 3112.2, 3345.1, 3194.0, 2709.5, 3662.6, 3282.9, 2577.9, 3179.1, 2460.9, 2377.1, 3077.6, 3332.8, 2233.5, 2722.9, 3087.8, 3353.2, 2992.9, 3212.3, 2590.1, 2978.8, 2486.8, 2348.2

The values are in tons per month, and there are 31 values for sample 1 and 30 values for sample 2. Hence, n_{1}=31 and n_{2}=30.

The Questions for Hypotheses

Let’s ask the following questions.

  1. Is the mean of the February refuse collected from Manhattan’s community district 9 different from August?
  2. Is the variance of the February refuse collected from Manhattan’s community district 9 different from August?
  3. Suppose we set 2500 tons as a benchmark for low trash situations: is the proportion of February refuse lower than 2500 tons different from that of August? In other words, is the low trash situation in February different from August?

We can visualize their distributions. Execute the following lines to create a neat graphic of the boxplots of sample 1 and sample 2.

# Visualizing the distributions of the two samples
boxplot(cbind(feb_refuse_data,aug_refuse_data),horizontal=T, main="Refuse from Manhattan's Community District 9")
text(1500,1,"February Tonnage",font=2)
text(1500,2,"August Tonnage",font=2)

p_threshold = 2500 # tons of refuse
abline(v=p_threshold,lty=2,col="maroon")

You should now see your plot space changing to this.

Next, you can execute the following lines to compute the preliminaries required for our tests.

# Preliminaries
  # sample sizes
  n1 = length(feb_refuse_data)
  n2 = length(aug_refuse_data)
  
  # sample means
  x1bar = mean(feb_refuse_data)
  x2bar = mean(aug_refuse_data)

  # sample variances
  x1var = var(feb_refuse_data)
  x2var = var(aug_refuse_data)

  # sample proportions
  p1 = length(which(feb_refuse_data < p_threshold))/n1
  p2 = length(which(aug_refuse_data < p_threshold))/n2

Hypothesis Test on the Difference in Means

We learned four methods to verify the hypothesis on the difference in means:

  • The two-sample t-Test
  • Welch’s two-sample t-Test
  • Wilcoxon’s Rank-sum Test
  • The Bootstrap Test 

Let’s start with the two-sample t-Test. Assuming a two-sided alternative, the null and alternate hypotheses are:

H_{0}: \mu_{1} - \mu_{2}=0

H_{A}: \mu_{1} - \mu_{2} \neq 0

The alternate hypothesis indicates that substantial positive or negative differences will allow the rejection of the null hypothesis.

We know that for the two-sample t-Test, we assume that the population variances are equal, and the test-statistic is t_{0}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}, where s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2} is the pooled variance. t_{0} follows a T-distribution with n_{1}+n_{2}-2 degrees of freedom. We can compute the test-statistic and check how likely it is to see such a value in a T-distribution (null distribution) with so many degrees of freedom.

Execute the following lines in R.

pooled_var = ((n1-1)/(n1+n2-2))*x1var + ((n2-1)/(n1+n2-2))*x2var

t0 = (x1bar-x2bar)/sqrt(pooled_var*((1/n1)+(1/n2)))

df = n1+n2-2

pval = pt(t0,df=df)

print(pooled_var)
print(df)
print(t0)
print(pval)

We first computed the pooled variance s^{2} and, using it, the test-statistic t_{0}. Then, based on the degrees of freedom, we compute the p-value.

The p-value is 0.000736. For a two-sided test, we compare this with \frac{\alpha}{2}. If we opt for a rejection rate \alpha of 5%, then, since our p-value of 0.000736 is less than 0.025, we reject the null hypothesis that the means are equal.

All these steps can be implemented using a one-line command in R.

Try this:

t.test(feb_refuse_data,aug_refuse_data,alternative="two.sided",var.equal = TRUE)

You will get the following prompt in the console.

Two Sample t-test
data: feb_refuse_data and aug_refuse_data
t = -3.3366, df = 59, p-value = 0.001472
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -658.2825 -164.7239
sample estimates:
mean of x mean of y 
 2510.697  2922.200

As inputs of the function, we need to provide the data (sample 1 and sample 2). We should also indicate that we evaluate a two-sided alternate hypothesis and that the population variances are equal. This will prompt the function to execute the standard two-sample t-test.

While the function provides additional information, all we need for our test are pointers from the first line. t=-3.3366 and df=59 are the test-statistic and the degrees of freedom that we computed earlier. The p-value looks different than what we computed. It is two times the value we estimated. The function provides a value that is double that of the original p-value to allow us to compare it to \alpha instead of \frac{\alpha}{2}. Either way, we know that we can reject the null hypothesis.
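We can confirm this doubling in the console, using the t0 and df values computed earlier.

2*pt(t0,df=df) # returns 0.001472, the p-value reported by t.test()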


Next, let’s implement Welch’s t-Test. When the population variances are not equal, i.e., when \sigma_{1}^{2} \neq \sigma_{2}^{2}, the test-statistic is t_{0}^{*}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}, and it follows an approximate T-distribution with f degrees of freedom which can be estimated using the Satterthwaite’s equation: f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1)}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)}}

Execute the following lines.

f = (((x1var/n1)+(x2var/n2))^2)/(((x1var/n1)^2/(n1-1))+((x2var/n2)^2/(n2-1)))

t0 = (x1bar-x2bar)/sqrt((x1var/n1)+(x2var/n2))

pval = pt(t0,df=f)

print(f)
print(t0)
print(pval)

The degrees of freedom from Satterthwaite’s equation is 55.34, the test-statistic is -3.3528, and the p-value is 0.000724. Comparing this with 0.025, we can reject the null hypothesis.

Of course, all this can be done using one line where you indicate that the variances are not equal:

t.test(feb_refuse_data,aug_refuse_data,alternative="two.sided",var.equal = FALSE)

When you execute the above line, you will see the following on your console.

Welch Two Sample t-test
data: feb_refuse_data and aug_refuse_data
t = -3.3528, df = 55.338, p-value = 0.001448
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-657.4360 -165.5705
sample estimates:
mean of x mean of y
2510.697 2922.200

Did you notice that it runs Welch’s two-sample t-Test when we indicate that the variances are not equal? Like before, we can compare the p-value to 0.05 and reject the null hypothesis.

The mean of the February refuse collected from Manhattan’s community district 9 is different from August. Since the test-statistic is negative and the p-value is lower than the chosen rejection rate, we can conclude that the mean February refuse tonnage is significantly lower than the mean August refuse tonnage beyond a reasonable doubt.


How about Wilcoxon’s Rank-sum Test? The null hypothesis is based on the proposition that if x_{1} and x_{2} are samples from the same distribution, there will be an equal likelihood of one exceeding the other.

H_{0}: P(x_{1} > x_{2}) = 0.5

H_{A}: P(x_{1} > x_{2}) \neq 0.5

We know that this ranking and coming up with the rank-sum table is tedious for larger sample sizes. For larger sample sizes, the null distribution of the test-statistic W, which is the sum of the ranks associated with the variable of smaller sample size in the pooled and ordered data, tends to a normal distribution. This allows the calculation of the p-value by comparing W to a normal distribution. We can use the following command in R to implement this.

wilcox.test(feb_refuse_data,aug_refuse_data,alternative = "two.sided")

You will get the following message in the console when you execute the above line.

Wilcoxon rank sum test with continuity correction
data: feb_refuse_data and aug_refuse_data
W = 248, p-value = 0.001788
alternative hypothesis: true location shift is not equal to 0

Since the p-value of 0.001788 is lower than 0.05, the chosen level of rejection, we reject the null hypothesis that x_{1} and x_{2} are samples from the same distribution.


Finally, let’s take a look at the bootstrap method. The null hypothesis is that there is no difference between the means. 

H_{0}: P(\bar{x}_{Feb}>\bar{x}_{Aug}) = 0.5

H_{A}: P(\bar{x}_{Feb}>\bar{x}_{Aug}) \neq 0.5

We first create a bootstrap replicate of X and Y by randomly drawing with replacement n_{1} values from X and n_{2} values from Y.

For each bootstrap replicate from X and Y, we compute the statistics \bar{x}_{Feb} and \bar{x}_{Aug} and check whether \bar{x}_{Feb}>\bar{x}_{Aug}. If yes, we register S_{i}=1. If not, we register S_{i}=0.

We repeat this process of creating bootstrap replicates of X and Y, computing the statistics \bar{x}_{Feb} and \bar{x}_{Aug}, and verifying whether \bar{x}_{Feb}>\bar{x}_{Aug} and registering S_{i} \in (0,1) a large number of times, say N=10,000.

The proportion of times S_{i} = 1 in a set of N bootstrap-replicated statistics is the p-value.

Execute the following lines in R to implement these steps.

#4. Bootstrap
  N = 10000
  null_mean = matrix(0,nrow=N,ncol=1)
  null_mean_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
     null_mean_ratio[i] = mean(xboot)/mean(yboot) 
     if(mean(xboot)>mean(yboot)){null_mean[i]=1} 
   }
 
 pvalue_mean = sum(null_mean)/N
 hist(null_mean_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="Xbar/Ybar",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(0.95,1000,paste("p-value=",pvalue_mean),col="red")

Your RStudio interface will then look like this:

A vertical bar will show up at a ratio of 1 to indicate that the area beyond this value is the proportion of times S_{i} = 1 in 10,000 bootstrap-replicated means.

The p-value is close to 0. Since it is less than 0.025, we reject the null hypothesis. More than 97.5% of the time, the mean February tonnage is less than the mean August tonnage, which is sufficient evidence that they are not equal.

In summary, while the standard t-Test, Welch's t-Test, and Wilcoxon's Rank-sum Test can be implemented in R using single-line commands,

t.test(x1,x2,alternative="two.sided",var.equal = TRUE)
t.test(x1,x2,alternative="two.sided",var.equal = FALSE)
wilcox.test(x1,x2,alternative = "two.sided")

the bootstrap test needs a few lines of code.

Hypothesis Test on the Equality of Variances

We learned two methods to verify the hypothesis on the equality of variances: using the F-distribution and using the bootstrap method.

For the F-distribution method, the null and the alternate hypothesis are 

H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2}

H_{A}:\sigma_{1}^{2} \neq \sigma_{2}^{2}

The test-statistic is the ratio of the sample variances, f_{0}=\frac{s_{1}^{2}}{s_{2}^{2}}

We evaluate the hypothesis based on where this test-statistic lies in the null distribution or how likely it is to find a value as large as this test-statistic f_{0} in the null distribution.

Execute the following lines.

# 1. F-Test
  f0 = x1var/x2var
  
  df_numerator = n1-1
  
  df_denominator = n2-1
  
  pval = 1-pf(f0,df1=df_numerator,df2=df_denominator)
  
  print(f0)
  print(df_numerator)
  print(df_denominator)
  print(pval)

The test-statistic, i.e., the ratio of the sample variances, is 1.814.

The p-value, which, in this case, is the probability of finding a value greater than the test-statistic (P(F > f_{0})), is 0.0562. When compared to a one-sided rejection rate (\frac{\alpha}{2}) of 0.025, we cannot reject the null hypothesis.

We do have a one-liner command.

var.test(feb_refuse_data,aug_refuse_data,alternative = "two.sided")

You will find the following message when you execute the var.test() command.

F test to compare two variances
data: feb_refuse_data and aug_refuse_data
F = 1.814, num df = 30, denom df = 29, p-value = 0.1125
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.8669621 3.7778616
sample estimates:
ratio of variances
1.813959

As usual, with the standard R function, we need to compare the p-value (which is two times what we computed based on the right-tail probability) with \alpha=0.05. Since 0.1125 > 0.05, we cannot reject the null hypothesis that the variances are equal.
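You can confirm this doubling too; twice our right-tail probability recovers the reported value.

2*(1-pf(f0,df1=df_numerator,df2=df_denominator)) # returns 0.1125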


Can we come to the same conclusion using the bootstrap test? Let’s try. Execute the following lines in R.

# 2. Bootstrap
 N = 10000
   null_var = matrix(0,nrow=N,ncol=1)
   null_var_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
     null_var_ratio[i] = var(xboot)/var(yboot) 
     if(var(xboot)>var(yboot)){null_var[i]=1} 
   }
 
 pvalue_var = sum(null_var)/N
 hist(null_var_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="XVar/YVar",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(2,500,paste("p-value=",pvalue_var),col="red")

Your RStudio interface should look like this if you correctly coded up and executed the lines. The code provides a way to visualize the null distribution.

The p-value is 0.7945. 7945 out of the 10,000 bootstrap replicates had s^{2}_{Feb}>s^{2}_{Aug}. For a 5% rate of error (\alpha=5\%), we cannot reject the null hypothesis since the p-value lies between 0.025 and 0.975. The evidence (20.55% of the time) that the February tonnage variance is less than the August tonnage variance is not sufficient to reject equality.

In summary, use var.test(x1,x2,alternative = "two.sided") if you want to verify based on F-distribution, or code up a few lines if you want to verify using the bootstrap method.

Hypothesis Test on the Difference in Proportions

For the hypothesis test on the difference in proportions, we can employ Fisher’s Exact Test, use the normal approximation under the large-sample assumption, or the bootstrap method.

For Fisher’s Exact Test, the test-statistic follows a hypergeometric distribution when H_{0} is true. We could assume that the number of successes is fixed at t=x_{1}+x_{2}, and, for a fixed value of t, we reject H_{0}:p_{1}=p_{2} for the alternate hypothesis H_{A}:p_{1}>p_{2} if there are more successes in random variable X_{1} compared to X_{2}.

The p-value can be derived under the assumption that the number of successes X=k in the first sample X_{1} has a hypergeometric distribution when H_{0} is true and conditional on a total number of t successes that can come from any of the two random variables X_{1} and X_{2}.

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}

Execute the following lines to implement this process, plot the null distribution and compute the p-value.

#Fisher's Exact Test
  x1 = length(which(feb_refuse_data < p_threshold))
  p1 = x1/n1
  
  x2 = length(which(aug_refuse_data < p_threshold))
  p2 = x2/n2
  
  N = n1+n2
  t = x1+x2
  
  k = seq(from=0,to=n1,by=1)
  p = k # initialize a vector of the same length to hold the probabilities
  for(i in 1:length(k)) 
   {
    p[i] = (choose(t,k[i])*choose((N-t),(n1-k[i])))/choose(N,n1)
   }

  plot(k,p,type="h",xlab="Number of successes in X1",ylab="P(X=k)",font=2,font.lab=2)
  points(k,p,type="o",lty=2,col="grey50")
  # k starts at 0, so index 13 corresponds to k = 12 = x1; highlight and sum the right tail P(X >= 12)
  points(k[13:length(k)],p[13:length(k)],type="o",col="red",lwd=2)
  points(k[13:length(k)],p[13:length(k)],type="h",col="red",lwd=2)
  pvalue = sum(p[13:length(k)])
  print(pvalue)

Your RStudio should look like this once you run these lines.

The p-value is 0.1539. Since it is greater than the rejection rate, we cannot reject the null hypothesis that the proportions are equal.
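As a quick cross-check of our loop, since the null distribution is hypergeometric, the same tail probability is available from R’s built-in phyper() function.

# P(X >= x1) under the hypergeometric null distribution
1 - phyper(x1-1, t, N-t, n1) # should match the p-value of 0.1539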

We can also leverage an in-built R function that does the same. Try this.

fisher_data = cbind(c(x1,x2),c((n1-x1),(n2-x2)))
fisher.test(fisher_data,alternative="greater")

In the first line, we create a 2×2 contingency table to indicate the number of successes and failures in each sample. Sample 1 has 12 successes, i.e., 12 times the tonnage in February was less than 2500 tons. Sample 2 has seven successes. So the contingency table looks like this:

    [,1]  [,2]
[1,] 12    19
[2,]  7    23

Then, the fisher.test() function implements the hypergeometric distribution calculations.

You will find the following prompt in your console.

Fisher’s Exact Test for Count Data
data: fisher_data
p-value = 0.1539
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
0.712854 Inf
sample estimates:
odds ratio
2.050285

We get the same p-value as before — we cannot reject the null hypothesis.


Under the normal approximation for large sample sizes, the test-statistic is z = \frac{\hat{p_{1}}-\hat{p_{2}}}{\sqrt{p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})}} \sim N(0,1). p is the pooled proportion, which can be computed as p = \frac{x_{1}+x_{2}}{n_{1}+n_{2}}.

We reject the null hypothesis when the p-value P(Z \ge z) is less than the rate of rejection \alpha.

Execute the following lines to set this up in R.

# Z-approximation
  p = (x1+x2)/(n1+n2)
  z = (p1-p2)/sqrt(p*(1-p)*((1/n1)+(1/n2)))
  pval = 1-pnorm(z)

The p-value is 0.0974. Since it is greater than 0.05, we cannot reject the null hypothesis.


Finally, we can use the bootstrap method as follows.

# Bootstrap
 N = 10000
 null_prop = matrix(0,nrow=N,ncol=1)
 null_prop_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
 
     p1boot = length(which(xboot < p_threshold))/n1 
     p2boot = length(which(yboot < p_threshold))/n2 

     null_prop_ratio[i] = p1boot/p2boot 
    if(p1boot>p2boot){null_prop[i]=1} 
  }
 
 pvalue_prop = sum(null_prop)/N
 hist(null_prop_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="P1/P2",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(2,250,paste("p-value=",pvalue_prop),col="red")

We evaluated whether the low trash situation in February is different than August.

H_{0}: P(p_{Feb}>p_{Aug}) = 0.5

H_{A}: P(p_{Feb}>p_{Aug}) > 0.5

The p-value is 0.8941. 8941 out of the 10,000 bootstrap replicates had p_{Feb}>p_{Aug}. For a 5% rate of error (\alpha=5\%), we cannot reject the null hypothesis since the p-value is not greater than 0.95. The evidence (89.41% of the time) that the low trash situation is more frequent in February than in August is insufficient to reject equality.

Your RStudio interface will look like this.

In summary, 
use fisher.test(contingency_table,alternative="greater") 
if you want to verify based on Fisher's Exact Test; 

use p = (x1+x2)/(n1+n2); 
    z = (p1-p2)/sqrt(p*(1-p)*((1/n1)+(1/n2))); 
    pval = 1-pnorm(z) 
if you want to verify based on the normal approximation; 

code up a few lines if you want to verify using the bootstrap method.

  1. Is the mean of the February refuse collected from Manhattan’s community district 9 different from August?
    • Reject the null hypothesis. They are different.
  2. Is the variance of the February refuse collected from Manhattan’s community district 9 different from August?
    • Cannot reject the null hypothesis. They are probably the same.
  3. Suppose we set 2500 tons as a benchmark for low trash situations: is the proportion of February refuse lower than 2500 tons different from that of August? In other words, is the low trash situation in February different from August?
    • Cannot reject the null hypothesis. They are probably the same.

Please take the next two weeks to digest all the two-sample hypothesis tests, including how to execute them in R.

Here is the full code for today’s lesson.

In a little over four years, we reached 99 lessons 😎

I will have something different for the 100th. Stay tuned …

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 98 – The Two-Sample Hypothesis Tests using the Bootstrap

Two-Sample Hypothesis Tests – Part VII

H_{0}: P(\theta_{x}>\theta_{y}) = 0.5

H_{A}: P(\theta_{x}>\theta_{y}) > 0.5

H_{A}: P(\theta_{x}>\theta_{y}) < 0.5

H_{A}: P(\theta_{x}>\theta_{y}) \neq 0.5

These days, a peek out of the window is greeted by chilling rain or warm snow. On days when it is not raining or snowing, there is biting cold. So we gaze at our bicycles, waiting for that pleasant month of April when we can joyfully bike — to work, or for pleasure.

Speaking of bikes, since I have nothing much to do today except watch the snow, I decided to explore some data from our favorite “Open Data for All New Yorkers” page.

Interestingly, I found data on the bicycle counts for East River Bridges. New York City DOT keeps track of the daily total of bike counts on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge.

I could find the data for April to October during 2016 and 2017. Here is how the data for April 2017 looks.

They highlight all non-holiday weekdays with no precipitation in yellow.

Being a frequent biker on the Manhattan Bridge, my curiosity got kindled. I wanted to verify how different the total bike counts on the Manhattan Bridge are from the Williamsburg Bridge.

At the same time, I also wanted to share the benefits of the bootstrap method for two-sample hypothesis tests.

To keep it simple and easy for you to follow the bootstrap method’s logical development, I will test how different the total bike counts data on Manhattan Bridge are from that of the Williamsburg Bridge during all the non-holiday weekdays with no precipitation.

Here is the data of the total bike counts on Manhattan Bridge during all the non-holiday weekdays with no precipitation in April of 2017 — essentially, the data from the yellow-highlighted rows in the table for Manhattan Bridge.

5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774

And the data of the total bike counts on Williamsburg Bridge during all the non-holiday weekdays with no precipitation in April of 2017.

5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196

Their distributions look like this.

We are looking at the boxplots that present a nice visual of the data range and its percentiles. And we can compare one sample to another. Remember Lesson 14? There is a vertical line at 6352 bikes, the maximum number of bikes on Manhattan Bridge during weekends, holidays, or rainy days — i.e., the non-highlighted days.

I want answers to the following questions.

  • Is the mean of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge, i.e., is \bar{x}_{M}=\bar{x}_{W}?
  • Is the median of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge, i.e., is \tilde{x}_{M}=\tilde{x}_{W}?
  • Is the variance of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge, i.e., is s^{2}_{M}=s^{2}_{W}?
  • Is the interquartile range of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge, i.e., is IQR_{M}=IQR_{W}?
  • Is the proportion of the total bike counts on Manhattan Bridge less than 6352 different from that on Williamsburg Bridge, i.e., is P(M<6352)=P(W<6352), or p_{M}=p_{W}?

What do we know so far? 

We know how to test the difference in means using the t-Test under the proposition that the population variances are equal (Lesson 94) or using Welch’s t-Test when we cannot assume equality of population variances (Lesson 95). We also know how to do this using Wilcoxon’s Rank-sum Test that uses the ranking method to approximate the significance of the differences in means (Lesson 96).

We know how to test the equality of variances using F-distribution (Lesson 97).

We know how to test the difference in proportions using either Fisher’s Exact test (Lesson 92) or using the normal distribution as the null distribution under the large-sample approximation (Lesson 93).

In all these tests, we made critical assumptions on the limiting distributions of the test-statistics.

  • What is the limiting distribution of the test-statistic that computes the difference in medians?
  • What is the limiting distribution of the test-statistic that compares interquartile ranges of two populations?
  • What if we do not want to make any assumptions on data distributions or the limiting forms of the test-statistics?

Enter the Bootstrap

I would urge you to go back to Lesson 79 to get a quick refresher on the bootstrap, and Lesson 90 to recollect how we used it for the one-sample hypothesis tests.

The idea of the bootstrap is that we can generate replicates of the original sample to approximate the probability distribution function of the population. Assuming that each data value in the sample is equally likely with a probability of 1/n, we can randomly draw n values with replacement. By putting a probability of 1/n on each data point, we use the discrete empirical distribution \hat{f} as an approximation of the population distribution f.

Take the data for Manhattan Bridge.
5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these eight data points is 1/8, we can randomly draw eight numbers from these eight values — with replacement.

It is like playing the game of Bingo where the chips are these eight numbers. Each time we get a number, we put it back and roll it again until we draw eight numbers.

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774), some may appear more than one time, and some may not show up at all in a random sample.
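In R, drawing one such replicate is a one-line operation with the sample() function. Here is a minimal sketch; manhattan_bikes is just an illustrative name for the data vector.

manhattan_bikes = c(5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774)
sample(manhattan_bikes,replace=T) # eight values drawn with replacement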

Here is one such bootstrap replicate.
6359, 6359, 6359, 6052, 6774, 6359, 5276, 6359

The value 6359 appeared five times. Some values like 7247, 5054, 6691, and 5311 did not appear at all in this replicate.

Here is another replicate.
6359, 5276, 5276, 5276, 7247, 5311, 6052, 5311

Such bootstrap replicates are representations of the empirical distribution \hat{f}, i.e., the proportion of times each value in the data sample occurs. We can generate all the information contained in the true distribution by creating \hat{f}, the empirical distribution.

Using the Bootstrap for Two-Sample Hypothesis Tests

Since each bootstrap replicate is a possible representation of the population, we can compute the relevant test-statistics from this bootstrap sample. By repeating this, we can have many simulated values of the test-statistics that form the null distribution to test the hypothesis. There is no need to make any assumptions on the distributional nature of the data or the limiting distribution for the test-statistic. As long as we can compute a test-statistic from the bootstrap sample, we can test the hypothesis on any statistic — mean, median, variance, interquartile range, proportion, etc.

Let’s now use the bootstrap method for two-sample hypothesis tests. Suppose there are two random variables, X and Y, and the statistics computed from them are \theta_{x} and \theta_{y}. We may have a sample of n_{1} values representing X and a sample of n_{2} values to represent Y.

\theta_{x},\theta_{y} can be mean, median, variance, proportion, etc. Any computable statistic from the original data is of the form \theta_{x},\theta_{y}.

The null hypothesis is that there is no difference between the statistics of X and Y.

H_{0}: P(\theta_{x}>\theta_{y}) = 0.5

The alternate hypothesis is

H_{A}: P(\theta_{x}>\theta_{y}) > 0.5

or

H_{A}: P(\theta_{x}>\theta_{y}) < 0.5

or

H_{A}: P(\theta_{x}>\theta_{y}) \neq 0.5

We first create a bootstrap replicate of X and Y by randomly drawing with replacement n_{1} values from X and n_{2} values from Y.

For each bootstrap replicate i from X and Y, we compute the statistics \theta_{x} and \theta_{y} and check whether \theta_{x}>\theta_{y}. If yes, we register S_{i}=1. If not, we register S_{i}=0.

For example, one bootstrap replicate for X (Manhattan Bridge) and Y (Williamsburg Bridge) may look like this:

xboot: 6691 5311 6774 5311 6359 5311 5311 6052
yboot: 6775 6881 7341 7196 6775 7341 6775 7196

The means of this bootstrap replicate for X and Y are 5890 and 7035. Since \bar{x}^{X}_{boot}<\bar{x}^{Y}_{boot}, we register S_{i}=0.

Another bootstrap replicate for X and Y may look like this:

xboot: 6774 6359 6359 6359 6052 5054 6052 6691
yboot: 6775 7196 7196 6026 6881 7341 6881 5711

The means of this bootstrap replicate for X and Y are 6212.5 and 6750.875. Since \bar{x}^{X}_{boot}<\bar{x}^{Y}_{boot}, we register S_{i}=0.

We repeat this process of creating bootstrap replicates of X and Y, computing the statistics \theta_{x} and \theta_{y}, and verifying whether \theta_{x}>\theta_{y} and registering S_{i} \in (0,1) a large number of times, say N=10,000.

The proportion of times S_{i} = 1 in a set of N bootstrap-replicated statistics is the p-value.

p-value=\frac{1}{N}\sum_{i=1}^{i=N}S_{i}

Based on the p-value, we can use the rule of rejection at the selected rate of rejection \alpha.

For a one-sided alternate hypothesis, we reject the null hypothesis if p-value < \alpha or p-value > 1-\alpha.

For a two-sided alternate hypothesis, we reject the null hypothesis if p-value < \frac{\alpha}{2} or p-value > 1-\frac{\alpha}{2}.
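Here is a minimal sketch of this recipe in R for any computable statistic; the function name boot_two_sample and the stat argument are illustrative, not from a package.

# generic two-sample bootstrap test: p-value = P(theta_x > theta_y)
boot_two_sample = function(x,y,stat,N=10000)
 {
   S = matrix(0,nrow=N,ncol=1)
   for(i in 1:N)
    {
      xboot = sample(x,replace=T)
      yboot = sample(y,replace=T)
      if(stat(xboot)>stat(yboot)){S[i]=1}
    }
   sum(S)/N # the p-value
 }

# e.g., boot_two_sample(x,y,stat=mean), boot_two_sample(x,y,stat=median),
# or boot_two_sample(x,y,stat=IQR) for the interquartile range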

Manhattan Bridge vs. Williamsburg Bridge

Is the mean of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(\bar{x}_{M}>\bar{x}_{W}) = 0.5

H_{A}: P(\bar{x}_{M}> \bar{x}_{W}) \neq 0.5

Let’s take a two-sided alternate hypothesis.

Here is the null distribution of \frac{\bar{x}_{M}}{\bar{x}_{W}} for N = 10,000.

A vertical bar is shown at a ratio of 1 to indicate that the area beyond this value is the proportion of times S_{i} = 1 in a set of 10,000 bootstrap-replicated means.

The p-value is 0.0466. 466 out of the 10,000 bootstrap replicates had \bar{x}_{M}>\bar{x}_{W}. For a 10% rate of error (\alpha=10\%), we reject the null hypothesis since the p-value is less than 0.05. More than 95% of the time, the mean of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge; there is sufficient evidence that they are not equal.

Can we reject the null hypothesis if we select a 5% rate of error?



Is the median of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(\tilde{x}_{M}>\tilde{x}_{W}) = 0.5

H_{A}: P(\tilde{x}_{M}> \tilde{x}_{W}) \neq 0.5

The null distribution of \frac{\tilde{x}_{M}}{\tilde{x}_{W}} for N = 10,000.

The p-value is 0.1549. 1549 out of the 10,000 bootstrap replicates had \tilde{x}_{M}>\tilde{x}_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value is greater than 0.05. The evidence (84.51% of the times) that the median of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.



Is the variance of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(s^{2}_{M}>s^{2}_{W}) = 0.5

H_{A}: P(s^{2}_{M}>s^{2}_{W}) \neq 0.5

The null distribution of \sqrt{\frac{s^{2}_{M}}{s^{2}_{W}}} for N = 10,000. We are looking at the null distribution of the ratio of the standard deviations.

The p-value is 0.4839. 4839 out of the 10,000 bootstrap replicates had s^{2}_{M}>s^{2}_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value is greater than 0.05. The evidence (51.61% of the times) that the variance of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.



Is the interquartile range of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(IQR_{M}>IQR_{W}) = 0.5

H_{A}: P(IQR_{M}>IQR_{W}) \neq 0.5

The null distribution of \frac{IQR_{M}}{IQR_{W}} for N = 10,000. It does not resemble any known distribution, but that does not restrain us since the bootstrap-based hypothesis test is distribution-free.

The p-value is 0.5453. 5453 out of the 10,000 bootstrap replicates had IQR_{M}>IQR_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value is greater than 0.05. The evidence (45.47% of the times) that the interquartile range of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.



Finally, is the proportion of the total bike counts on Manhattan Bridge less than 6352, different from that on Williamsburg Bridge?

H_{0}: P(p_{M}>p_{W}) = 0.5

H_{A}: P(p_{M}>p_{W}) \neq 0.5

The null distribution of \frac{p_{M}}{p_{W}} for N = 10,000.

The p-value is 0.5991. 5991 out of the 10,000 bootstrap replicates had p_{M}>p_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value is greater than 0.05. The evidence (40.09% of the times) that the proportion of the total bike counts less than 6352 on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.



Can you see the bootstrap concept’s flexibility and how widely we can apply it for hypothesis testing? Just remember that the underlying assumption is that the data are independent. 

To summarize,

Repeatedly sample with replacement from original samples of X and Y -- N times.
Each time draw a sample of size n_{1} from X and a sample of size n_{2} from Y.
Compute the desired statistic (mean, median, skew, etc.) from each bootstrap sample.
The null hypothesis P(\theta_{x}>\theta_{y})=0.5 can now be tested as follows:
  • S_{i}=1 if \theta_{x}>\theta_{y}, else, S_{i}=0
  • p-value=\frac{1}{N}\sum_{i=1}^{N}S_{i} (average over all N bootstrap-replicated statistics)
  • If p-value < \frac{\alpha}{2} or p-value > 1-\frac{\alpha}{2}, reject the null hypothesis for a two-sided hypothesis test at a selected rejection rate of \alpha.
  • If p-value < \alpha, reject the null hypothesis for a left-sided hypothesis test at a selected rejection rate of \alpha.
  • If p-value > 1-\alpha, reject the null hypothesis for a right-sided hypothesis test at a selected rejection rate of \alpha.

After seven lessons, we are now equipped with all the theory of the two-sample hypothesis tests. It is time to put them to practice. Dust off your programming machines and get set.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 90 – The One-Sample Hypothesis Tests Using the Bootstrap

Hypothesis Tests – Part V

H_{0}: P(\theta > \theta^{*}) = 0.5

H_{A}: P(\theta > \theta^{*}) > 0.5

H_{A}: P(\theta > \theta^{*}) < 0.5

H_{A}: P(\theta > \theta^{*}) \neq 0.5

Jenny and Joe meet after 18 lessons.

I heard you are neck-deep into the hypothesis testing concepts.

Yes. And I am having fun learning about how to test various hypotheses, be it on the mean, on the standard deviation, or the proportion. It is also enlightening to learn how to approximate the null distribution using the limiting distribution concepts.

True. You have seen in lesson 86 — hypothesis tests on the proportion, that the null distribution is a Binomial distribution with n, the sample size, and p, the proportion being tested.

You have seen in lesson 87 — hypothesis tests on the mean, that the null distribution is a T-distribution because \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \sim t_{df=n-1}

You have seen in lesson 88 — hypothesis tests on the variance, that the null distribution is a Chi-square distribution because \frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{df=n-1}

Did you ever wonder what if the test statistic is more complicated mathematically than the mean or the variance or the proportion and if its limiting distribution or the null distribution is hard to derive?

Or, did you ever ask what if the assumptions that go into deriving the null distribution are not met or not fully satisfied?

Can you give an example?

Suppose you want to test the hypothesis on the median or the interquartile range or the skewness of a distribution?

Or, if you are unsure about the distributional nature of the sample data? For instance, the assumption that \frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{df=n-1} is based on the premise that the sample data is normally distributed.

😕 There are non-parametric or distribution-free approaches? I remember Devine mentioning a bootstrap approach.

😎 In lesson 79, we learned about the concept of the bootstrap. Using the idea of the bootstrap, we can generate replicates of the original sample to approximate the probability distribution function of the population. Assuming that each data value in the sample is equally likely (with a probability of 1/n), we can randomly draw n values with replacement. By putting a probability of 1/n on each data point, we use the discrete empirical distribution \hat{f} as an approximation of the population distribution f.

Hmm. Since each bootstrap replicate is a possible representation of the population, we can compute the test statistics from this bootstrap sample. And, by repeating this, we can have many simulated values of the test statistics to create the null distribution against which we can test the hypothesis.

Exactly. No need to make any assumption on the distributional nature of the data at hand, or the kind of the limiting distribution for the test statistic. We can compute any test statistic from the bootstrap replicates and test the basis value using this simulated null distribution. You want to test the hypothesis on the median, go for it. On the skewness or geometric mean, no problem.

This sounds like an exciting approach that will free up the limitations. Why don’t we do it step by step and elaborate on the details. Our readers will appreciate it.

Absolutely. Do you want to use your professor’s hypothesis that the standard deviation of his class’s performance is 16.5 points, as a case in point?

Sure. In a recent conversation, he also revealed that the mean and median scores are 54 and 55 points, respectively and that 20% of his class usually get a score of less than 40.

Aha. We can test all four hypotheses then. Let’s take the sample data.

60, 41, 70, 61, 69, 95, 33, 34, 82, 82

Yes, this is a sample of ten exam scores from our most recent test with him.

Let’s first review the concept of the bootstrap. We have the following data.

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these ten data points is 1/10, we can randomly draw ten numbers from these ten values — with replacement.

Yes, I can recall from lesson 79 that this is like playing the game of Bingo where the chips are these ten numbers. Each time we get a number, we put it back and roll it again until we draw ten numbers.

Yes. For real computations, we use a computer program that has this algorithm coded in. We draw a random number from a uniform distribution (f(u)) where u is between 0 and 1. These randomly drawn u's are mapped onto the ranked data to draw a specific value from the set. For example, in a set of ten values, for a randomly drawn u of 0.1, we can draw the first value in order — 33.
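Here is a minimal sketch of that mapping in R, using the ten exam scores from our test; in practice, the built-in sample(scores, replace = T) achieves the same result.

scores = c(60, 41, 70, 61, 69, 95, 33, 34, 82, 82)
n = length(scores)
u = runif(n) # ten random numbers between 0 and 1
sort(scores)[ceiling(u*n)] # for u = 0.1, ceiling(0.1*10) = 1 picks the first ordered value, 33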

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (60, 41, 70, 61, 69, 95, 33, 34, 82, 82), some may appear more than one time, and some may not show up at all in a random sample.

Let me create a bootstrap replicate.

See, 70 appeared two times, 82 appeared three times, and 33 did not get selected at all.

Such bootstrap replicates are representations of the empirical distribution \hat{f}. The empirical distribution \hat{f} is the proportion of times each value in the data sample x_{1}, x_{2}, x_{3}, …, x_{n} occurs. If we assume that the data sample has been generated by randomly sampling from the true distribution, then, the empirical distribution (i.e., the observed frequency) \hat{f} is a sufficient statistic for the true distribution f.

In other words, all the information contained in the true distribution can be generated by creating \hat{f}, the empirical distribution.

Yes. Since an unknown population distribution f has produced the observed data x_{1}, x_{2}, x_{3}, …, x_{n}, we can use the observed data to approximate f by its empirical distribution \hat{f} and then use \hat{f} to generate bootstrap replicates of the data.

How do we implement the hypothesis test then?

Using the same hypothesis testing framework. We first establish the null and the alternative hypothesis.

H_{0}: P(\theta > \theta^{*}) = 0.5

\theta is the test statistic computed from the bootstrap replicate and \theta^{*} is the basis value that we are testing. For example, a standard deviation of 16.5 is \theta^{*}, and the standard deviation computed from one bootstrap sample is \theta.

The alternate hypothesis is then,

H_{A}: P(\theta > \theta^{*}) > 0.5

or

H_{A}: P(\theta > \theta^{*}) < 0.5

or

H_{A}: P(\theta > \theta^{*}) \neq 0.5

Essentially, for each bootstrap replicate i, we check whether \theta_{i} > \theta^{*}. If yes, we register S_{i}=1. If not, we register S_{i}=0.

Now, we can repeat this process, i.e., creating a bootstrap replicate, computing the test statistic and verifying whether \theta_{i} > \theta^{*} or S_{i} \in (0,1) a large number of times, say N = 10,000. The proportion of times S_{i} = 1 in a set of N bootstrap-replicated test statistics is the p-value.

And we can apply the rule of rejection if the p-value < \alpha, the selected rate of rejection.

Correct. That is for a one-sided hypothesis test. If it is a two-sided hypothesis test, we use the rule \frac{\alpha}{2} \le p-value \le 1-\frac{\alpha}{2} for non-rejection, i.e., we cannot reject the null hypothesis if the p-value is between \frac{\alpha}{2} and 1-\frac{\alpha}{2}.

Great! For the first bootstrap sample, if we were to verify the four hypotheses, we register the following.

Since the bootstrap sample mean \bar{x}_{boot}=67.8 is greater than the basis of 54, we register S_{i}=1.

Since the bootstrap sample median \tilde{x}_{boot}=69.5 is greater than the basis of 55, we register S_{i}=1.

Since the bootstrap sample standard deviation \sigma_{boot}=14.46 is less than the basis of 16.5, we register S_{i}=0.

Finally, since the bootstrap sample proportion p_{boot}=0.1 is less than the basis of 0.2, we register S_{i}=0.

We do this for a large number of bootstrap samples. Here is an illustration of the test statistics for three bootstrap replicates.

Let me run the hypothesis test on the mean using N = 10,000. I am creating 10,000 bootstrap-replicated test statistics.
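A minimal sketch of these computations looks like this; the variable names are illustrative.

# one-sample bootstrap test on the mean, H0: P(xbar > 54) = 0.5
scores = c(60, 41, 70, 61, 69, 95, 33, 34, 82, 82)
N = 10000
S = matrix(0,nrow=N,ncol=1)
xbar_null = matrix(0,nrow=N,ncol=1)
for(i in 1:N)
 {
   xboot = sample(scores,replace=T)
   xbar_null[i] = mean(xboot) # the null distribution of the mean
   if(mean(xboot)>54){S[i]=1}
 }
pvalue = sum(S)/N
# swap mean() for median(), sd(), or a proportion to test the other hypotheses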

The distribution of the test statistics is the null distribution of the mean. Notice that it resembles a normal distribution. The basis value of 54 is shown using a blue square on the distribution. From the null distribution, the proportion of times S_{i}=1 is 0.91. 91% of the \bar{x} test statistics are greater than 54.

Our null hypothesis is
H_{0}: P(\bar{x} > 54) = 0.5

Our alternate hypothesis is one-sided. H_{A}: P(\bar{x} > 54) < 0.5

Since the p-value is greater than a 5% rejection rate, we cannot reject the null hypothesis.

If the basis value \mu is far out on the null distribution of \bar{x} that less than 5% of the bootstrap-replicated test statistics are greater than \mu, we would have rejected the null hypothesis.

Shall we run the hypothesis test on the median?

H_{0}: P(\tilde{x} > 55) = 0.5

H_{A}: P(\tilde{x} > 55) < 0.5

Again, a one-sided test.

Sure. Here is the answer.

We can see the null distribution of the test statistic (median from the bootstrap samples) along with the basis value of 55.

86% of the test statistics are greater than this basis value. Hence, we cannot reject the null hypothesis.

The null distribution of the test statistic does not resemble any known distribution.

Yes. Since the bootstrap-based hypothesis test is distribution-free (non-parametric), not knowing the nature of the limiting distribution of the test statistic (median) does not restrain us.

Awesome. Let me also run the test for the standard deviation.

H_{0}: P(\sigma > 16.5) = 0.5

H_{A}: P(\sigma > 16.5) \ne 0.5

I am taking a two-sided test since a deviation in either direction, i.e., too small a standard deviation or too large of a standard deviation will disprove the hypothesis.

Here is the result.

The p-value is 0.85, i.e., 85% of the bootstrap-replicated test statistics are greater than 16.5. Since the p-value is greater than the acceptable rate of rejection, we cannot reject the null hypothesis.

If the p-value were less than 0.025 or greater than 0.975, then we would have rejected the null hypothesis.

For a p-value of 0.025, 97.5% of the bootstrap-replicated standard deviations will be less than 16.5 — strong evidence that the null distribution produces values much less than 16.5. For a p-value of 0.975, 97.5% of the bootstrap-replicated standard deviations will be greater than 16.5 — strong evidence that the null distribution produces values much greater than 16.5. In either of the sides, we reject the null hypothesis that the standard deviation is 16.5.

Let me complete the hypothesis test on the proportion.

H_{0}: P(p > 0.2) = 0.5

H_{A}: P(p > 0.2) \ne 0.5

Let’s take a two-sided test since deviation in either direction can disprove the null hypothesis. If we get a tiny proportion or a very high proportion compared to 0.2, we will reject the belief that the percentage of students obtaining a score of less than 40 is 0.2.

Here are the null distribution and the result from the test.

The p-value is 0.32. 3200 out of the 10000 bootstrap-replicated proportions are greater than 0.2. Since it is between 0.025 and 0.975, we cannot reject the null hypothesis.

You can see how widely the bootstrap concept can be applied for hypothesis testing and what flexibility it provides.

To summarize:

Repeatedly sample with replacement from the original sample data. 
Each time, draw a sample of size n.
Compute the desired statistic from each bootstrap sample.
(mean, median, standard deviation, interquartile range,
proportion, skewness, etc.)
Null hypothesis P(\theta > \theta^{*}) = 0.5 can now be tested as follows: 
S_{i}=1 if \theta_{i} > \theta^{*}, else, S_{i}=0

p-value = \frac{1}{N}\sum_{i=1}^{i=N} S_{i}
(average over all N bootstrap-replicated test statistics)

If p-value < \frac{\alpha}{2} or p-value > 1-\frac{\alpha}{2}, reject the null hypothesis
(for a two-sided hypothesis test at a selected rejection rate of \alpha)

If p-value < \alpha, reject the null hypothesis
(for a left-sided hypothesis test at a rejection rate of \alpha)

If p-value > 1 - \alpha, reject the null hypothesis
(for a right-sided hypothesis test at a rejection rate of \alpha)

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 79 – Pull yourself up by your bootstraps

Over the past several lessons we have been learning about estimates, standard error of the estimates and confidence interval of the estimates.

We have been using the ‘sample’ to estimate the true value of the parameter. What we estimate from the sample will enable us to obtain the closest answer in some sense for the true unknown population parameter.

For example, the mean \bar{x}, variance s^{2}, or proportion \hat{p} computed from the sample data are good guesses (estimates or estimators) of the mean \mu, variance \sigma^{2} and proportion p of the population.

We also know that when we think of an estimate, we think of an interval or a probability distribution, instead of a point value. The truth may be in this interval if we have good representative samples, i.e., if the sample distribution is similar to the population distribution.

Assumptions or Approximations

In this inferential journey, to compute the standard error or to derive the confidence interval of the estimates, we have been making some assumptions and approximations that are deemed reasonable.

For example, it is reasonable to assume a normal distribution for the sample mean \bar{x}.

\bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}})

The sample mean is an unbiased estimate of the true mean, so the expected value of the sample mean is equal to the truth. E[\bar{x}]=\mu.

The standard deviation of the sample mean, or the standard error of the estimate is \frac{\sigma}{\sqrt{n}}.

This visual should be handy.

To derive the confidence interval of the variance and standard deviation, we assumed that \frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n-1}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n-1}{2}-1)!}

Depending on the degrees of freedom, the distribution of \frac{(n-1)s^{2}}{\sigma^{2}} looks like this.

Most recently, we assumed that the estimate for proportion \hat{p} can be approximated by a normal distribution.

\hat{p} \sim N(p, \frac{p(1-p)}{n})

We derived the confidence interval of the population proportion as [\hat{p} - Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}], based on this assumption.

Let’s examine this assumption once again.

In a sample of size n, proportion can be estimated as \hat{p} = \frac{S_{n}}{n}, where S_{n} is the number of favorable instances for the thing we are measuring. \hat{p} can be approximated to a normal distribution since S_{n} can be approximated to a normal distribution.

If we take Bernoulli random variables (0,1) for X_{1}, X_{2}, X_{3}, …, X_{n}, S_{n} = X_{1} + X_{2} + X_{3} + … + X_{n}, the number of successes, follows a Binomial distribution f(x) = \frac{n!}{(n-x)!x!}p^{x}(1-p)^{n-x}.

For a large enough sample size n, the distribution function of S_{n} can be well-approximated by the normal distribution.

Let’s do some experiments and see if this is reasonable.

Look at this animation. I am showing the Binomial probability distribution function for p = 0.5 while n increases from 10 to 100.

It looks like an approximation to a normal distribution is very reasonable.
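If you want to reproduce a frame of this animation, here is a minimal sketch in R that overlays the normal density with mean np and standard deviation \sqrt{np(1-p)} on the Binomial probabilities.

# Binomial(n, p) probabilities vs. the normal approximation
n = 100; p = 0.5
x = 0:n
plot(x,dbinom(x,n,p),type="h",xlab="Sn",ylab="P(Sn = x)")
lines(x,dnorm(x,mean=n*p,sd=sqrt(n*p*(1-p))),col="red",lwd=2)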

Now, look at these two animations that show the Binomial probability function for p = 0.1 and p = 0.95, i.e., when p is near the boundaries.

Clearly, the distributions exhibit skew and are not symmetric. An approximation to normal distribution even for large values of n, i.e., a big sample, is not appropriate.

How then can we be confident about the standard error and the confidence intervals?

For that matter, how can we derive the standard error or the intervals of a parameter whose limiting form is not known, or mathematically very complicated?

Enter the Bootstrap

Bradley Efron invented a computer-based method, the bootstrap, for estimating the standard error and the confidence intervals of parameters. There is no need for any theoretical calculations or underlying assumptions regarding the mathematical structure or the type of the distribution of the parameter. Instead, bootstrap samples from the data are used.

What is a bootstrap sample?

Suppose we have a data sample x_{1}, x_{2}, …, x_{n}. A bootstrap sample is a random sample of size n drawn with replacement from these n data points.

Imagine we have the following data as the concentrations of Ozone measured at seven locations in New York City: 28.4, 28.6, 27.5, 28.7, 26.7, 26.3, and 27.7.

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these seven data points is 1/7, we can randomly draw seven numbers from these seven values.

Imagine you are playing the game of Bingo, and these seven numbers are chips in your rolling machine. The only difference is that each time you draw a number, you record it and put it back in the roller, repeating until you have drawn seven numbers. Sample with replacement.

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (28.4, 28.6, 27.5, 28.7, 26.7, 26.3, and 27.7); some may appear more than once, and some may not appear at all in a given random sample.

I played this using the roller. Here is a bootstrap sample from the original numbers.

As you can see, 28.4 and 26.7 each appeared one time; 28.6, 27.5, and 28.7 did not appear; 26.3 appeared two times; and 27.7 appeared three times.

Here are two more bootstrap samples like that.
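
In R, the Bingo roller is one line: the sample function with replace = TRUE. A minimal sketch with the Ozone numbers:

# one bootstrap sample: seven values drawn with replacement from the data
ozone <- c(28.4, 28.6, 27.5, 28.7, 26.7, 26.3, 27.7)
sample(ozone, size = length(ozone), replace = TRUE)

Each run of the last line produces a new bootstrap sample.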

Basis

The basis for bootstrap samples is that the sample data can be used to approximate the probability distribution function of the population. As you saw before, by putting a probability of 1/n on each data point, we use the discrete empirical distribution \hat{f} as an approximation of the population distribution f.

Take a very simple example of rolling two dice in the game of Monopoly. The true probability distribution f of the count (dice 1 + dice 2) is based on the fact that there are 11 possible outcomes, and the likelihood of each outcome is the number of ways we can get that count divided by 36. An outcome of 2 can only be achieved with (1,1). Hence, the probability of getting 2 is 1/36.

Suppose we roll the dice a hundred times and record the total count each time. We can then use the observed frequencies of the outcomes from this sample data to approximate the actual probability distribution.

Look at these 100 counts as outcomes of rolling two dice 100 times.

The frequency plot shown in black lines closely approximates the true frequency shown in red.
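
Here is a sketch of that experiment in R. The empirical frequencies from 100 simulated rolls approximate the true probabilities, and they get closer as the number of rolls grows.

# simulate 100 rolls of two dice and compare observed and true frequencies
counts <- replicate(100, sum(sample(1:6, 2, replace = TRUE)))
table(counts) / 100                               # empirical distribution f-hat
round(c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1) / 36, 3) # true probabilities for 2 through 12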

The empirical distribution \hat{f} is the proportion of times each value in the data sample x_{1}, x_{2}, …, x_{n} occurs. The observed frequency \hat{f} is a sufficient statistic for the true distribution f with an assumption that the data have been generated by random sampling from the true distribution f. All the information of the true distribution f is contained in the empirical distribution \hat{f}.

An unknown population distribution f has produced the observed data x_{1}, x_{2}, …, x_{n}. We can use the observed data to approximate f by its empirical distribution \hat{f} and then use the empirical distribution to generate bootstrap replicates of the data. Since f generated x, \hat{f} can be used to generate the bootstrap samples.

f gave us x_{1}, x_{2}, …, x_{n}; the data can be used to estimate \hat{f}; and \hat{f} will be used to generate a bootstrap sample.

This is the basis.

Bootstrap Replicates

Once we generate enough bootstrap samples, we can apply the estimators (the formulas used to estimate the parameters) to these samples. For example, if we want to represent the true population mean \mu, we can apply the equation for the sample mean \bar{x} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} to each of these bootstrap samples to generate bootstrap replicates of the mean.

If we want to represent the true population variance using an interval, we can apply s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2} on these bootstrap samples to generate replicates of the variance.

Likewise, if we want an interval for the true proportion, we apply \hat{p} = \frac{S_{n}}{n} on the bootstrap samples to get replicates of the proportion.
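
As a concrete sketch, with a hypothetical 0/1 sample, the proportion replicate is just the mean of a bootstrap sample:

# bootstrap replicates of a proportion from hypothetical 0/1 data
x <- c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1)   # made-up Bernoulli sample
boot_props <- replicate(200, mean(sample(x, replace = TRUE)))
hist(boot_props)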

Each bootstrap sample will produce a replicate of the parameter. Efron prescribes anywhere between 25 and 200 bootstrap replications for a good approximation of the limiting distribution of the estimate. As the number of bootstrap replicates approaches infinity, the standard error, as measured by the standard deviation of these replicates, will approach the true standard error.

Let’s look at the bootstrap replicates of the sample mean and the sample standard deviation for the Ozone data for which we used the Bingo machine to generate the bootstrap samples. In a later coding lesson, we will learn how to do it using simple functions in RStudio.

For bootstrap sample 1, the sample mean is 27.26 and the sample standard deviation is 0.82.

For bootstrap sample 2, the sample mean is 27.61 and the sample standard deviation is 0.708.

I do this 200 times. Here is what the distribution of the sample mean (\bar{x}) obtained from 200 bootstrap replicates looks like.

Here is the distribution of the sample standard deviation.
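
For the curious, here is a minimal preview sketch of how those 200 replicates can be generated; we will go through the details in the coding lesson.

# 200 bootstrap replicates of the sample mean and the sample standard deviation
ozone <- c(28.4, 28.6, 27.5, 28.7, 26.7, 26.3, 27.7)
boot_means <- replicate(200, mean(sample(ozone, replace = TRUE)))
boot_sds   <- replicate(200, sd(sample(ozone, replace = TRUE)))
hist(boot_means)  # distribution of the replicates of the mean
hist(boot_sds)    # distribution of the replicates of the standard deviation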

Like this, we can develop intervals for any type of parameter by applying the relevant estimator to the bootstrap samples.

Bootstrap Confidence Interval

Finally, we can use the percentiles of the bootstrap replicates as the confidence limits of the parameter.

Take a 90% confidence interval for instance. From the bootstrap replicates, we can say that there is a 90% probability that the true mean \mu will be between \bar{x}_{[5]} and \bar{x}_{[95]}, the 5th and the 95th percentiles of the bootstrap replicates.

P(\bar{x}_{[5]} \le \mu \le \bar{x}_{[95]}) = 0.90

For the Ozone example, the 90% confidence interval of the true mean is [27.114, 28.286] and the 90% confidence interval of the true standard deviation is [0.531, 1.087].

Look at these plots.

We can define a 100(1-\alpha)% bootstrap confidence interval for the true mean \mu as [l, u] = [\bar{x}_{[\frac{\alpha}{2}]}, \bar{x}_{[1-\frac{\alpha}{2}]}].
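
In R, the percentile limits come straight from the quantile function. Continuing the sketch above:

# 90% bootstrap confidence intervals from the percentiles of the replicates
quantile(boot_means, probs = c(0.05, 0.95))  # interval for the true mean
quantile(boot_sds, probs = c(0.05, 0.95))    # interval for the true standard deviation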

There are many more uses of the bootstrap, and we will have much fun with it in due course.

But for today, let’s end with Efron’s statement on what other names were suggested to him for his method.

“I also wish to thank the many friends who suggested names more colorful than Bootstrap, including Swiss Army Knife, Meat Axe, Swan-Dive, Jack-Rabbit, and my personal favorite, the Shotgun, which, to paraphrase Tukey, “can blow the head off any problem if the statistician can stand the resulting mess.””

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
