Lesson 99 – The Two-Sample Hypothesis Tests in R

Over the past seven lessons, we equipped ourselves with the necessary theory of the two-sample hypothesis tests.

Lessons 92 and 93 were about the hypothesis test on the difference in proportions. In Lesson 92, we learned Fisher’s Exact Test to verify the difference in proportions. In Lesson 93, we did the same using the normal distribution as a limiting distribution when the sample sizes are large.

Lessons 94, 95, and 96 were about the hypothesis test on the difference in means. In Lesson 94, we learned that under the proposition that the population variances of two random variables are equal, the test-statistic, which uses the pooled variance, follows a T-distribution with n_{1}+n_{2}-2 degrees of freedom. In Lesson 95, we learned that Welch’s t-Test could be used when we cannot make the equality of population variances assumption. In Lesson 96, we learned Wilcoxon’s Rank-sum Test, which uses a ranking approach to approximate the significance of the differences in means. This method is data-driven, needs no assumptions on the limiting distribution, and works well for small sample sizes.

Lesson 97 was about the hypothesis test on the equality of variances. Here we got a sneak peek into a new distribution called the F-distribution, which is the converging distribution of the ratio of two Chi-square distributions divided by their respective degrees of freedom. The test-statistic is the ratio of the sample variances, which we can verify against an F-distribution with n_{1}-1 numerator degrees of freedom and n_{2}-1 denominator degrees of freedom.

All these hypothesis tests can also be done using the bootstrap approach, which we learned in Lesson 98. It uses the data at hand to generate the null distribution of any desired statistic as long as it is computable from the data. It is very flexible. There is no need to make any assumptions on the data’s distributional nature or the limiting distribution for the test-statistic.

It is now time to put all of this learning to practice.


The City of New York Department of Sanitation (DSNY) archives data on the monthly tonnage of trash collected from NYC residences and institutions.

To help us with today’s workout, I dug up our gold mine, the “Open Data for All New Yorkers” page, and found an interesting dataset on the monthly tonnage of trash collected from NYC residences and institutions.

Here is a preview.

The data includes details on when (year and month) DSNY collected the trash, from where (borough and community district), and how much (tonnage of refuse, source-separated recyclable paper, and source-separated metal, glass, plastic, and beverage cartons). They also have other seasonal data such as the tonnage of leaves collected in November and December and tons of Christmas trees collected from curbside in January.

For our workout today, we can use this file named “DSNY_Monthly_Tonnage_Data.csv.” It is a slightly modified version of the file you will find from the open data page.

My interest is in Manhattan’s community district 9 — Morningside Heights, Manhattanville, and Hamilton Heights.

I want to compare the tonnage of trash collected in winter to that in summer. Let’s say we compare the data in Manhattan’s community district 9 for February and August to keep it simple.

Are you ready?

The Basic Steps

Step 1: Get the data

You can get the data from here. The file is named DSNY_Monthly_Tonnage_Data.csv. It is a comma-separated values file which we can conveniently read into the R workspace. 

Step 2: Create a new folder on your computer

And call this folder “lesson99.”
Make sure that the data file “DSNY_Monthly_Tonnage_Data.csv” is in this folder.

Step 3: Create a new code in R

Create a new code for this lesson. “File >> New >> R script”. Save the code in the same folder “lesson99” using the “save” button or by using “Ctrl+S.” Use .R as the extension — “lesson99_code.R”.

Step 4: Choose your working directory

“lesson99” is the folder where we stored the code and the input data file. Use setwd("path") to set the path to this folder. Execute the line by clicking the “Run” button on the top right. 

setwd("path to your folder")

Or, if you opened this “lesson99_code.R” file from within your “lesson99” folder, you can also use the following line to set your working directory.

setwd(getwd())

getwd() returns the current working directory. If you opened the file from within your “lesson99” folder, getwd() will already point to the working directory you want to set.

Step 5: Read the data into the R workspace

Execute the following command to read your input .csv file.

# Read the data file
nyc_trash_data = read.csv("DSNY_Monthly_Tonnage_Data.csv",header=T)

Since the input file has a header for each column, we should have header=T in the command when we read the file so that the first row is treated as column names and all the data rows are correctly read.

Your RStudio interface will look like this when you execute this line and open the file from the Global Environment.

Step 6: Extract a subset of the data

As I said before, for today’s workout, let’s use the data from Manhattan’s community district 9 for February and August. Execute the following lines to extract this subset of the data.

# Extract February refuse data for Manhattan's community district 9 
feb_index = which((nyc_trash_data$BOROUGH=="Manhattan") & (nyc_trash_data$COMMUNITYDISTRICT==9) & (nyc_trash_data$MONTH==2))

feb_refuse_data = nyc_trash_data$REFUSETONSCOLLECTED[feb_index]

# Extract August refuse data for Manhattan's community district 9
aug_index = which((nyc_trash_data$BOROUGH=="Manhattan") & (nyc_trash_data$COMMUNITYDISTRICT==9) & (nyc_trash_data$MONTH==8))

aug_refuse_data = nyc_trash_data$REFUSETONSCOLLECTED[aug_index]

In the first line, we look up the row indices of the data corresponding to Borough = Manhattan, Community District = 9, and Month = 2.

In the second line, we extract the data on refuse tons collected for these rows.

The next two lines repeat this process for August data.

Sample 1, February data for Manhattan’s community district 9 looks like this:
2953.1, 2065.8, 2668.2, 2955.4, 2799.4, 2346.7, 2359.6, 2189.4, 2766.1, 3050.7, 2175.1, 2104.0, 2853.4, 3293.2, 2579.1, 1979.6, 2749.0, 2871.9, 2612.5, 455.9, 1951.8, 2559.7, 2753.8, 2691.0, 2953.1, 3204.7, 2211.6, 2857.9, 2872.2, 1956.4, 1991.3

Sample 2, August data for Manhattan’s community district 9 looks like this:
2419.1, 2343.3, 3032.5, 3056.0, 2800.7, 2699.9, 3322.3, 3674.0, 3112.2, 3345.1, 3194.0, 2709.5, 3662.6, 3282.9, 2577.9, 3179.1, 2460.9, 2377.1, 3077.6, 3332.8, 2233.5, 2722.9, 3087.8, 3353.2, 2992.9, 3212.3, 2590.1, 2978.8, 2486.8, 2348.2

The values are in tons per month, and there are 31 values for sample 1 and 30 values for sample 2. Hence, n_{1}=31 and n_{2}=30.

The Questions for Hypotheses

Let’s ask the following questions.

  1. Is the mean of the February refuse collected from Manhattan’s community district 9 different from August?
  2. Is the variance of the February refuse collected from Manhattan’s community district 9 different from August?
  3. Suppose we set 2500 tons as a benchmark for a low trash month: is the proportion of February refuse values below 2500 tons different from the proportion for August? In other words, is the low trash situation in February different than in August?

We can visualize their distributions. Execute the following lines to create a neat graphic of the boxplots of sample 1 and sample 2.

# Visualizing the distributions of the two samples
boxplot(cbind(feb_refuse_data,aug_refuse_data),horizontal=T, main="Refuse from Manhattan's Community District 9")
text(1500,1,"February Tonnage",font=2)
text(1500,2,"August Tonnage",font=2)

p_threshold = 2500 # tons of refuse
abline(v=p_threshold,lty=2,col="maroon")

You should now see your plot space changing to this.

Next, you can execute the following lines to compute the preliminaries required for our tests.

# Preliminaries
  # sample sizes
  n1 = length(feb_refuse_data)
  n2 = length(aug_refuse_data)
  
  # sample means
  x1bar = mean(feb_refuse_data)
  x2bar = mean(aug_refuse_data)

  # sample variances
  x1var = var(feb_refuse_data)
  x2var = var(aug_refuse_data)

  # sample proportions
  p1 = length(which(feb_refuse_data < p_threshold))/n1
  p2 = length(which(aug_refuse_data < p_threshold))/n2

Hypothesis Test on the Difference in Means

We learned four methods to verify the hypothesis on the difference in means:

  • The two-sample t-Test
  • Welch’s two-sample t-Test
  • Wilcoxon’s Rank-sum Test
  • The Bootstrap Test 

Let’s start with the two-sample t-Test. Assuming a two-sided alternative, the null and alternate hypotheses are:

H_{0}: \mu_{1} - \mu_{2}=0

H_{A}: \mu_{1} - \mu_{2} \neq 0

The alternate hypothesis indicates that substantial positive or negative differences will allow the rejection of the null hypothesis.

We know that for the two-sample t-Test, we assume that the population variances are equal, and the test-statistic is t_{0}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}, where s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2} is the pooled variance. t_{0} follows a T-distribution with n_{1}+n_{2}-2 degrees of freedom. We can compute the test-statistic and check how likely it is to see such a value in a T-distribution (null distribution) with so many degrees of freedom.

Execute the following lines in R.

pooled_var = ((n1-1)/(n1+n2-2))*x1var + ((n2-1)/(n1+n2-2))*x2var

t0 = (x1bar-x2bar)/sqrt(pooled_var*((1/n1)+(1/n2)))

df = n1+n2-2

pval = pt(t0,df=df)

print(pooled_var)
print(df)
print(t0)
print(pval)

We first computed the pooled variance s^{2} and, using it, the test-statistic t_{0}. Then, based on the degrees of freedom, we compute the p-value.

The p-value is 0.000736. For a two-sided test, we compare this with \frac{\alpha}{2}. If we opt for a rejection rate \alpha of 5%, then, since our p-value of 0.000736 is less than 0.025, we reject the null hypothesis that the means are equal.

All these steps can be implemented using a one-line command in R.

Try this:

t.test(feb_refuse_data,aug_refuse_data,alternative="two.sided",var.equal = TRUE)

You will get the following prompt in the console.

Two Sample t-test
data: feb_refuse_data and aug_refuse_data
t = -3.3366, df = 59, p-value = 0.001472
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -658.2825 -164.7239
sample estimates:
mean of x mean of y 
 2510.697  2922.200

As inputs to the function, we need to provide the data (sample 1 and sample 2). We should also indicate that we are evaluating a two-sided alternate hypothesis and that the population variances are equal. This prompts the function to execute the standard two-sample t-Test.

While the function provides additional information, all we need for our test are pointers from the first line. t=-3.3366 and df=59 are the test-statistic and the degrees of freedom that we computed earlier. The p-value looks different than what we computed. It is two times the value we estimated. The function provides a value that is double that of the original p-value to allow us to compare it to \alpha instead of \frac{\alpha}{2}. Either way, we know that we can reject the null hypothesis.
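You can verify this doubling yourself with the quantities already in the workspace. A one-line check:

# The two-sided p-value reported by t.test() is twice the one-tail probability
2*pt(-abs(t0),df=n1+n2-2)   # 0.001472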


Next, let’s implement Welch’s t-Test. When the population variances are not equal, i.e., when \sigma_{1}^{2} \neq \sigma_{2}^{2}, the test-statistic is t_{0}^{*}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}, and it follows an approximate T-distribution with f degrees of freedom which can be estimated using the Satterthwaite’s equation: f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1)}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)}}

Execute the following lines.

f = (((x1var/n1)+(x2var/n2))^2)/(((x1var/n1)^2/(n1-1))+((x2var/n2)^2/(n2-1)))

t0 = (x1bar-x2bar)/sqrt((x1var/n1)+(x2var/n2))

pval = pt(t0,df=f)

print(f)
print(t0)
print(pval)

The degrees of freedom from Satterthwaite’s equation is approximately 55.34, the test-statistic is -3.3528, and the p-value is 0.000724. Comparing this with 0.025, we can reject the null hypothesis.

Of course, all this can be done using one line where you indicate that the variances are not equal:

t.test(feb_refuse_data,aug_refuse_data,alternative="two.sided",var.equal = FALSE)

When you execute the above line, you will see the following on your console.

Welch Two Sample t-test
data: feb_refuse_data and aug_refuse_data
t = -3.3528, df = 55.338, p-value = 0.001448
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-657.4360 -165.5705
sample estimates:
mean of x mean of y
2510.697 2922.200

Did you notice that it runs Welch’s two-sample t-Test when we indicate that the variances are not equal? Like before, we can compare the p-value to 0.05 and reject the null hypothesis.

The mean of the February refuse collected from Manhattan’s community district 9 is different from August. Since the test-statistic is negative and the p-value is lower than the chosen level of rejection rate, we can conclude that the mean of February refuse tonnage is significantly lower than the mean of August refuse tonnage beyond a reasonable doubt.


How about Wilcoxon’s Rank-sum Test? The null hypothesis is based on the proposition that if x_{1} and x_{2} are samples from the same distribution, there will be an equal likelihood of one exceeding the other.

H_{0}: P(x_{1} > x_{2}) = 0.5

H_{A}: P(x_{1} > x_{2}) \neq 0.5

We know that this ranking and coming up with the rank-sum table is tedious for larger sample sizes. For larger sample sizes, the null distribution of the test-statistic W, which is the sum of the ranks associated with the variable of smaller sample size in the pooled and ordered data, tends to a normal distribution. This allows the calculation of the p-value by comparing W to a normal distribution. We can use the following command in R to implement this.

wilcox.test(feb_refuse_data,aug_refuse_data,alternative = "two.sided")

You will get the following message in the console when you execute the above line.

Wilcoxon rank sum test with continuity correction
data: feb_refuse_data and aug_refuse_data
W = 248, p-value = 0.001788
alternative hypothesis: true location shift is not equal to 0

Since the p-value of 0.001788 is lower than 0.05, the chosen level of rejection, we reject the null hypothesis that x_{1} and x_{2} are samples from the same distribution.


Finally, let’s take a look at the bootstrap method. The null hypothesis is that there is no difference between the means. 

H_{0}: P(\bar{x}_{Feb}>\bar{x}_{Aug}) = 0.5

H_{A}: P(\bar{x}_{Feb}>\bar{x}_{Aug}) \neq 0.5

We first create a bootstrap replicate of X and Y by randomly drawing with replacement n_{1} values from X and n_{2} values from Y.

For each bootstrap replicate from X and Y, we compute the statistics \bar{x}_{Feb} and \bar{x}_{Aug} and check whether \bar{x}_{Feb}>\bar{x}_{Aug}. If yes, we register S_{i}=1. If not, we register S_{i}=0.

We repeat this process of creating bootstrap replicates of X and Y, computing the statistics \bar{x}_{Feb} and \bar{x}_{Aug}, and verifying whether \bar{x}_{Feb}>\bar{x}_{Aug} and registering S_{i} \in (0,1) a large number of times, say N=10,000.

The proportion of times S_{i} = 1 in a set of N bootstrap-replicated statistics is the p-value.

Execute the following lines in R to implement these steps.

#4. Bootstrap
  N = 10000
  null_mean = matrix(0,nrow=N,ncol=1)
  null_mean_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
     null_mean_ratio[i] = mean(xboot)/mean(yboot) 
     if(mean(xboot)>mean(yboot)){null_mean[i]=1} 
   }
 
 pvalue_mean = sum(null_mean)/N
 hist(null_mean_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="Xbar/Ybar",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(0.95,1000,paste("p-value=",pvalue_mean),col="red")

Your RStudio interface will then look like this:

A vertical bar will show up at a ratio of 1 to indicate that the area beyond this value is the proportion of times S_{i} = 1 in 10,000 bootstrap-replicated means.

The p-value is close to 0, less than 0.025 for a 5% rejection rate. Since the mean of the February tonnage is less than the mean of the August tonnage in more than 97.5% of the bootstrap replicates, there is sufficient evidence that they are not equal. So we reject the null hypothesis.

In summary, while the standard t-Test, Welch's t-Test, and Wilcoxon's Rank-sum Test can be implemented in R using single-line commands,

t.test(x1,x2,alternative="two.sided",var.equal = TRUE)
t.test(x1,x2,alternative="two.sided",var.equal = FALSE)
wilcox.test(x1,x2,alternative = "two.sided")

the bootstrap test needs a few lines of code.

Hypothesis Test on the Equality of Variances

We learned two methods to verify the hypothesis on the equality of variances; using F-distribution and using the bootstrap method.

For the F-distribution method, the null and the alternate hypothesis are 

H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2}

H_{A}:\sigma_{1}^{2} \neq \sigma_{2}^{2}

The test-statistic is the ratio of the sample variances, f_{0}=\frac{s_{1}^{2}}{s_{2}^{2}}

We evaluate the hypothesis based on where this test-statistic lies in the null distribution or how likely it is to find a value as large as this test-statistic f_{0} in the null distribution.

Execute the following lines.

# 1. F-Test
  f0 = x1var/x2var
  
  df_numerator = n1-1
  
  df_denominator = n2-1
  
  pval = 1-pf(f0,df1=df_numerator,df2=df_denominator)
  
  print(f0)
  print(df_numerator)
  print(df_denominator)
  print(pval)

The test-statistic, i.e., the ratio of the sample variances, is 1.814.

The p-value, which, in this case, is the probability of finding a value greater than the test-statistic (P(F > f_{0})), is 0.0562. When compared to a one-sided rejection rate (\frac{\alpha}{2}) of 0.025, we cannot reject the null hypothesis.

We do have a one-liner command.

var.test(feb_refuse_data,aug_refuse_data,alternative = "two.sided")

You will find the following message when you execute the var.test() command.

F test to compare two variances
data: feb_refuse_data and aug_refuse_data
F = 1.814, num df = 30, denom df = 29, p-value = 0.1125
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.8669621 3.7778616
sample estimates:
ratio of variances
1.813959

As usual, with the standard R function, we need to compare the p-value (which is two times what we computed based on the right-tail probability) with \alpha=0.05. Since 0.1125 > 0.05, we cannot reject the null hypothesis that the variances are equal.
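As a quick cross-check with the variables already defined, the var.test() p-value is twice the right-tail probability we computed by hand (since f_{0} > 1):

# var.test's two-sided p-value: twice the right-tail probability when f0 > 1
2*(1-pf(f0,df1=n1-1,df2=n2-1))   # 0.1125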


Can we come to the same conclusion using the bootstrap test? Let’s try. Execute the following lines in R.

# 2. Bootstrap
 N = 10000
   null_var = matrix(0,nrow=N,ncol=1)
   null_var_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
     null_var_ratio[i] = var(xboot)/var(yboot) 
     if(var(xboot)>var(yboot)){null_var[i]=1} 
   }
 
 pvalue_var = sum(null_var)/N
 hist(null_var_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="XVar/YVar",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(2,500,paste("p-value=",pvalue_var),col="red")

Your RStudio interface should look like this if you correctly coded up and executed the lines. The code provides a way to visualize the null distribution.

The p-value is 0.7945; 7945 out of the 10,000 bootstrap replicates had s^{2}_{Feb}>s^{2}_{Aug}. For a 5% rate of error (\alpha=5\%), we cannot reject the null hypothesis since the p-value is neither less than 0.025 nor greater than 0.975. The evidence that the February tonnage variance is less than the August tonnage variance (20.55% of the times) is not sufficient to reject equality.

In summary, use var.test(x1,x2,alternative = "two.sided") if you want to verify based on F-distribution, or code up a few lines if you want to verify using the bootstrap method.

Hypothesis Test on the Difference in Proportions

For the hypothesis test on the difference in proportions, we can employ Fisher’s Exact Test, use the normal approximation under the large-sample assumption, or use the bootstrap method.

For Fisher’s Exact Test, the test-statistic follows a hypergeometric distribution when H_{0} is true. We could assume that the number of successes is fixed at t=x_{1}+x_{2}, and, for a fixed value of t, we reject H_{0}:p_{1}=p_{2} for the alternate hypothesis H_{A}:p_{1}>p_{2} if there are more successes in random variable X_{1} compared to X_{2}.

The p-value can be derived under the assumption that the number of successes X=k in the first sample X_{1} has a hypergeometric distribution when H_{0} is true and conditional on a total number of t successes that can come from any of the two random variables X_{1} and X_{2}.

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}

Execute the following lines to implement this process, plot the null distribution and compute the p-value.

#Fisher's Exact Test
  x1 = length(which(feb_refuse_data < p_threshold))
  p1 = x1/n1
  
  x2 = length(which(aug_refuse_data < p_threshold))
  p2 = x2/n2
  
  N = n1+n2
  t = x1+x2
  
  k = seq(from=0,to=n1,by=1)
  p = k
  for(i in 1:length(k)) 
   {
    p[i] = (choose(t,k[i])*choose((N-t),(n1-k[i])))/choose(N,n1)
   }

  plot(k,p,type="h",xlab="Number of successes in X1",ylab="P(X=k)",font=2,font.lab=2)
  points(k,p,type="o",lty=2,col="grey50")
  points(k[13:length(k)],p[13:length(k)],type="o",col="red",lwd=2)
  points(k[13:length(k)],p[13:length(k)],type="h",col="red",lwd=2)
  pvalue = sum(p[13:length(k)])
  print(pvalue)

Your RStudio should look like this once you run these lines.

The p-value is 0.1539. Since it is greater than the rejection rate, we cannot reject the null hypothesis that the proportions are equal.
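Incidentally, R also ships hypergeometric distribution functions, dhyper() and phyper(), which can cross-check the manual loop above. A minimal verification sketch using the variables already defined:

# Cross-check using R's built-in hypergeometric functions
p_check = dhyper(k,t,(N-t),n1)           # same probabilities as the loop
pvalue_check = 1-phyper((x1-1),t,(N-t),n1) # P(X >= x1)
print(pvalue_check)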

We can also leverage an in-built R function that does the same. Try this.

fisher_data = cbind(c(x1,x2),c((n1-x1),(n2-x2)))
fisher.test(fisher_data,alternative="greater")

In the first line, we create a 2×2 contingency table to indicate the number of successes and failures in each sample. Sample 1 has 12 successes, i.e., 12 times the tonnage in February was less than 2500 tons. Sample 2 has seven successes. So the contingency table looks like this:

    [,1]  [,2]
[1,] 12    19
[2,]  7    23

Then, the fisher.test() function implements the hypergeometric distribution calculations.

You will find the following prompt in your console.

Fisher’s Exact Test for Count Data
data: fisher_data
p-value = 0.1539
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
0.712854 Inf
sample estimates:
odds ratio
2.050285

We get the same p-value as before — we cannot reject the null hypothesis.


Under the normal approximation for large sample sizes, the test-statistic z = \frac{\hat{p_{1}}-\hat{p_{2}}}{\sqrt{p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})}} \sim N(0,1). p is the pooled proportion, which can be computed as p = \frac{x_{1}+x_{2}}{n_{1}+n_{2}}.

We reject the null hypothesis when the p-value P(Z \ge z) is less than the rate of rejection \alpha.

Execute the following lines to set this up in R.

# Z-approximation
  p = (x1+x2)/(n1+n2)
  z = (p1-p2)/sqrt(p*(1-p)*((1/n1)+(1/n2)))
  pval = 1-pnorm(z)

The p-value is 0.0974. Since it is greater than 0.05, we cannot reject the null hypothesis.
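If you prefer a built-in routine, R’s prop.test() should reproduce this calculation when the continuity correction is turned off; note that it reports a chi-squared statistic, which is the square of our z. A sketch:

# Built-in cross-check: with correct=FALSE, prop.test's X-squared equals z^2
# and its one-sided p-value matches 1-pnorm(z)
prop.test(c(x1,x2),c(n1,n2),alternative="greater",correct=FALSE)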


Finally, we can use the bootstrap method as follows.

# Bootstrap
 N = 10000
 null_prop = matrix(0,nrow=N,ncol=1)
 null_prop_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
 
     p1boot = length(which(xboot < p_threshold))/n1 
     p2boot = length(which(yboot < p_threshold))/n2 

     null_prop_ratio[i] = p1boot/p2boot 
    if(p1boot>p2boot){null_prop[i]=1} 
  }
 
 pvalue_prop = sum(null_prop)/N
 hist(null_prop_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="P1/P2",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(2,250,paste("p-value=",pvalue_prop),col="red")

We evaluated whether the low trash situation in February is different than August.

H_{0}: P(p_{Feb}>p_{Aug}) = 0.5

H_{A}: P(p_{Feb}>p_{Aug}) > 0.5

The p-value is 0.8941. 8941 out of the 10,000 bootstrap replicates had p_{Feb}>p_{Aug}. For a 5% rate of error (\alpha=5\%), we cannot reject the null hypothesis since the p-value is not greater than 0.95. The evidence (89.41% of the times) that the low trash situation in February is greater than August is insufficient to reject equality.

Your RStudio interface will look like this.

In summary, 
use fisher.test(contingency_table,alternative="greater") 
if you want to verify based on Fisher's Exact Test; 

use p = (x1+x2)/(n1+n2); 
    z = (p1-p2)/sqrt(p*(1-p)*((1/n1)+(1/n2))); 
    pval = 1-pnorm(z) 
if you want to verify based on the normal approximation; 

code up a few lines if you want to verify using the bootstrap method.

  1. Is the mean of the February refuse collected from Manhattan’s community district 9 different from August?
    • Reject the null hypothesis. They are different.
  2. Is the variance of the February refuse collected from Manhattan’s community district 9 different from August?
    • Cannot reject the null hypothesis. They are probably the same.
  3. Suppose we set 2500 tons as a benchmark for a low trash month: is the proportion of February refuse values below 2500 tons different from the proportion for August? In other words, is the low trash situation in February different than in August?
    • Cannot reject the null hypothesis. They are probably the same.

Please take the next two weeks to digest all the two-sample hypothesis tests, including how to execute them in R.

Here is the full code for today’s lesson.

In a little over four years, we reached 99 lessons 😎

I will have something different for the 100th. Stay tuned …

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 98 – The Two-Sample Hypothesis Tests using the Bootstrap

Two-Sample Hypothesis Tests – Part VII

H_{0}: P(\theta_{x}>\theta_{y}) = 0.5

H_{A}: P(\theta_{x}>\theta_{y}) > 0.5

H_{A}: P(\theta_{x}>\theta_{y}) < 0.5

H_{A}: P(\theta_{x}>\theta_{y}) \neq 0.5

These days, a peek out of the window is greeted by chilling rain or warm snow. On days when it is not raining or snowing, there is biting cold. So we gaze at our bicycles, waiting for that pleasant month of April when we can joyfully bike — to work, or for pleasure.

Speaking of bikes, since I have nothing much to do today except watch the snow, I decided to explore some data from our favorite “Open Data for All New Yorkers” page.

Interestingly, I found data on the bicycle counts for East River Bridges. New York City DOT keeps track of the daily total of bike counts on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge.

I could find the data for April to October during 2016 and 2017. Here is how the data for April 2017 looks.

They highlight all non-holiday weekdays with no precipitation in yellow.

Being a frequent biker on the Manhattan Bridge, my curiosity got kindled. I wanted to verify how different the total bike counts on the Manhattan Bridge are from the Williamsburg Bridge.

At the same time, I also wanted to share the benefits of the bootstrap method for two-sample hypothesis tests.

To keep it simple and easy for you to follow the bootstrap method’s logical development, I will test how different the total bike counts data on Manhattan Bridge are from that of the Williamsburg Bridge during all the non-holiday weekdays with no precipitation.

Here is the data of the total bike counts on Manhattan Bridge during all the non-holiday weekdays with no precipitation in April of 2017 — essentially, the data from the yellow-highlighted rows in the table for Manhattan Bridge.

5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774

And the data of the total bike counts on Williamsburg Bridge during all the non-holiday weekdays with no precipitation in April of 2017.

5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196

Their distributions look like this.

We are looking at the boxplots that present a nice visual of the data range and its percentiles. And we can compare one sample to another. Remember Lesson 14? There is a vertical line at 6352 bikes, the maximum number of bikes on Manhattan Bridge during weekends, holidays, or rainy days — i.e., the non-highlighted days.

I want answers to the following questions.

  • Is the mean of the total bike counts on Manhattan Bridge different from that on Williamsburg Bridge (\bar{x}_{M}=\bar{x}_{W})?
  • Is the median of the total bike counts on Manhattan Bridge different from that on Williamsburg Bridge (\tilde{x}_{M}=\tilde{x}_{W})?
  • Is the variance of the total bike counts on Manhattan Bridge different from that on Williamsburg Bridge (s^{2}_{M}=s^{2}_{W})?
  • Is the interquartile range of the total bike counts on Manhattan Bridge different from that on Williamsburg Bridge (IQR_{M}=IQR_{W})?
  • Is the proportion of the total bike counts on Manhattan Bridge less than 6352 different from that on Williamsburg Bridge (P(M<6352)=P(W<6352), or p_{M}=p_{W})?
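Before testing, it helps to compute these five statistics from the two samples. Here is a quick sketch in R; the vector names manhattan and williamsburg are my own choices:

# Total bike counts on non-holiday, no-precipitation weekdays, April 2017
manhattan = c(5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774)
williamsburg = c(5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196)

mean(manhattan); mean(williamsburg)       # sample means
median(manhattan); median(williamsburg)   # sample medians
var(manhattan); var(williamsburg)         # sample variances
IQR(manhattan); IQR(williamsburg)         # interquartile ranges
mean(manhattan < 6352); mean(williamsburg < 6352) # proportions below 6352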

What do we know so far? 

We know how to test the difference in means using the t-Test under the proposition that the population variances are equal (Lesson 94) or using Welch’s t-Test when we cannot assume equality of population variances (Lesson 95). We also know how to do this using Wilcoxon’s Rank-sum Test that uses the ranking method to approximate the significance of the differences in means (Lesson 96).

We know how to test the equality of variances using F-distribution (Lesson 97).

We know how to test the difference in proportions using either Fisher’s Exact test (Lesson 92) or using the normal distribution as the null distribution under the large-sample approximation (Lesson 93).

In all these tests, we made critical assumptions on the limiting distributions of the test-statistics.

  • What is the limiting distribution of the test-statistic that computes the difference in medians?
  • What is the limiting distribution of the test-statistic that compares interquartile ranges of two populations?
  • What if we do not want to make any assumptions on data distributions or the limiting forms of the test-statistics?

Enter the Bootstrap

I would urge you to go back to Lesson 79 to get a quick refresher on the bootstrap, and Lesson 90 to recollect how we used it for the one-sample hypothesis tests.

The idea of the bootstrap is that we can generate replicates of the original sample to approximate the probability distribution function of the population. Assuming that each data value in the sample is equally likely with a probability of 1/n, we can randomly draw n values with replacement. By putting a probability of 1/n on each data point, we use the discrete empirical distribution \hat{f} as an approximation of the population distribution f.

Take the data for Manhattan Bridge.
5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these eight data points is 1/8, we can randomly draw eight numbers from these eight values — with replacement.

It is like playing the game of Bingo where the chips are these eight numbers. Each time we get a number, we put it back and roll it again until we draw eight numbers.

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774), some may appear more than one time, and some may not show up at all in a random sample.

Here is one such bootstrap replicate.
6359, 6359, 6359, 6052, 6774, 6359, 5276, 6359

The value 6359 appeared five times. Some values like 7247, 5054, 6691, and 5311 did not appear at all in this replicate.

Here is another replicate.
6359, 5276, 5276, 5276, 7247, 5311, 6052, 5311

Such bootstrap replicates are representations of the empirical distribution \hat{f}, i.e., the proportion of times each value in the data sample occurs. We can generate all the information contained in the true distribution by creating \hat{f}, the empirical distribution.
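If you want to generate such replicates yourself, a single call to R’s sample() function with replace = TRUE does the job. A minimal sketch using the manhattan vector defined earlier (each run gives a different replicate):

# One bootstrap replicate of the Manhattan Bridge sample:
# draw eight values with replacement, each carrying a probability of 1/8
xboot = sample(manhattan, replace=T)
print(xboot)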

Using the Bootstrap for Two-Sample Hypothesis Tests

Since each bootstrap replicate is a possible representation of the population, we can compute the relevant test-statistics from this bootstrap sample. By repeating this, we can have many simulated values of the test-statistics that form the null distribution to test the hypothesis. There is no need to make any assumptions on the distributional nature of the data or the limiting distribution for the test-statistic. As long as we can compute a test-statistic from the bootstrap sample, we can test the hypothesis on any statistic — mean, median, variance, interquartile range, proportion, etc.

Let’s now use the bootstrap method for two-sample hypothesis tests. Suppose there are two random variables, X and Y, and the statistics computed from them are \theta_{x} and \theta_{y}. We may have a sample of n_{1} values representing X and a sample of n_{2} values to represent Y.

\theta_{x},\theta_{y} can be mean, median, variance, proportion, etc. Any computable statistic from the original data is of the form \theta_{x},\theta_{y}.

The null hypothesis is that there is no difference between the statistics of X and Y.

H_{0}: P(\theta_{x}>\theta_{y}) = 0.5

The alternate hypothesis is

H_{A}: P(\theta_{x}>\theta_{y}) > 0.5

or

H_{A}: P(\theta_{x}>\theta_{y}) < 0.5

or

H_{A}: P(\theta_{x}>\theta_{y}) \neq 0.5

We first create a bootstrap replicate of X and Y by randomly drawing with replacement n_{1} values from X and n_{2} values from Y.

For each bootstrap replicate i from X and Y, we compute the statistics \theta_{x} and \theta_{y} and check whether \theta_{x}>\theta_{y}. If yes, we register S_{i}=1. If not, we register S_{i}=0.

For example, one bootstrap replicate for X (Manhattan Bridge) and Y (Williamsburg Bridge) may look like this:

xboot: 6691 5311 6774 5311 6359 5311 5311 6052
yboot: 6775 6881 7341 7196 6775 7341 6775 7196

The means of this bootstrap replicate for X and Y are 5890 and 7035. Since \bar{x}^{X}_{boot}<\bar{x}^{Y}_{boot}, we register S_{i}=0.

Another bootstrap replicate for X and Y may look like this:

xboot: 6774 6359 6359 6359 6052 5054 6052 6691
yboot: 6775 7196 7196 6026 6881 7341 6881 5711

The means of this bootstrap replicate for X and Y are 6212.5 and 6750.875. Since \bar{x}^{X}_{boot}<\bar{x}^{Y}_{boot}, we register S_{i}=0.

We repeat this process of creating bootstrap replicates of X and Y, computing the statistics \theta_{x} and \theta_{y}, and verifying whether \theta_{x}>\theta_{y} and registering S_{i} \in (0,1) a large number of times, say N=10,000.

The proportion of times S_{i} = 1 in a set of N bootstrap-replicated statistics is the p-value.

p-value=\frac{1}{N}\sum_{i=1}^{i=N}S_{i}

Based on the p-value, we can use the rule of rejection at the selected rate of rejection \alpha.

For a one-sided (left-tail) alternate hypothesis, we reject the null hypothesis if p-value < \alpha; for a one-sided (right-tail) alternate hypothesis, if p-value > 1-\alpha.

For a two-sided alternate hypothesis, we reject the null hypothesis if p-value < \frac{\alpha}{2} or p-value > 1-\frac{\alpha}{2}.
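Putting the whole procedure together, here is one way to wrap these steps into a small R function. This is a sketch of my own, not code from the lesson; the function name bootstrap_test and its arguments are illustrative choices:

# Two-sample bootstrap test for any computable statistic
# x, y: the original samples; statistic: mean, median, var, IQR, etc.
bootstrap_test = function(x, y, statistic, N=10000)
{
  S = numeric(N)   # S[i] = 1 when the bootstrap statistic of x exceeds that of y
  for(i in 1:N)
  {
    xboot = sample(x, replace=T)   # n1 values drawn with replacement
    yboot = sample(y, replace=T)   # n2 values drawn with replacement
    if(statistic(xboot) > statistic(yboot)){ S[i] = 1 }
  }
  sum(S)/N   # the p-value
}

# Example: bootstrap_test(manhattan, williamsburg, mean)

Passing a different function as the statistic argument (median, var, IQR, or a proportion) reuses the same machinery for each of the five questions below.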

Manhattan Bridge vs. Williamsburg Bridge

Is the mean of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(\bar{x}_{M}>\bar{x}_{W}) = 0.5

H_{A}: P(\bar{x}_{M}> \bar{x}_{W}) \neq 0.5

Let’s take a two-sided alternate hypothesis.

Here is the null distribution of \frac{\bar{x}_{M}}{\bar{x}_{W}} for N = 10,000.

A vertical bar is shown at a ratio of 1 to indicate that the area beyond this value is the proportion of times S_{i} = 1 in a set of 10,000 bootstrap-replicated means.

The p-value is 0.0466. 466 out of the 10,000 bootstrap replicates had \bar{x}_{M}>\bar{x}_{W}. For a 10% rate of error (\alpha=10\%), we reject the null hypothesis since the p-value is less than 0.05. Since more than 95% of the times, the mean of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge; there is sufficient evidence that they are not equal. So we reject the null hypothesis.

Can we reject the null hypothesis if we select a 5% rate of error?



Is the median of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(\tilde{x}_{M}>\tilde{x}_{W}) = 0.5

H_{A}: P(\tilde{x}_{M}> \tilde{x}_{W}) \neq 0.5

The null distribution of \frac{\tilde{x}_{M}}{\tilde{x}_{W}} for N = 10,000.

The p-value is 0.1549. 1549 out of the 10,000 bootstrap replicates had \tilde{x}_{M}>\tilde{x}_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value is greater than 0.05. The evidence (84.51% of the times) that the median of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.



Is the variance of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(s^{2}_{M}>s^{2}_{W}) = 0.5

H_{A}: P(s^{2}_{M}>s^{2}_{W}) \neq 0.5

The null distribution of \sqrt{\frac{s^{2}_{M}}{s^{2}_{W}}} for N = 10,000. We are looking at the null distribution of the ratio of the standard deviations.

The p-value is 0.4839. 4839 out of the 10,000 bootstrap replicates had s^{2}_{M}>s^{2}_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value is greater than 0.05. The evidence (51.61% of the times) that the variance of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.



Is the interquartile range of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(IQR_{M}>IQR_{W}) = 0.5

H_{A}: P(IQR_{M}>IQR_{W}) \neq 0.5

The null distribution of \frac{IQR_{M}}{IQR_{W}} for N = 10,000. It does not resemble any known distribution, but that does not restrain us since the bootstrap-based hypothesis test is distribution-free.

The p-value is 0.5453. 5453 out of the 10,000 bootstrap replicates had IQR_{M}>IQR_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value is greater than 0.05. The evidence (45.47% of the times) that the interquartile range of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.



Finally, is the proportion of the total bike counts on Manhattan Bridge less than 6352, different from that on Williamsburg Bridge?

H_{0}: P(p_{M}>p_{W}) = 0.5

H_{A}: P(p_{M}>p_{W}) \neq 0.5

The null distribution of \frac{p_{M}}{p_{W}} for N = 10,000.

The p-value is 0.5991. 5991 out of the 10,000 bootstrap replicates had p_{M}>p_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value is greater than 0.05. The evidence (40.09% of the times) that the proportion of the total bike counts less than 6352 on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.



Can you see the bootstrap concept’s flexibility and how widely we can apply it for hypothesis testing? Just remember that the underlying assumption is that the data are independent. 

To summarize,

Repeatedly sample with replacement from the original samples of X and Y -- N times.
Each time, draw a sample of size n_{1} from X and a sample of size n_{2} from Y.
Compute the desired statistic (mean, median, skew, etc.) from each bootstrap sample.
The null hypothesis P(\theta_{x}>\theta_{y})=0.5 can now be tested as follows:
  • S_{i}=1 if \theta_{x}>\theta_{y}, else, S_{i}=0
  • p-value=\frac{1}{N}\sum_{i=1}^{N}S_{i} (average over all N bootstrap-replicated statistics)
  • If p-value < \frac{\alpha}{2} or p-value > 1-\frac{\alpha}{2}, reject the null hypothesis for a two-sided hypothesis test at a selected rejection rate of \alpha.
  • If p-value < \alpha, reject the null hypothesis for a left-sided hypothesis test at a selected rejection rate of \alpha.
  • If p-value > 1-\alpha, reject the null hypothesis for a right-sided hypothesis test at a selected rejection rate of \alpha.

After seven lessons, we are now equipped with all the theory of the two-sample hypothesis tests. It is time to put them to practice. Dust off your programming machines and get set.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 97 – The Two-Sample Hypothesis Test – Part VI

On the Equality of Variances

using F-distribution

H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2}

H_{A}:\sigma_{1}^{2}>\sigma_{2}^{2}

H_{A}:\sigma_{1}^{2}<\sigma_{2}^{2}

H_{A}:\sigma_{1}^{2} \neq \sigma_{2}^{2}

A good friend is a wall street day trader who has been trading BTC. Lately, he has observed higher volatility in the daily gains/losses, which he associates with celebrities and big banks’ newfound interest. Perhaps they are late to the party, or they quietly got in and are pumping it up.

We wanted to conduct a fun exercise to verify whether the recent period’s variability is more significant than that of a preceding period. We downloaded BTC price data for the past month and computed daily gains or losses as the difference between closing and opening prices. Here is how the gains/losses look over the past month.

We highlight two samples, the gains/losses for the most recent week in red and the gains/losses for the week preceding it in blue.

The data from the first sample, i.e., BTC gains/losses for the most recent week, is:
$256, $7991, -$900, -$1357, $2496, $541, -$1094

The variability measured using standard-deviation of the seven values is $3299.

The data from the second sample, i.e., BTC gains/losses for the week preceding the current week, is:
-$429, $323, $2774, $1024, -$393, $986, -$943

Its variability is $1374.

At first glance, it seems evident that the variability in sample 1 is greater than that in sample 2.

But, we know that glances and feelings are not enough. 

We need to establish evidence beyond a reasonable doubt. 

So my friend and I set up a hypothesis testing framework on the equality of variances.

The null hypothesis is that the population variances are equal.

H_{0}: \sigma_{1}^{2}=\sigma_{2}^{2}

To counter this, we write an alternate hypothesis that

H_{A}:\sigma_{1}^{2}>\sigma_{2}^{2}

Then, suppose we know the test-statistic and the null distribution, i.e., the limiting distribution that the test-statistic converges to. In that case, we can compute the p-value and reject H_{0} if it is less than our chosen rate of rejection, i.e., if p-value < \alpha. You and I know the whole process. 

Do you know what the test-statistic is?

.

.

.

Refer to Lesson 73. Can we say that \frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom?

So we can write,

\frac{(n-1)s^{2}}{\sigma^{2}}=\chi^{2}

For the two samples which have n_{1} and n_{2} values, we can write the following two equations.

\frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}}=\chi_{1}^{2}

\frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}}=\chi_{2}^{2}

If we take their ratio, we get

\frac{\frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}}}{\frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}}} = \frac{\chi_{1}^{2}}{\chi_{2}^{2}}

Let’s move the degrees of freedom terms to the right-hand side.

\frac{s_{1}^{2}/\sigma_{1}^{2}}{s_{2}^{2}/\sigma_{2}^{2}} = \frac{\chi_{1}^{2}/(n_{1}-1)}{\chi_{2}^{2}/(n_{2}-1)}

The ratio that we see on the right-hand side is the ratio of two Chi-square distributions divided by their respective degrees of freedom.

This ratio, \frac{\chi_{1}^{2}/(n_{1}-1)}{\chi_{2}^{2}/(n_{2}-1)} happens to be a special distribution called the F-distribution as presented by George W. Snedecor.

Snedecor named it the F-distribution after Sir R.A. Fisher, who first introduced the ratio of sample variances for evaluating whether they are significantly different. Fisher introduced it as the general z-distribution.

The probability distribution function of an F-distribution is:

f_{F}(f) = \frac{(\frac{n_{1}+n_{2}}{2}-1)!(\frac{n_{1}}{n_{2}})^{(\frac{n_{1}}{2})}f^{(\frac{n_{1}}{2}-1)}}{(\frac{n_{1}}{2}-1)!(\frac{n_{2}}{2}-1)!(1+\frac{n_{1}}{n_{2}}f)^{\frac{n_{1}+n_{2}}{2}}}; \; 0 < f < \infty

In a later lesson, we will derive this from the first principles. We need to know some concepts on joint distributions to be able to do that. So acknowledge it for now, but look out for its derivation in one of the future lessons.

You can say that the F-distribution is a non-negative distribution since it is the ratio of the two non-negative Chi-square distributions. It is denoted as F_{df1,df2}, i.e., an F-distribution with df1 degrees of freedom in the numerator (corresponding to \chi_{1}^{2}) and df2 degrees of freedom in the denominator (corresponding to \chi_{2}^{2}).

Visually, it looks similar to the Chi-square distribution. See two examples here.

The distribution shown in black has six degrees of freedom, each in the numerator and the denominator. The distribution shown in red has six degrees of freedom in the numerator and 12 degrees of freedom in the denominator. The one with lower degrees of freedom has a fatter tail and more variance — to factor in more uncertainty due to smaller sample sizes.

At any rate, we can now see that

\frac{s_{1}^{2}/\sigma_{1}^{2}}{s_{2}^{2}/\sigma_{2}^{2}} = F_{n_{1}-1,n_{2}-1}

Now, if the null hypothesis is true, we have \sigma_{1}^{2}=\sigma_{2}^{2}; which means, \frac{s_{1}^{2}}{s_{2}^{2}} = F_{n_{1}-1,n_{2}-1}

So the test-statistic is the ratio of the sample variances.

f_{0}=\frac{s_{1}^{2}}{s_{2}^{2}}

We evaluate the hypothesis based on where this test-statistic lies in the null distribution or how likely it is to find a value as large as this test-statistic f_{0} in the null distribution.

For our BTC exercise, the alternate hypothesis H_{A} is \sigma_{1}^{2} > \sigma_{2}^{2}. So, we check how far away from unity is our test-statistic f_{0}, the ratio of the sample variances.

Suppose it lies in the critical rejection region based on our choice of \alpha (the rate of rejection); in that case, the p-value will be small, allowing us to reject the null hypothesis that the population variances are equal.

Let’s try it.

The sample variance, s_{1}^{2}, for the first sample, i.e., BTC gains/losses in the most recent week, is

s_{1}^{2} = \frac{1}{n_{1}-1}\sum_{i=1}^{i=n_{1}}(x_{i}-\bar{x})^{2}=10,884,592

The sample variance for the second sample (BTC gains/losses for the week preceding the current week) is

s_{2}^{2} = \frac{1}{n_{2}-1}\sum_{i=1}^{i=n_{2}}(x_{i}-\bar{x})^{2}=1,887,768

Their ratio, i.e., the test-statistic f_{0} is 5.77.

Since each sample has seven values, one value per day, the null distribution assuming that H_{0} is true is an F-distribution with six numerator and denominator degrees of freedom (F_{6,6}).

It looks like this.

As we can see, the test-statistic (f_{0}=5.77) is in the rejection region if we opt for a rejection rate of \alpha=5\%. The critical value at the 5% level is f_{critical}=4.28. Any value beyond this critical value indicates a deviation far enough from unity to reject the null hypothesis that the variances are equal.

Most statistics textbooks provide a tabulated version of f_{critical} for various \alpha values. Here is an example of how such a table looks for \alpha=5\%. It is a snapshot from Snedecor’s book entitled “Statistical Methods,” published in 1937-1938. Rows correspond to numerator degrees of freedom, and columns correspond to denominator degrees of freedom.

We can also compute the p-value, which, in this case, is the probability of finding a value greater than the test-statistic (P(F > f_{0})), from standard computer-based statistics programs. 

In our case, the p-value is 0.025, lower than the chosen rate of rejection of 5%. There is a less than 5 in 100 chance of drawing a sample having a larger F value from the null distribution at hand.
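If you want to reproduce these numbers yourself, a short sketch in R using the sample variances quoted above will do; pf() and qf() are R’s F-distribution functions:

# Right-tail p-value of the observed variance ratio in an F(6,6) distribution
f0 = 10884592/1887768        # ratio of the sample variances, about 5.77
1 - pf(f0, df1=6, df2=6)     # about 0.025
qf(0.95, df1=6, df2=6)       # the 5% critical value, about 4.28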

So we reject the null hypothesis that \sigma_{1}^{2}=\sigma_{2}^{2}. The population variances are unequal. The variability of the recent period is greater than the variability of the preceding period.

Perhaps the small, most recent sample is not enough to draw conclusions about the changes? Go ahead and test for yourself using more data. CoinDesk has data that goes back as far as October 2013, when the price of BTC was $123.

Will the music stop? Does the party come to an end? How much bitcoin do you have?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 96 – Wilcoxon’s Rank-sum Test

Part V of the Two-Sample Hypothesis Tests

H_{0}: P(x_{1}>x_{2})=0.5

H_{A}: P(x_{1}>x_{2})>0.5

H_{A}: P(x_{1}>x_{2})<0.5

H_{A}: P(x_{1}>x_{2}) \neq 0.5

United Airlines is the best flight service out there. They always arrived on time or earlier when I traveled with them.

How can you say that? Did you compare it to any other airline? I can claim that Jet Blue is the best flight service. I traveled with them five times; the flight arrived twice ahead of schedule, and the most lengthy delay was only 15 minutes.

I seem to be a clear winner here. My delays are -4 and 0 minutes. 😎

My delays are -1, 3, -2, 15 and 8 minutes. Can you still claim you are the winner?

You seem to be bathing in hypothesis tests, and you know the idea of “beyond statistical doubt.”

Can you prove that United flights tend to arrive earlier than Jet Blue flights? Or rather, can you disprove that there is no significant difference between the delays of United flights and Jet Blue flights?

I can use the standard version of the two-sample t-Test or the version proposed by Welch to verify whether the difference in mean delays is significant or not.

Okay. But do you think it is reasonable to compute the sample statistics of mean and variance on such a small sample? At least I have a sample of five. You only have two.

😕 😕 😕

We could employ the rank-sum test developed by Frank Wilcoxon in 1945.

Go on…

Wilcoxon, in his paper titled “Individual Comparisons by Ranking Methods,” indicated the possibility of using ranking methods to approximate the significance of the differences in means. By ranking methods, he meant to replace the actual numerical data with their ranks.

How exactly is it done?

Let’s employ our small data samples and understand the logic and the procedure of the Wilcoxon’s Rank-sum Test.

But we need to start with defining the null and the alternate hypotheses. Take the delays of United flights to be x_{1} and Jet Blue’s delays to be x_{2}. As the null distribution, we assume that x_{1} and x_{2} are samples from the same distribution.

Then, P(x_{1}>x_{2})=0.5 since they will be intermingled. In other words, if x_{1} and x_{2} are samples from the same distribution, there will be an equal likelihood of one exceeding the other.

Let me reason out the alternate hypothesis. If the null hypothesis is not true, then we could have P(x_{1}>x_{2})>0.5 if the delays of United flights are generally greater than Jet Blue. More observations of United will be towards the right of the observations from Jet Blue if x_{1} is from a distribution that is generally greater than x_{2}.

We could have P(x_{1}>x_{2})<0.5 if the delays of United flights are generally smaller than Jet Blue. More observations of United will be towards the left of (or below) the observations of Jet Blue if x_{1} is from a distribution that is generally lower than x_{2}.

Or, we could say P(x_{1}>x_{2}) \neq 0.5 to establish the fact that United delays are different than Jet Blue delays.

Great. Let me summarize them.

H_{0}: P(x_{1}>x_{2})=0.5

H_{A}: P(x_{1}>x_{2})>0.5

H_{A}: P(x_{1}>x_{2})<0.5

H_{A}: P(x_{1}>x_{2}) \neq 0.5

Now, let’s pool all the data into a single set of seven values and order them from smallest to the largest. I will use red color to write the United’s delay times to know that this is from a different sample.

United: -4, 0
Jet Blue: -1, 3, -2, 15, 8

Pooled and ordered: -4, -2, -1, 0, 3, 8, 15

Can you look at the pooled and ordered data and tell me what the rank of -4 is?

That would be rank 1. The smallest value.

Right. And the rank of 15, the largest value in the pooled set, is 7. When two or more values are the same, then we assign the mean rank to all those values.

Would you agree that if either most of the smallest ranks or most of the largest ranks are coming from United flights, i.e., sample x_{1}, there would be more substantial evidence to reject the null hypothesis?

That is sensible.

So we can think of the sum of the ranks that correspond to values from x_{1} as the test-statistic. In our hypothesis test, the sum of the ranks of the values from United flights is 1 + 4 = 5. United flights have delays of -4 and 0 minutes. These values, in the pooled and ordered data, have ranks of 1 and 4. Their rank-sum is 5.

Test-statistic W = sum of the ranks associated with x_{1}, the variable with a smaller sample size, in the pooled and ordered data.

I get it. We should now check how likely it is to see a rank-sum value of 5 in the null distribution of the rank-sums under the assumption that H_{0} is true.

Yes, that is correct. Let’s outline all possible values of the rank-sums.

We have 2 values from x_{1} and 5 values from x_{2}. If the null hypothesis is true, i.e., if x_{1} and x_{2} are samples from the same distribution, the values will be intermingled and the values in x_{1} can take the following ranks.

Nice table. I can see that the smallest possible rank-sum is 3 when the two values in x_{1} take a rank of 1 and 2. If they are intermingled, it is also equally likely that they take on ranks of 1 and 7, in which case, the rank-sum will be 8.

Yup. Like this, I have written down all possible rank-combinations — from 1-2 that gives a rank-sum of 3 to 6-7 that gives a rank-sum of 13.

There are 21 such combinations.

We can also get the total combinations as \binom{7}{2}. Seven values are chosen two at a time.

Right. More generally, if there are n_{1} values in x_{1} and n_{2} values in x_{2}, the total rank-combinations are \binom{n_{1}+n_{2}}{n_{1}}.

From here, let me deduce the null distribution of the rank-sums. From the table, I can see that the possible values of rank-sums are 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13.

The rank-sum values of 3 and 4 occur only one time.
The rank-sum value of 5 occurs two times.
The rank-sum value of 6 occurs two times.
The rank-sum value of 7 occurs three times.
The rank-sum value of 8 occurs three times.
The rank-sum value of 9 occurs three times.
The rank-sum value of 10 occurs two times.
The rank-sum value of 11 occurs two times.
The rank-sum values of 12 and 13 occur one time.

Here, I also prepared a summary table that shows the frequency, probability, and cumulative probability.

The null distribution looks like this.

Very nicely done! You are showing the possible values of the test-statistic (W) on the x-axis and their relative frequency or the probability P(W) on the y-axis. W ranges from 3 to 13.

More generally, the possible values of W range from \frac{n_{1}(n_{1}+1)}{2} (the smallest rank-sum) to \frac{n_{1}(n_{1}+2n_{2}+1)}{2} (the largest rank-sum).

How did you get that? 😕

The smallest possible rank-sum will occur when all n_{1} values of x_{1} are to the left of x_{2}. It is the sum of the first n_{1} ranks — \frac{n_{1}(n_{1}+1)}{2}.

In the same way, the largest possible rank-sum will occur when all n_{1} values of x_{1} are to the right of x_{2}. To get the rank-sum of the last n_{1} values, we should first compute the rank-sum of the whole set of n_{1}+n_{2} values and subtract the rank-sum of n_{2} values from it.

In our example, the largest rank-sum is 6+7=13, which is the sum of the last two ranks. We can get the sum of 13 by first getting the full rank-sum 1+2+3+4+5+6+7 = 28 and then subtracting the rank-sum of the first five values: 1+2+3+4+5 = 15.

The full rank-sum is the sum of n_{1}+n_{2} integers = \frac{(n_{1}+n_{2})(n_{1}+n_{2}+1)}{2}

The rank-sum of the first n_{2} ranks is \frac{n_{2}(n_{2}+1)}{2}. This corresponds to all n_{1} values of x_{1} being on the right side (in the pooled and ordered values), with the remaining n_{2} values of x_{2} taking the first n_{2} ranks.

When we subtract this rank-sum from the full rank-sum we get \frac{n_{1}(n_{1}+2n_{2}+1)}{2}

Ah! That is clever.

Can you also see, in your null distribution table, a symmetry around the value of \frac{3+13}{2}=8?

Yes.

Can you get a generalized version of this mid-point rank-sum?

It is the mid-point of the smallest rank-sum of \frac{n_{1}(n_{1}+1)}{2} and the largest rank-sum of \frac{n_{1}(n_{1}+2n_{2}+1)}{2}

0.5*(\frac{n_{1}(n_{1}+1)}{2}+\frac{n_{1}(n_{1}+2n_{2}+1)}{2})=\frac{n_{1}(n_{1}+n_{2}+1)}{2}

Nice! The generalization of the frequency for each rank-sum can also be derived like this. Wilcoxon showed it in his paper.

At any rate, the observed test-statistic for our experiment is 5. So we compute the probability of finding a value as small as 5 — P(W \le 5) — in the null distribution. It turns out to be 0.1905.

If we opt for a 5% rate of rejection, we cannot reject the null hypothesis that x_{1} and x_{2} are samples from the same distribution. It is the case even for a 10% rate of rejection. The evidence against the null hypothesis is not convincing.

I will concede 🙁

But I like this method. It seems to have no assumptions on the limiting distributions, is a data-driven approach, and works very well for small sample sizes. The technique is a little tedious for larger sample sizes, isn’t it?

Yes. It is rather tedious to come up with the rank-sum table each time. But there are standardized lookup tables for it. There is also a large-sample approximation to a normal distribution. As you noticed, the null distribution is symmetric. It tends to a normal distribution for larger sample sizes, and we can derive the appropriate formula for the test-statistic. Most statistical programs use this approximation. We will look at how it is done in RStudio at a later time.
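For the impatient, here is a minimal sketch with R’s built-in wilcox.test function on two made-up delay samples whose pooled ranks give a rank-sum of 5. One caveat: R reports the statistic W as the rank-sum of the first sample minus \frac{n_{1}(n_{1}+1)}{2}.

# Wilcoxon rank-sum test in R on hypothetical delay data #
  x1 = c(10, 15)                  # two made-up United delays; pooled ranks 1 and 4
  x2 = c(12, 14, 21, 25, 30)      # five made-up Jet Blue delays

  wilcox.test(x1, x2, alternative = "less")
# R reports W = 2 (the rank-sum 5 minus n1*(n1+1)/2 = 3)
# and the exact p-value of 0.1905 that we computed by hand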

Did you notice that your rank-sum should have been 3 for you to reject the null hypothesis at the 5% rejection rate?

Ah. That is correct. It means that United Airlines’ delays should have always been lower than Jet Blue’s delays to disprove that there is no significant difference between the delays of United flights and Jet Blue flights.

When we apply this test to small sample sizes, it seldom results in a highly compelling case for rejecting the null hypothesis H_{0}. If we add more data, the null distribution smooths out, and it becomes possible to attain lower p-values, which provide a stronger case against H_{0}.

By the way, it looks like we are competing for the last prize. Based on the Department of Transportation’s on-time ranking, United and Jet Blue are close to last. You get some brownie points for claiming that United is better than Jet Blue.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 95 – The Two-Sample Hypothesis Test – Part IV

On the Difference in Means

using Welch’s t-Test

H_{0}: \mu_{1} - \mu_{2} = 0

H_{A}: \mu_{1} - \mu_{2} > 0

H_{A}: \mu_{1} - \mu_{2} < 0

H_{A}: \mu_{1} - \mu_{2} \neq 0

On the 24th day of January 2021, we examined Tom’s hypothesis on the Mohawk Rivers’ arsenic levels.

After a lengthy expose on the fundamentals behind hypothesis testing on the difference in means using a two-sample t-Test, we concluded that Tom could not reject the null hypothesis H_{0}: \mu_{1}-\mu_{2}=0.

He cannot count on the theory that the factory is illegally dumping their untreated waste into the west branch Mohawk River until he finds more evidence.

However, Tom now has a new theory that the factory is illegally dumping their untreated waste into both the west and the east branches of the Mohawk River. So, he took Ron with him to collect data from Utica River, a tributary of Mohawk that branches off right before the factory.

If his new theory is correct, he should find that the mean arsenic concentration in either west or east branches should be significantly greater than the mean arsenic concentration in the Utica River.

There are now three samples whose concentration in parts per billion are:

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7
n_{1}=10 | \bar{x_{1}} = 12.5 | s_{1}^{2} = 58.28

East Branch: 4, 6, 24, 11, 14, 7, 11, 25, 13, 5
n_{2}=10 | \bar{x_{2}} = 12 | s_{2}^{2} = 54.89

Utica River: 4, 4, 6, 4, 5, 7, 8
n_{3}=7 | \bar{x_{3}} = 5.43 | s_{3}^{2} = 2.62

Were you able to help Tom with his new hypothesis?

In his first hypothesis test, since the sample variances were close to each other (58.28 and 54.89), we assumed that the population variances are equal and proceeded with a t-Test.

Under the proposition that the population variances of two random variables X_{1} and X_{2} are equal, i.e., \sigma_{1}^{2}=\sigma_{2}^{2}=\sigma^{2}, the test-statistic is t_{0}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}, where s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2} is the pooled variance. t_{0} follows a T-distribution with n_{1}+n_{2}-2 degrees of freedom.

But, can we make the same assumption in the new case? The Utica River sample has a sample variance s_{3}^{2} of 2.62. It will not be reasonable to assume that the population variances are equal.

How do we proceed when \sigma_{1}^{2} \neq \sigma_{2}^{2}?

Let’s go back a few steps and outline how we arrived at the test-statistic.

The hypothesis test is on the difference in means: \mu_{1} - \mu_{2}

A good estimator of the difference in population means is the difference in sample means: y = \bar{x_{1}} - \bar{x_{2}}

The expected value of y, E[y], is \mu_{1}-\mu_{2}, and the variance of y, V[y], is \frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}.

Since y \sim N(\mu_{1}-\mu_{2},\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}), its standardized version z = \frac{y-(\mu_{1}-\mu_{2})}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}} is the starting point to deduce the test-statistic.

This statistic reduces to z = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}} under the null hypothesis that \mu_{1}-\mu_{2}=0.

Last week, we entertained the idea that we are comparing the difference in the means of two populations whose variance is equal and reasoned that the test-statistic follows a T-distribution with (n_{1}+n_{2}-2) degrees of freedom.

This is because the pooled population variance \sigma^{2} can be replaced by its unbiased estimator s^{2}, which, in turn is related to a Chi-squared distribution with (n_{1}+n_{2}-2) degrees of freedom.

When the population variances are not equal, i.e., when \sigma_{1}^{2} \neq \sigma_{2}^{2}, there is no pooled variance that can be related to the Chi-square distribution.

The best estimate of V[y] is obtained by replacing the individual population variances (\sigma_{1}^{2}, \sigma_{2}^{2}) by the sample variances (s_{1}^{2}, s_{2}^{2}).

Hence,

z = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}

We should now identify the limiting distribution of V[y]

Bernard Lewis Welch, a British statistician, in his works in 1936, 1938, and 1947, explained that, with some adjustments, V[y] can be approximated by a Chi-square distribution, and hence the test-statistic \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}} can be approximated by a T-distribution.

Let’s digest the salient points of his work

Assume \lambda_{1} = \frac{1}{n_{1}} and \lambda_{2} = \frac{1}{n_{2}}

Assume f_{1} = (n_{1}-1) and f_{2} = (n_{2}-1)

An estimate for the variance of y is \lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2}

Now, in Lesson 73, we learned that \frac{(n-1)s^{2}}{\sigma^{2}} is a Chi-square distribution. Based on this logic, we can write

s^{2} = \frac{1}{n-1}\sigma^{2}*\chi^{2}

So, \lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2} is of the form,

\lambda_{1}\frac{1}{n_{1}-1}\sigma_{1}^{2}\chi_{1}^{2}+\lambda_{2}\frac{1}{n_{2}-1}\sigma_{2}^{2}\chi_{2}^{2}

or,

\lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2} = a\chi_{1}^{2}+b\chi_{2}^{2}

a = \frac{\lambda_{1}\sigma_{1}^{2}}{f_{1}} | b = \frac{\lambda_{2}\sigma_{2}^{2}}{f_{2}}

Welch showed that if z = a\chi_{1}^{2}+b\chi_{2}^{2}, then, the distribution of z can be approximated using a Chi-square distribution with a random variable \chi=\frac{z}{g} and f degrees of freedom.

He found the constants f and g by equating the moments of z with the moments of this Chi-square distribution, i.e., the Chi-square distribution where the random variable \chi is \frac{z}{g}.

This is how he finds f and g

The first and the second moments of a general Chi-square distribution are the degrees of freedom (f) and two times the degrees of freedom (2f).

The random variable we considered is \chi=\frac{z}{g}.

So,

E[\frac{z}{g}] = f | V[\frac{z}{g}] = 2f

Since g is a constant that needs to be estimated, we can reduce these equations as

\frac{1}{g}E[z] = f | \frac{1}{g^{2}}V[z] = 2f

Hence,

E[z] = gf | V[z] = 2g^{2}f

Now, let’s take the equation z = a\chi_{1}^{2}+b\chi_{2}^{2} and find the expected value and the variance of z.

E[z] = aE[\chi_{1}^{2}]+bE[\chi_{2}^{2}]

E[z] = af_{1}+bf_{2}

V[z] = a^{2}V[\chi_{1}^{2}]+b^{2}V[\chi_{2}^{2}]

V[z] = a^{2}2f_{1}+b^{2}2f_{2}=2a^{2}f_{1}+2b^{2}f_{2}

Now, he equates the moments derived using the equation for z to the moments derived from the Chi-square distribution of \frac{z}{g}

Equating the first moments: gf=af_{1}+bf_{2}

Equating the second moments: 2g^{2}f = 2a^{2}f_{1}+2b^{2}f_{2}

The above equation can be written as

g*(gf) = a^{2}f_{1}+b^{2}f_{2}

or,

g*(af_{1}+bf_{2}) = a^{2}f_{1}+b^{2}f_{2}

From here,

g = \frac{a^{2}f_{1}+b^{2}f_{2}}{af_{1}+bf_{2}}

Using this with gf=af_{1}+bf_{2}, we can obtain f.

f = \frac{(af_{1}+bf_{2})^{2}}{a^{2}f_{1}+b^{2}f_{2}}

Since we know the terms for a and b, we can say,

f = \frac{(\lambda_{1}\sigma_{1}^{2}+\lambda_{2}\sigma_{2}^{2})^{2}}{\frac{\lambda_{1}^{2}\sigma_{1}^{4}}{f_{1}}+\frac{\lambda_{2}^{2}\sigma_{2}^{4}}{f_{2}}}

Since the variance of y follows an approximate Chi-square distribution with f degrees of freedom, we can assume that the test-statistic \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}} follows an approximate T-distribution with f degrees of freedom.

Welch also showed that an unbiased estimate for f is

f = \frac{(\lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2})^{2}}{\frac{\lambda_{1}^{2}s_{1}^{4}}{f_{1}+2}+\frac{\lambda_{2}^{2}s_{2}^{4}}{f_{2}+2}} - 2

Essentially, he substitutes s_{1}^{2} for \sigma_{1}^{2} and s_{2}^{2} for \sigma_{2}^{2}, and, to correct for the bias, he adds 2 to the degrees of freedom in the denominator and subtracts an overall 2 from the fraction. He argues that this correction produces the best unbiased estimate for f.

Later authors like Franklin E. Satterthwaite showed that the bias correction might not be necessary, and it would suffice to use s_{1}^{2} for \sigma_{1}^{2}, and s_{2}^{2} for \sigma_{2}^{2} in the original equation, as in,

f = \frac{(\lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2})^{2}}{\frac{\lambda_{1}^{2}s_{1}^{4}}{f_{1}}+\frac{\lambda_{2}^{2}s_{2}^{4}}{f_{2}}}

Since we know that \lambda_{1} = \frac{1}{n_{1}}, \lambda_{2} = \frac{1}{n_{2}}, f_{1} = (n_{1}-1), and f_{2} = (n_{2}-1), we can finally say

When the population variances are not equal, i.e., when \sigma_{1}^{2} \neq \sigma_{2}^{2}, the test-statistic is t_{0}^{*}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}, and it follows an approximate T-distribution with f degrees of freedom.

The degrees of freedom can be estimated as

f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1) + 2}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)+2}} - 2

or,

f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1)}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)}}

This is now popularly known as Welch’s t-Test.

Let’s now go back to Tom and help him with his new theory.

We will compare the west branch Mohawk River with the Utica River.

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7
n_{1}=10 | \bar{x_{1}} = 12.5 | s_{1}^{2} = 58.28

Utica River: 4, 4, 6, 4, 5, 7, 8
n_{3}=7 | \bar{x_{3}} = 5.43 | s_{3}^{2} = 2.62

Since we cannot assume that the population variances are equal, we will use Welch’s t-Test.

We compute the test-statistic and check how likely it is to see such a value in a T-distribution (approximate null distribution) with so many degrees of freedom.

t_{0}^{*}=\frac{\bar{x_{1}}-\bar{x_{3}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{3}^{2}}{n_{3}}}}

t_{0}^{*}=\frac{12.5-5.43}{\sqrt{\frac{58.28}{10}+\frac{2.62}{7}}}

t_{0}^{*}=2.84

Let’s compute the bias-corrected degrees of freedom suggested by Welch.

f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1) + 2}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)+2}} - 2

f = \frac{(\frac{58.28}{10}+\frac{2.62}{7})^{2}}{\frac{(58.28/10)^{2}}{(10 - 1) + 2}+\frac{(2.62/7)^{2}}{(7-1)+2}} - 2

f = 10.38

We can round it down to 10 degrees of freedom.
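As an aside, R’s t.test function runs Welch’s t-Test when we set var.equal = FALSE (its default). One caveat: R uses the Satterthwaite form of the degrees of freedom, so it reports roughly 10.1 instead of Welch’s bias-corrected 10.38; the test-statistic is the same 2.84.

# Welch's t-Test in R #
  west  = c(3, 7, 25, 10, 15, 6, 12, 25, 15, 7)
  utica = c(4, 4, 6, 4, 5, 7, 8)

  t.test(west, utica, alternative = "greater", var.equal = FALSE)
# t = 2.84 on roughly 10.1 (Satterthwaite) degrees of freedom;
# the one-sided p-value is about 0.009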

The test-statistic is 2.84. Since the alternate hypothesis is that the difference is greater than zero, Tom has to verify how likely it is to see a value greater than 2.84 in the approximate null distribution. Tom has to reject the null hypothesis if this probability (p-value) is smaller than the selected rate of rejection. 

Look at this visual.

The distribution is an approximate T-distribution with 10 degrees of freedom. Since he opted for a rejection level of 10%, there is a cutoff on the distribution at 1.37.

1.37 is the quantile on the right tail corresponding to a 10% probability (rate of rejection) for a T-distribution with ten degrees of freedom.

If the test statistic (t_{0}^{*}) is greater than t_{critical}, which is 1.37, he will reject the null hypothesis. At that point (i.e., at values greater than 1.37), there would be sufficient confidence to say that the difference is significantly greater than zero.

It is equivalent to rejecting the null hypothesis if P(T > t_{0}^{*}) (the p-value) is less than \alpha

We can read t_{critical} off the standard T-table, or we can compute P(T > t_{0}^{*}) from the distribution.

At ten degrees of freedom (df=10), and \alpha=0.1, t_{critical}=1.372 and P(T>t_{0}^{*})=0.009.

Since the test-statistic t_{0}^{*} is in the rejection region, or since p-value < \alpha, Tom can reject the null hypothesis H_{0} that \mu_{1}-\mu_{2}=0.
He now has evidence beyond statistical doubt to claim that the factory is illegally dumping their untreated waste into the west branch Mohawk River.

Is it time for a lawsuit?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 94 – The Two-Sample Hypothesis Test – Part III

On the Difference in Means

using the t-Test

H_{0}: \mu_{1} - \mu_{2} = 0

H_{A}: \mu_{1} - \mu_{2} > 0

H_{A}: \mu_{1} - \mu_{2} < 0

H_{A}: \mu_{1} - \mu_{2} \neq 0

Tom grew up in the City of Mohawk, the land of natural springs, known for its pristine water. Tom’s house is alongside the west branch of Mohawk, one such pristine river. Every once in a while, Tom and his family go to the nature park on the banks of Mohawk. It is customary for Tom and his children to take a swim.

Lately, he has been reading in the local newspapers that the rivers’ arsenic levels have increased. Tom starts attributing the alleged arsenic increases to the new factory in his neighborhood just upstream of Mohawk. They could be illegally dumping their untreated waste into the west branch.

He decided to test the waters of the west branch and the east branch of the Mohawk River. 

His buddy Ron, an environmental engineer, would help him with the laboratory testing and the likes.

Over the next ten days, Tom and Ron collected water samples from the west and east branches, and Ron got them tested in his lab for arsenic concentration.

In parts per billion, the sample data looked like this.

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7

East Branch: 4, 6, 24, 11, 14, 7, 11, 25, 13, 5

If Tom’s theory is correct, they should find the average arsenic concentration in the west branch to be significantly greater than the average arsenic concentration in the east branch. 

How can Tom test his theory?

.

.

.

You are right!

He can use the hypothesis testing framework and verify if there is evidence beyond a statistical doubt.

Tom establishes the null and alternate hypotheses. He assumes that the factory does not illegally dump their untreated waste into the west branch, so the average arsenic concentration in the west branch should be equal to the average arsenic concentration in the east branch Mohawk River. Or, the difference in their means is zero.

H_{0}: \mu_{1} - \mu_{2} = 0

Against this null hypothesis, he pits his theory that they are indeed illegally dumping their untreated waste. So, the average arsenic concentration in the west branch should be greater than the average arsenic concentration in the east branch Mohawk River — the difference in their means is greater than zero.

H_{A}: \mu_{1} - \mu_{2} > 0

The alternate hypothesis is one-sided. A significant positive difference needs to be seen to reject the null hypothesis.

Tom is taking a 10% risk of rejecting the null hypothesis; \alpha = 0.1. His Type I error is 10%.

Suppose the factory does not affect the water quality, but the ten samples he collected show a difference in the sample means much greater than zero. He would then reject the null hypothesis, committing an error (Type I error) in his decision making.

You may already know that there is a certain level of subjectivity in the choice of \alpha.

Tom may want to prove that this factory is the leading cause of the increased arsenic levels in the west branch. So he could accept a greater risk of rejecting the null hypothesis, i.e., he would be inclined to select a larger value of \alpha.

Someone who represents the factory management would be inclined to select a smaller value for \alpha, as that makes it less likely to reject the null hypothesis.

So, assuming that the null hypothesis is true, the decision to reject or not to reject is based on the value one chooses for \alpha.

Anyhow, now that the basic testing framework is set up, let’s look at what Tom needs.

He needs a test-statistic to represent the difference in the means of the two samples.
He needs the null distribution that this test-statistic converges to. In other words, he needs a probability distribution of the test-statistic to verify his null hypothesis: how likely it is to see a value as large as (or larger than) the test-statistic in the null distribution.

Let’s take Tom on a mathematical excursion

There are two samples represented by random variables X_{1} and X_{2}.

The mean and variance of X_{1} are \mu_{1} and \sigma_{1}^{2}. We have one sample of size n_{1} from this population. Suppose the sample mean and the sample variance are \bar{x_{1}} and s_{1}^{2}.

The mean and variance of X_{2} are \mu_{2} and \sigma_{2}^{2}. We have one sample of size n_{2} from this population. Suppose the sample mean and the sample variance are \bar{x_{2}} and s_{2}^{2}.

The hypothesis test is on the difference in means; \mu_{1} - \mu_{2}

Naturally, a good estimator of the difference in population means (\mu_{1}-\mu_{2}) is the difference in sample means (\bar{x_{1}}-\bar{x_{2}}).

y = \bar{x_{1}}-\bar{x_{2}}

If we know the probability distributions of \bar{x_{1}} and \bar{x_{2}}, we could perhaps infer the probability distribution of y.

The sample mean is an unbiased estimate of the true mean, so the expected value of the sample mean is equal to the truth. E[\bar{x}] = \mu. We learned this in Lesson 67.

The variance of the sample mean (V[\bar{x}]) is \frac{\sigma^{2}}{n}. It indicates the spread around the center of the distribution. We learned this in Lesson 68.

Putting these two together, and with the central limit theorem, we can say \bar{x} \sim N(\mu,\frac{\sigma^{2}}{n})

So, for the two samples,

\bar{x_{1}} \sim N(\mu_{1},\frac{\sigma_{1}^{2}}{n_{1}}) | \bar{x_{2}} \sim N(\mu_{2},\frac{\sigma_{2}^{2}}{n_{2}})

Since \bar{x_{1}} and \bar{x_{2}} are normally distributed and independent, their difference y=\bar{x_{1}}-\bar{x_{2}} will also be normally distributed.

y \sim N(E[y], V[y])

We should see what E[y] and V[y] are.

Expected Value of y

y = \bar{x_{1}} - \bar{x_{2}}

E[y] = E[\bar{x_{1}} - \bar{x_{2}}]

E[y] = E[\bar{x_{1}}] - E[\bar{x_{2}}]

Since the expected value of the sample mean is the true population mean,

E[y] = \mu_{1} - \mu_{2}

Variance of y

y = \bar{x_{1}} - \bar{x_{2}}

V[y] = V[\bar{x_{1}} - \bar{x_{2}}]

V[y] = V[\bar{x_{1}}] + V[\bar{x_{2}}] (can you tell why?)

Since, the variance of the sample mean (V[\bar{x}]) is \frac{\sigma^{2}}{n},

V[y] = \frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}

Using the expected value and the variance of y, we can now say that

y \sim N((\mu_{1} - \mu_{2}), (\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}))

Or, if we standardize it,

z = \frac{y-(\mu_{1} - \mu_{2})}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}}} \sim N(0, 1)

At this point, it might be tempting to say that z is the test-statistic and the null distribution is the standard normal distribution.

Hold on!

We can certainly say that, under the null hypothesis, \mu_{1} = \mu_{2}. So, z further reduces to

z = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}}} \sim N(0, 1)

However, there are two unknowns here: the population variances \sigma_{1}^{2} and \sigma_{2}^{2}.

Think about making some assumptions about these unknowns

What could be a reasonable estimate for these unknowns?

.

.

.

You got it. The sample variance s_{1}^{2} and s_{2}^{2}.

Now, let’s take the samples that Tom and Ron collected and compute the sample mean and sample variance for each. The equations for these are anyhow at your fingertips!

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7
\bar{x_{1}}=12.5 | s_{1}^{2}=58.28

East Branch: 4, 6, 24, 11, 14, 7, 11, 25, 13, 5
\bar{x_{2}}=12 | s_{2}^{2}=54.89

Look at the value of the sample variance. 58.28 and 54.89. They seem close enough. Maybe, just maybe, could the variance of the samples be equal?

I am asking you to entertain the proposition that the population variances of the two random variables X_{1} and X_{2} are equal and that we are comparing the difference in the means of two populations whose variance is equal.

Say, \sigma_{1}^{2}=\sigma_{2}^{2}=\sigma^{2}

Our test-statistic will now reduce to

\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{\sigma^{2}}{n_{1}} + \frac{\sigma^{2}}{n_{2}}}}

or

\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\sigma^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

There is a common variance \sigma^{2}, and it should suffice to come up with a reasonable estimate for this combined variance.

Let’s say that s^{2} is a good estimate for \sigma^{2}.

We call this the estimate for the pooled variance

If we have a formula for s^{2} and if it is an unbiased estimate for \sigma^{2}, then, we can replace s^{2} for \sigma^{2}.

What is the right equation for s^{2}?

Since s^{2} is the estimate for the pooled variance, it will be reasonable to assume that it is some weighted average of the individual sample variances.

s^{2} = w_{1}s_{1}^{2} + w_{2}s_{2}^{2}

Let’s compute the expected value of s^{2}.

E[s^{2}] = E[w_{1}s_{1}^{2} + w_{2}s_{2}^{2}]

E[s^{2}] = w_{1}E[s_{1}^{2}] + w_{2}E[s_{2}^{2}]

In Lesson 70, we learned that sample variance s^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2} is an unbiased estimate for population variance \sigma^{2}. The factor (n-1) in the denominator is to correct for a bias of -\frac{\sigma^{2}}{n}.

An unbiased estimate means E[s^{2}]=\sigma^{2}. If we apply this understanding to the pooled variance equation, we can see that

E[s^{2}] = w_{1}\sigma_{1}^{2} + w_{2}\sigma_{2}^{2}

Since \sigma_{1}^{2}=\sigma_{2}^{2} = \sigma^{2}, we get

E[s^{2}] = w_{1}\sigma^{2} + w_{2}\sigma^{2}

E[s^{2}] = (w_{1} + w_{2})\sigma^{2}

To get an unbiased estimate for \sigma^{2}, we need the weights to add up to 1. What could those weights be? Could it relate to the sample sizes?

With that idea in mind, let’s take a little detour.

Let me ask you a question.

What is the probability distribution that relates to sample variance?

.

.

.

You might have to go down memory lane to Lesson 73.

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom. We learned that the term \frac{(n-1)s^{2}}{\sigma^{2}} is a sum of (n-1) squared standard normal distributions.

So,

\frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} \sim \chi^{2}_{n_{1}-1} | \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} \sim \chi^{2}_{n_{2}-1}

Since the two samples are independent, the sum of these two terms, \frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} and \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} will follow a Chi-square distribution with (n_{1}+n_{2}-2) degrees of freedom.

Add them and see. \frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} is a sum of (n_{1}-1) squared standard normal distributions, and \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} is a sum of (n_{2}-1) squared standard normal distributions. Together, they are a sum of (n_{1}+n_{2}-2) squared standard normal distributions.

Since \frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} + \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} \sim \chi^{2}_{n_{1}+n_{2}-2}

we can say,

\frac{(n_{1}+n_{2}-2)s^{2}}{\sigma^{2}} \sim \chi^{2}_{n_{1}+n_{2}-2}

So we can think of developing weights in terms of the degrees of freedom of the Chi-square distribution. The first sample contributes n_{1}-1 degrees of freedom, and the second sample contributes n_{2}-1 degrees of freedom towards a total of n_{1}+n_{2}-2.

So the weight of the first sample can be w_{1}=\frac{n_{1}-1}{n_{1}+n_{2}-2} and the weight of the second sample can be w_{2}=\frac{n_{2}-1}{n_{1}+n_{2}-2}, and they add up to 1.

This then means that the equation for the estimate of the pooled variance is

s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2}

Since it is an unbiased estimate of \sigma^{2}, we can use s^{2} in place of \sigma^{2} in our test-statistic, which then looks like this.

t_{0} = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

where, s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2} is the estimate for the pooled variance.

Did you notice that I use t_{0} to represent the test-statistic?

Yes, I am getting to T with it.

It is a logical extension of the idea that, for one sample, t_{0}=\frac{\bar{x}-\mu}{\sqrt{\frac{s^{2}}{n}}} follows a T-distribution with (n-1) degrees of freedom.

There, the idea was derived from the fact that when you replace the population variance with sample variance, and the sample variance is related to a Chi-square distribution with (n-1) degrees of freedom, the test-statistic t_{0}=\frac{\bar{x}-\mu}{\sqrt{\frac{s^{2}}{n}}} follows a T-distribution with (n-1) degrees of freedom. Check out Lesson 73 (Learning from “Student”) to refresh your memory.

Here, in the case of the difference in means between two samples (\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\sigma^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}), the pooled population variance \sigma^{2} is replaced by its unbiased estimator (\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2}, which, in turn, is related to a Chi-square distribution with (n_{1}+n_{2}-2) degrees of freedom.

Hence, under the proposition that the population variances of the two random variables X_{1} and X_{2} are equal, the test-statistic is t_{0} = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}, and it follows a T-distribution with (n_{1}+n_{2}-2) degrees of freedom.


Now let’s evaluate the hypothesis that Tom set up — Finally!!

We can compute the test-statistic and check how likely it is to see such a value in a T-distribution (null distribution) with so many degrees of freedom.

t_{0} = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

where s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2}

In Tom’s case, both n_{1}=n_{2}=10. So the weights will be equal to 0.5.

s^{2}=0.5s_{1}^{2}+0.5s_{2}^{2}

s^{2}=0.5*58.28+0.5*54.89

s^{2}=56.58

t_{0} = \frac{12.5-12}{\sqrt{56.58(\frac{1}{10}+\frac{1}{10})}}

t_{0} = \frac{0.5}{\sqrt{11.316}}

t_{0} = 0.1486

The test-statistic is 0.1486. Since the alternate hypothesis is that the difference is greater than zero (H_{A}:\mu_{1}-\mu_{2}>0), Tom has to verify how likely it is to see a value greater than 0.1486 in the null distribution. Tom has to reject the null hypothesis if this probability (p-value) is smaller than the selected rate of rejection. A probability smaller than \alpha indicates that the difference is sufficiently large that, in a T-distribution with so many degrees of freedom, the likelihood of seeing a value greater than the test-statistic is small. In other words, the difference in the means is already sufficiently greater than zero and in the region of rejection.

Look at this visual.

The distribution is a T-distribution with 18 degrees of freedom (10 + 10 – 2). Tom had collected ten samples each for this test. Since he opted for a rejection level of 10%, there is a cutoff on the distribution at 1.33.

1.33 is the quantile on the right tail corresponding to a 10% probability (rate of rejection) for a T-distribution with eighteen degrees of freedom.

If the test statistic (t_{0}) is greater than t_{critical}, which is 1.33, he will reject the null hypothesis. At that point (i.e., at values greater than 1.33), there would be sufficient confidence to say that the difference is significantly greater than zero.

This decision is equivalent to rejecting the null hypothesis if P(T>t_{0}) (the p-value) is less than \alpha.

We can read t_{critical} off the standard T-table, or P(T>t_{0}) can be computed from the distribution.

At df = 18, and \alpha=0.1, t_{critical}=1.33 and P(T>t_{0})=0.44.
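By the way, R’s t.test function with var.equal = TRUE reproduces all of these numbers in one line.

# Two-sample t-Test with pooled variance in R #
  west = c(3, 7, 25, 10, 15, 6, 12, 25, 15, 7)
  east = c(4, 6, 24, 11, 14, 7, 11, 25, 13, 5)

  t.test(west, east, alternative = "greater", var.equal = TRUE)
# t = 0.1486, df = 18, one-sided p-value = 0.44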

Since the test-statistic t_{0} is not in the rejection region, or since p-value > \alpha, Tom cannot reject the null hypothesis H_{0} that \mu_{1} - \mu_{2}=0.
He cannot count on the theory that the factory is illegally dumping their untreated waste into the west branch Mohawk River until he finds more evidence.

Tom is not convinced. The endless newspaper stories on arsenic levels are bothering him. He begins to wonder whether the factory is illegally dumping their untreated waste into both the west and east branches of the Mohawk River. That could explain why he saw no significant difference in the concentrations.

Over the next week, he takes Ron with him to Utica River, a tributary of Mohawk that branches off right before Mohawk meets the factory. If his new theory is correct, he should find that the mean arsenic concentration in either the west or east branches should be significantly greater than the mean arsenic concentration in the Utica River.

Ron again helped him with the laboratory testing. In parts per billion, the third sample data looks like this.

Utica River: 4, 4, 6, 4, 5, 7, 8

There are 7 data points. n_{3}=7

The sample mean \bar{x_{3}}=5.43

The sample variance s_{3}^{2} = 2.62

Can you help Tom with his new hypothesis tests? Does the proposition still hold?

To be continued…

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 93 – The Two-Sample Hypothesis Test – Part II

On the Difference in Proportions

H_{0}: p_{1}-p_{2} = 0

H_{A}: p_{1}-p_{2} > 0

H_{A}: p_{1}-p_{2} < 0

H_{A}: p_{1}-p_{2} \neq 0

Joe and Mumble are interested in getting people’s opinion on the preference for a higher than 55 mph speed limit for New York State.
Joe spoke to ten of his rural friends, of whom seven supported the idea of increasing the speed limit to 65 mph. Mumble spoke to eighteen of his urban friends, of whom five favored a speed limit of 65 mph over the current limit of 55 mph.

Can we say that the sentiment for increasing the speed limit is stronger among rural than among urban residents?

We can use a hypothesis testing framework to address this question.

Last week, we learned how Fisher’s Exact test could be used to verify the difference in proportions. The test-statistic for the two-sample hypothesis test follows a hypergeometric distribution when H_{0} is true.

We also learned that, in more generalized cases where the number of successes is not known apriori, we could assume that the number of successes is fixed at t=x_{1}+x_{2}, and, for a fixed value of t, we reject H_{0}:p_{1}=p_{2} for the alternate hypothesis H_{A}:p_{1}>p_{2} if there are more successes in random variable X_{1} compared to X_{2}.

In short, the p-value can be derived under the assumption that the number of successes X=k in the first sample X_{1} has a hypergeometric distribution when H_{0} is true and conditional on a total number of t successes that can come from any of the two random variables X_{1} and X_{2}.

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}


Let’s apply this principle to the two samples that Joe and Mumble collected.

Let X_{1} be the random variable that denotes Joe’s rural sample. He surveyed a total of n_{1}=10 people and x_{1}=7 favored an increase in the speed limit. So the proportion p_{1} based on the number of successes is 0.7.

Let X_{2} be the random variable that denotes Mumble’s urban sample. He surveyed a total of n_{2}=18 people. x_{2}=5 out of the 18 favored an increase in the speed limit. So the proportion p_{2} based on the number of successes is 0.2778.

Let the total number of successes in both the samples be t=x_{1}+x_{2}=7+5=12.

Let’s also establish the null and alternate hypotheses.

H_{0}: p_{1}-p_{2}=0

H_{A}: p_{1}-p_{2}>0

The alternate hypothesis says that the sentiment for increasing the speed limit is stronger among rural (p_{1}) than among urban residents (p_{2}).

Larger values of x_{1} and smaller values of x_{2} support the alternate hypothesis H_{A} that p_{1}>p_{2} when t is fixed.

For a fixed value of t, we reject H_{0} if there are more successes in X_{1} compared to X_{2}.

Conditional on a total number of t successes from any of the two random variables, the number of successes X=k in the first sample has a hypergeometric distribution when H_{0} is true.

In the rural sample that Joe surveyed, seven favored an increase in the speed limit. So we can compute the p-value as the probability of obtaining more than seven successes in a rural sample of 10 when the total successes t from either urban or rural samples are twelve.

p-value=P(X \ge k) = P(X \ge 7)

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}

P(X=7) = \frac{\binom{12}{7}\binom{10+18-12}{10-7}}{\binom{10+18}{10}} =\frac{\binom{12}{7}\binom{16}{3}}{\binom{28}{10}} = 0.0338

A total of 12 successes exist, out of which the number of ways of choosing 7 is \binom{12}{7}.

A total of 28 – 12 = 16 non-successes exist, out of which the number of ways of choosing 10 – 7 = 3 non-successes is \binom{16}{3}.

A total sample of 10 + 18 = 28 exists, out of which the number of ways of choosing ten samples is \binom{28}{10}.

When we put them together, we can derive the probability P(X=7) for the hypergeometric distribution when H_{0} is true.

P(X=7) = \frac{\binom{12}{7}\binom{10+18-12}{10-7}}{\binom{10+18}{10}} =\frac{\binom{12}{7}\binom{16}{3}}{\binom{28}{10}} = 0.0338

Applying the same logic for k = 8, 9, and 10, we can derive their respective probabilities.

P(X=8) = \frac{\binom{12}{8}\binom{10+18-12}{10-8}}{\binom{10+18}{10}} =\frac{\binom{12}{8}\binom{16}{2}}{\binom{28}{10}} = 0.0045

P(X=9) = \frac{\binom{12}{9}\binom{10+18-12}{10-9}}{\binom{10+18}{10}} =\frac{\binom{12}{9}\binom{16}{1}}{\binom{28}{10}} = 0.0003

P(X=10) = \frac{\binom{12}{10}\binom{10+18-12}{10-10}}{\binom{10+18}{10}} =\frac{\binom{12}{10}\binom{16}{0}}{\binom{28}{10}} = 5.029296*10^{-6}

The p-value can be computed as the sum of these probabilities.

p-value=P(X \ge k) = P(X = 7)+P(X = 8)+P(X = 9)+P(X = 10)=0.0386

Visually, the null distribution will look like this.

The x-axis shows the number of possible successes in X_{1}. They range from k = 0 to k = 10. The vertical bars are showing P(X=k) as derived from the hypergeometric distribution. The area highlighted in red is the p-value, the probability of finding \ge seven successes in a rural sample of 10 people.

The p-value is the probability of obtaining the computed test statistic under the null hypothesis. 

The smaller the p-value, the less likely the observed statistic under the null hypothesis – and stronger evidence of rejecting the null.

Suppose we select a rate of error \alpha of 5%.

Since the p-value (0.0386) is smaller than our selected rate of error (0.05), we reject the null hypothesis for the alternate view that the sentiment for increasing the speed limit is stronger among rural (p_{1}) than among urban residents (p_{2}).
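As a quick check, R’s dhyper function evaluates these hypergeometric probabilities directly, and fisher.test on the 2×2 contingency table returns the same one-sided p-value.

# p-value from the hypergeometric null distribution #
  sum(dhyper(7:10, m = 12, n = 16, k = 10))    # P(X >= 7) = 0.0386

# equivalently, Fisher's Exact Test on the 2x2 table #
  survey = matrix(c(7, 3, 5, 13), nrow = 2)    # columns: rural, urban; rows: favor, do not favor
  fisher.test(survey, alternative = "greater")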

Let me remind you that this decision is based on the assumption that the null hypothesis is correct. Under this assumption, since we selected \alpha = 5\%, we will reject the true null hypothesis 5% of the time. At the same time, we will fail to reject the null hypothesis 95% of the time. In other words, 95% of the time, our decision to not reject the null hypothesis will be correct.

What if Joe and Mumble surveyed many more people?

You must be wondering that Joe and Mumble surveyed just a few people, which is not enough to derive any decent conclusion for a question like this. Perhaps they just called up their friends!

Let’s do a thought experiment. What would the null distribution look like if Joe and Mumble had double the sample size and the successes also increased in the same proportion? Would the p-value change?

Say Joe had surveyed 20 people, and 14 had favored an increase in the speed limit. n_{1} = 20; x_{1} = 14; p_{1} = 0.7.

Say Mumble had surveyed 36 people, and 10 had favored an increase in the speed limit. n_{2} = 36; x_{2} = 10; p_{2} = 0.2778.

p-value will then be P(X \ge 14) when there are 24 total successes.

The null distribution will look like this.

Notice that the null distribution is much more symmetric and looks like a bell curve (normal distribution) with an increase in the sample size. The p-value is 0.0026. More substantial evidence for rejecting the null hypothesis.

Is there a limiting distribution for the difference in proportions? If there is one, can we use it as the null distribution for the hypothesis test on the difference in proportions when the sample sizes are large?

While we embark on this derivation, let’s ask Joe and Mumble to survey many more people. When they are back, we will use new data to test the hypothesis.

But first, what is the limiting distribution for the difference in proportion?

We have two samples X_{1} and X_{2} of sizes n_{1} and n_{2}.

We might observe x_{1} and x_{2} successes in each of these samples. Hence, the proportions p_{1}, p_{2} can be estimated using \hat{p_{1}} = \frac{x_{1}}{n_{1}} and \hat{p_{2}} = \frac{x_{2}}{n_{2}}.

See, we are using \hat{p_{1}}, \hat{p_{2}} as the estimates of the true proportions p_{1}, p_{2}.

Take X_{1}. If the probability of success (proportion) is p_{1}, in a sample of n_{1}, we could observe x_{1}=0, 1, 2, 3, \cdots, n_{1} successes with a probabilty P(X=x_{1}) that is governed by a binomial distribution. In other words,

x_{1} \sim Bin(n_{1},p_{1})

Same logic applies to X_{2}.

x_{2} \sim Bin(n_{2},p_{2})

A binomial distribution tends to a normal distribution for large sample sizes; it can be estimated very accurately using the normal density function. We learned this in Lesson 48.

If you are curious as to how a binomial distribution function f(x)=\frac{n!}{(n-x)!x!}p^{x}(1-p)^{n-x} can be approximated by a normal density function f(x)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-1}{2}(\frac{x-\mu}{\sigma})^{2}}, look at this link.

But what is the limiting distribution for \hat{p_{1}} and \hat{p_{2}}?

x_{1} is the sum of n_{1} independent Bernoulli random variables (yes or no responses from the people). For a large enough sample size n_{1}, the distribution function of x_{1}, which is a binomial distribution, can be well-approximated by the normal distribution. Since \hat{p_{1}} is a linear function of x_{1}, the random variable \hat{p_{1}} can also be assumed to be normally distributed.

When both \hat{p_{1}} and \hat{p_{2}} are normally distributed, and when they are independent of each other, their sum or difference will also be normally distributed. We can derive it using the convolution of \hat{p_{1}} and \hat{p_{2}}.

Let Y = \hat{p_{1}}-\hat{p_{2}}

Y \sim N(E[Y], V[Y]) since both \hat{p_{1}}, \hat{p_{2}} \sim N()

If Y \sim N(E[Y], V[Y]), we can standardize it to a standard normal variable as

Z = \frac{Y-E[Y]}{\sqrt{V[Y]}} \sim N(0, 1)

We should now derive the expected value E[Y] and the variance V[Y] of Y.

Y = \hat{p_{1}}-\hat{p_{2}}

E[Y] = E[\hat{p_{1}}-\hat{p_{2}}] = E[\hat{p_{1}}] - E[\hat{p_{2}}]

V[Y] = V[\hat{p_{1}}-\hat{p_{2}}] = V[\hat{p_{1}}] + V[\hat{p_{2}}]

Since they are independent, the covariance term, which carries the negative sign, is zero.

We know that E[\hat{p_{1}}] = p_{1} and V[\hat{p_{1}}]=\frac{p_{1}(1-p_{1})}{n_{1}}. Recall Lesson 76.

When we put them together,

E[Y] = p_{1} - p_{2}

V[Y] = \frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}

and finally since Z = \frac{Y-E[Y]}{\sqrt{V[Y]}} \sim N(0, 1),

Z = \frac{\hat{p_{1}} - \hat{p_{2}} - (p_{1} - p_{2})}{\sqrt{\frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}}} \sim N(0, 1)

A few more steps and we are done. Joe and Mumble must be waiting for us.

The null hypothesis is H_{0}: p_{1}-p_{2}=0. Or, p_{1}=p_{2}.

We need the distribution under the null hypothesis — the null distribution.

Under the null hypothesis, let’s assume that p_{1}=p_{2} is p, a common value for the two population proportions.

Then, the expected value of Y, E[Y]=p_{1}-p_{2}=p-p=0, and the variance V[Y] = \frac{p(1-p)}{n_{1}} + \frac{p(1-p)}{n_{2}}

V[Y] = p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})

This shared value p for the two population proportions can be estimated by pooling the samples together into one sample of size n_{1}+n_{2} in which there are x_{1}+x_{2} total successes.

p = \frac{x_{1}+x_{2}}{n_{1}+n_{2}}

Look at this estimate carefully. Can you see that the pooled estimate p is a weighted average of the two sample proportions (\hat{p_{1}} and \hat{p_{2}})?

.
.
.
Okay, tell me what x_{1} and x_{2} are. Aren’t they n_{1}\hat{p_{1}} and n_{2}\hat{p_{2}} for the given two samples?

So p = \frac{n_{1}\hat{p_{1}}+n_{2}\hat{p_{2}}}{n_{1}+n_{2}}=\frac{n_{1}}{n_{1}+n_{2}}\hat{p_{1}}+ \frac{n_{2}}{n_{1}+n_{2}}\hat{p_{2}}

or, p = w_{1}\hat{p_{1}}+ w_{2}\hat{p_{2}}

At any rate,

E[Y]= 0

V[Y] = p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})

p=\frac{x_{1}+x_{2}}{n_{1}+n_{2}}

To summarize, when the null hypothesis is

H_{0}:p_{1}-p_{2}=0

for large sample sizes, the test-statistic z = \frac{\hat{p_{1}}-\hat{p_{2}}}{\sqrt{p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})}} \sim N(0,1)

If the alternate hypothesis H_{A} is p_{1}-p_{2}>0, we reject the null hypothesis when the p-value P(Z \ge z) is less than the rate of rejection \alpha. We can also say that when z > z_{\alpha}, we reject the null hypothesis.

If the alternate hypothesis H_{A} is p_{1}-p_{2}<0, we reject the null hypothesis when the p-value P(Z \le z) is less than the rate of rejection \alpha. Or when z < -z_{\alpha}, we reject the null hypothesis.

If the alternate hypothesis H_{A} is p_{1}-p_{2} \neq 0, we reject the null hypothesis when the p-value P(Z \le z) or P(Z \ge z) is less than the rate of rejection \frac{\alpha}{2}. Or when z < -z_{\frac{\alpha}{2}} or z > z_{\frac{\alpha}{2}}, we reject the null hypothesis.

Okay, we are done. Let’s see what Joe and Mumble have.

The rural sample X_{1} has n_{1}=190 and x_{1}=70.

The urban sample X_{2} has n_{2}=310 and x_{2}=65.

Let’s first compute the estimates for the respective proportions — p_{1} and p_{2}.

\hat{p_{1}}=\frac{x_{1}}{n_{1}}=\frac{70}{190} = 0.3684

\hat{p_{2}}=\frac{x_{2}}{n_{2}}=\frac{65}{310} = 0.2097

Then, let’s compute the pooled estimate p for the population proportions.

p = \frac{x_{1}+x_{2}}{n_{1}+n_{2}}=\frac{70+65}{190+310}=\frac{135}{500}=0.27

Next, let’s compute the test-statistic under the large-sample assumption. 190 and 310 are pretty large samples.

z = \frac{\hat{p_{1}}-\hat{p_{2}}}{\sqrt{p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

z = \frac{0.3684-0.2097}{\sqrt{0.27(0.73)*(\frac{1}{190}+\frac{1}{310})}}=3.8798

Since our alternate hypothesis H_{A} is p_{1}-p_{2}>0, we compute the p-value as,
p-value=P(Z \ge 3.8798) = 5.227119*10^{-5} \approx 0

Since the p-value (~0) is smaller than our selected rate of error (0.05), we reject the null hypothesis for the alternate view that the sentiment for increasing the speed limit is stronger among rural (p_{1}) than among urban residents (p_{2}).
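Here is a minimal sketch of this computation in R. (R’s prop.test function runs the equivalent Chi-square version of this test; without the continuity correction, its statistic is simply z^{2}.)

# Large-sample test on the difference in proportions #
  x1 = 70;  n1 = 190
  x2 = 65;  n2 = 310

  p1hat = x1/n1
  p2hat = x2/n2
  p = (x1 + x2)/(n1 + n2)          # pooled estimate of the proportion

  z = (p1hat - p2hat)/sqrt(p*(1 - p)*(1/n1 + 1/n2))
  pvalue = 1 - pnorm(z)            # P(Z >= z) for the one-sided alternative
# z = 3.8798 and pvalue = 5.2e-05, matching the hand computation
# prop.test(c(x1, x2), c(n1, n2), alternative = "greater", correct = FALSE)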

Remember that the test-statistic is computed for the null hypothesis that p_{1}-p_{2}=0. What if the null hypothesis is not that the difference in proportions is zero, but that it is equal to some other value, say p_{1}-p_{2}=0.25?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 92 – The Two-Sample Hypothesis Test – Part I

Fisher’s Exact Test

You may remember this from Lesson 38, where we derived the hypergeometric distribution from first principles.

If there are R Pepsi cans in a total of N cans (N-R Cokes) and we are asked to identify them correctly, in our choice selection of R Pepsis, we can get k = 0, 1, 2, … R Pepsis. The probability of correctly selecting k Pepsis is

P(X=k) = \frac{\binom{R}{k}\binom{N-R}{R-k}}{\binom{N}{R}}

X, the number of correct guesses (0, 1, 2, …, R) assumes a hypergeometric distribution. The control parameters of the hypergeometric distribution are N and R.

For example, if there are five cans in total, out of which three are Pepsi cans, picking exactly two Pepsi cans can be done in \binom{3}{2}*\binom{2}{1} ways. Two Pepsi cans are selected from the three in \binom{3}{2} ways; one Coke can is selected from the two Coke cans in \binom{2}{1} ways.

The overall possibilities of selecting three cans from a total of five cans are \binom{5}{3}.

Hence, P(X=2)=\frac{\binom{3}{2}*\binom{2}{1}}{\binom{5}{3}}=\frac{6}{10}


Now, suppose there are eight cans out of which four are Pepsi, and four are Coke, and we are testing John’s ability to identify Pepsi.

Since John has a better taste for Pepsi, he claims that he has a greater propensity to identify Pepsi from the hidden cans.

Of course, we don’t believe it, and we think his ability to identify Pepsi is no different than his ability to identify Coke.

Suppose his ability (probability) to identify Pepsi is p_{1} and his ability to identify Coke is p_{2}. We think p_{1}=p_{2} and John thinks p_{1} > p_{2}.

The null hypothesis that we establish is
H_{0}: p_{1} = p_{2}

John has an alternate hypothesis
H_{A}: p_{1} > p_{2}

Pepsi and Coke cans can be considered as two samples of four each.

Since there are two samples (Pepsi and Coke) and two outcomes (identifying or not identifying Pepsi), we can create a 2×2 contingency table like this.

John now identifies four cans as Pepsi out of the eight cans whose identity is hidden as in the fun experiment.

It turns out that the result of the experiment is as follows.

John correctly identified three Pepsi cans out of the four.

The probability that he will identify three correctly while sampling from a total of eight cans is

P(X=3)=\frac{\binom{4}{3}*\binom{4}{1}}{\binom{8}{4}}=\frac{\frac{4!}{1!3!}\frac{4!}{3!1!}}{\frac{8!}{4!4!}}=\frac{16}{70}=0.2286

If you recall from the prior hypothesis test lessons, you will ask for the null distribution. The null distribution is the probability distribution of observing any number of Pepsi cans while selecting from a total of eight cans (out of which four are known to be Pepsi). This will be the distribution that shows P(X=0), P(X=1), P(X=2), P(X=3), and P(X=4). Let’s compute these and present them visually.

P(X=0)=\frac{\binom{4}{0}*\binom{4}{4}}{\binom{8}{4}}=\frac{1}{70}=0.0143

P(X=1)=\frac{\binom{4}{1}*\binom{4}{3}}{\binom{8}{4}}=\frac{16}{70}=0.2286

P(X=2)=\frac{\binom{4}{2}*\binom{4}{2}}{\binom{8}{4}}=\frac{36}{70}=0.5143

P(X=3)=\frac{\binom{4}{3}*\binom{4}{1}}{\binom{8}{4}}=\frac{16}{70}=0.2286

P(X=4)=\frac{\binom{4}{4}*\binom{4}{0}}{\binom{8}{4}}=\frac{1}{70}=0.0143

In a hypergeometric null distribution with N = 8 and R = 4, what is the probability of getting a value as large as 3 or larger? If this probability is sufficiently low, we cannot attribute the result to mere chance.

This probability is the p-value. It is the probability of obtaining the computed test statistic under the null hypothesis. The smaller the p-value, the less likely the observed statistic under the null hypothesis – and stronger evidence of rejecting the null.

P(X \ge 3)=P(X=3) + P(X=4) = 0.2286+0.0143=0.2429
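In R, the dhyper function gives these probabilities in one line, and the p-value is just the upper-tail sum.

# null distribution and p-value for the Pepsi experiment #
  dhyper(0:4, m = 4, n = 4, k = 4)        # P(X = 0), ..., P(X = 4)
  sum(dhyper(3:4, m = 4, n = 4, k = 4))   # P(X >= 3) = 0.2429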

Let us select a rate of error \alpha of 10%.

Since the p-value (0.2429) is greater than our selected rate of error (0.1), we cannot reject the null hypothesis that the probability of choosing Pepsi and the probability of choosing Coke are the same.

John should have selected all four Pepsi cans for us to be able to reject the null hypothesis (H_{0}: p_{1} = p_{2}) in favor of the alternative hypothesis (H_{A}: p_{1} > p_{2}) conclusively.


The Famous Fisher Test

We just saw a variant of the famous test conducted by Ronald Fisher in 1919 when he devised an offhand test of a lady’s ability to differentiate between tea prepared in two different ways.

One afternoon, at tea-time in Rothamsted Field Station in England, a lady proclaimed that she preferred her tea with the milk poured into the cup after the tea, rather than poured into the cup before the tea. Fisher challenged the lady and presented her with eight cups of tea; four made the way she preferred, and four made the other way. She was told that there were four of each kind and asked to determine which four were prepared properly. Fisher subsequently used this experiment to illustrate the basic issues in experimentation.

sourced from Chapter 5 of “Teaching Statistics, a bag of tricks” by Andrew Gelman and Deborah Nolan

This test, now popular as Fisher’s Exact Test, is the basis for the two-sample hypothesis test to verify the difference in proportions. Just like how the proportion (p) for the one-sample test followed a binomial null distribution, the test-statistic for the two-sample test follows a hypergeometric distribution when H_{0} is true.

Here, where we know the exact number of correct Pepsi cans, the true distribution of the test-statistic (number of correct Pepsi cans) is hypergeometric. In more generalized cases where the number of successes is not known apriori, we need to make some assumptions.


Say there are two samples represented by random variables X_{1} and X_{2} with sample sizes n_{1} and n_{2}. The proportion p_{1} is based on the number of successes (x_{1}) in X_{1}, and the proportion p_{2} is based on the number of successes (x_{2}) in X_{2}. Let the total number of successes in both the samples be t=x_{1}+x_{2}.

If the null hypothesis is H_{0}: p_{1} = p_{2}, then, large values of x_{1} and small values of x_{2} support the alternate hypothesis that H_{A}: p_{1} > p_{2} when t is fixed.

In other words, for a fixed value of t=x_{1}+x_{2}, we reject H_{0}: p_{1} = p_{2}, if there are more successes in X_{1} compared to X_{2}.

So the question is: what is the probability distribution of x_{1} when the total successes are fixed at t, and we have a total of n_{1}+n_{2} samples.

When the number of successes is t, and when H_{0}: p_{1} = p_{2} is true, these successes can come from any of the two random variables with equal likelihood.

A total sample of n_{1}+n_{2} exists out of which the number of ways of choosing n_{1} samples is \binom{n_{1}+n_{2}}{n_{1}}.

A total of t successes exist, out of which the number of ways of choosing k is \binom{t}{k}.

A total of n_{1}+n_{2}-t non-successes exist, out of which the number of ways of choosing n_{1}-k is \binom{n_{1}+n_{2}-t}{n_{1}-k}.

When we put them together, we can derive the probability P(X=k) for the hypergeometric distribution when H_{0} is true.

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}

Conditional on a total number of t successes that can come from any of the two random variables, the number of successes X=k in the first sample has a hypergeometric distribution when H_{0} is true.
The p-value can thus be derived.

We begin diving into the two-sample tests. Fisher’s Exact Test and its generalization (with assumptions) for the two-sample hypothesis test on the proportion is the starting point. It is a direct extension of the one-sample hypothesis test on proportion — albeit with some assumptions. Assumptions are crucial for the two-sample hypothesis tests. As we study the difference in the parameters of the two populations, we will delve into them more closely.
STAY TUNED!

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 91 – The One-Sample Hypothesis Tests in R

The last time we did something in R is in Lesson 83. Since then, we have been learning the hypothesis testing framework. In Lesson 84, we first made friends with the hypothesis tests — the concept of needing evidence beyond a reasonable doubt. In Lesson 85, we learned the framework — the elements of hypothesis testing that provide us with a systematic way of setting up the test and using data to verify the hypothesis.

In Lessons 86, 87, and 88, we learned the one-sample hypothesis test on proportion, mean, and the variance, respectively. Then, in Lesson 89, we learned the one-sample sign test as a complement to the one-sample hypothesis test on the mean, which can account for outliers as well. Last week (Lesson 90), we dived into the bootstrap-based method where we relaxed the assumptions on the null distribution. The bootstrap-based process provides us the flexibility to run the test on various statistics of interest.

Today, let’s pause and look at how to perform these hypothesis tests in R. Brace yourself for an R drive.

To help us with this lesson today, we will make use of the eight data points on 2003 Ford Focus vehicle mileage that Joe used in Lesson 89.

The Basic Steps

Step 1: Get the data

You can get the data from here.

Step 2: Create a new folder on your computer

Let us call this folder “lesson91”. 

Step 3: Create a new code in R

Create a new code for this lesson. “File >> New >> R script”. Save the code in the same folder “lesson91” using the “save” button or by using “Ctrl+S”. Use .R as the extension — “lesson91_code.R”

Step 4: Choose your working directory

“lesson91” is the folder where we stored the code. Use “setwd(“path”)” to set the path to this folder. Execute the line by clicking the “Run” button on the top right.

setwd("path to your folder")

Step 5: Read the data into R workspace

Since the sample data is small enough, instead of saving the data in a text file and reading it into R workspace, we can directly input the data into the workspace as follows:

## Data Entry ##
# FORD MPG - reported # 
  x = c(21.7, 29, 28.1, 31.5, 24, 21.5, 28.7, 29)

The Baseline for Null Hypotheses

Let’s run the hypothesis tests on the mean, standard deviation, and the proportion. For this, we have to make some modest assumptions.

For the mean, we will use the EPA specified rating of 26 MPG. \mu = 26 MPG.

For comparing the standard deviation, we are unsure of the baseline standard deviation from the EPA rating. However, they provide us with a range that is between 22 MPG and 32 MPG. If we can assume that the range is covered by six standard deviations, i.e., 32 - 22 = 6 \sigma, we can get an estimate for baseline \sigma. In this case, \sigma = 1.67 MPG.

Further, let’s also assume that one in eight cars will usually have a mileage less than the minimum rating of 22 MPG. p = \frac{1}{8}

# Compare to Baseline
  mu = 26

  rangex = c(22,32)
  sigma = diff(rangex)/6   # (32 - 22)/6 = 1.67 MPG

  n = length(x)
  p = 1/8                  # one in eight cars below the minimum rating

Hypothesis Test on the Mean

For the hypothesis test on the mean, we learned three methods, the t-test, the sign-test, and the bootstrap-based test.

Let’s start with the t-test. For the t-test, the null and the alternate hypothesis are:

H_{0}: \mu \ge 26 MPG

H_{A}: \mu < 26 MPG

The null distribution is a T-distribution with (n-1) degrees of freedom. The test statistic is t_{0}=\frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}. Since the alternative is one-sided (less than), we have to verify how likely it is to see a value as small as t_{0} in the null distribution.

Let’s compute the test statistic.

#T-Test #
   xbar = mean(x)

   s = sd(x)

   to = (xbar-mu)/(s/sqrt(n))
> to
   [1] 0.5173082

The test statistic is 0.517.

For a selected rejection rate of \alpha, we have to compute the p-value corresponding to this test statistic, or the critical value on the T-distribution.

   alpha = 0.05
   
   pvalue_ttest = pt(to,df=(n-1))   
   
   tcritical = qt(alpha,df=(n-1))
   

When you execute these lines, you will see that the p-value is 0.69, and the critical value from the t-distribution is -1.89. The “pt” function computes P(T \le t_{0}) from a T-distribution with user-specified degrees of freedom. For our case, df = 7. The “qt” function computes the quantile corresponding to a probability of 5%, i.e., \alpha=0.05, from a T-distribution with user-specified degrees of freedom.

> pvalue_ttest
   [1] 0.6895594

> tcritical
   [1] -1.894579 

Since the p-value is greater than 0.05, or since t_{0}>t_{critical}, we cannot reject the null hypothesis.

In R, there is a built-in function to run this test: “t.test”.

We could have used the function directly on the sample data x.

# using t-Test function in R #
   t.test(x,alternative = "less", mu = mu, conf.level = 0.95)
One Sample t-test
 data:  x
 t = 0.51731, df = 7, p-value = 0.6896
 alternative hypothesis: true mean is less than 26
 95 percent confidence interval:
      -Inf 29.20539
 sample estimates:
 mean of x 
   26.6875 

It provides t_{0} and the p-value as the outputs from which we can decide on the null hypothesis.


Next, let’s look at how to conduct the sign-test in R. The null and the alternative hypothesis for the sign-test are

H_{0}: P(X > 26) = 0.5

H_{A}: P(X > 26) < 0.5

If H_{0} is true, about half of the values of sample X will be greater than 26 MPG, and about half will be below it. If H_{A} is true, more than half of the values of sample X will be less than 26 MPG, i.e., the sample shows mileages significantly less than 26 MPG.

To run this test, we compute s^{+}, the test statistic that counts the number of values exceeding 26 MPG.

#Sign-Test
 splus = length(which(x>mu))
 > splus
   [1] 5 

s^{+} is 5.

Under the null hypothesis, s^{+} follows a binomial distribution with a probability of 0.5. Using this assumption, we compute the p-value using the “binom.test” function in R.

# Using Binom Test function in R 
   binom.test(splus,n,p=0.5,alternative = "less")
Exact binomial test
 data:  splus and n
 number of successes = 5, number of trials = 8, p-value = 0.8555
 alternative hypothesis: true probability of success is less than 0.5
 95 percent confidence interval:
  0.0000000 0.8888873
 sample estimates:
 probability of success 
                  0.625 

The test provides the p-value, i.e., P(S^{+} \le 5) from a binomial distribution with n = 8 and p = 0.5.

P(S^{+} \le 5)=\sum_{k=0}^{k=5}{8 \choose k} p^{k}(1-p)^{8-k}=0.8555
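
As a quick check, we can get the same probability directly from the cumulative binomial function “pbinom”.

# verify the sign-test p-value #
 pbinom(splus, n, 0.5)   # P(S+ <= 5) with n = 8 and p = 0.5; returns 0.8555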

Since the p-value is greater than 0.05, the rejection rate, we cannot reject the null hypothesis.


Lastly, we will check out the bootstrap-based test on the mean. For the bootstrap-based one-sided test, the null hypothesis and the alternate hypothesis are

H_{0}: P(\bar{x} > 26) = 0.5

H_{A}: P(\bar{x} > 26) < 0.5

Use the following lines to set up and run the test.

#Bootstrap Test
 N = 10000

 xmean_null = matrix(NA,nrow=N,ncol=1)
  for (i in 1:N)
   {
     xboot = sample(x,n,replace=T)   # one bootstrap replicate of the data
     xmean_null[i,1] = mean(xboot)   # bootstrap-replicated mean
   }
 
hist(xmean_null,col="pink",xlab="Mean of the Distribution",font=2,font.lab=2,main="Null Distribution Assuming Ho is True")

points(mu,10,pch=15,cex=2,col="blue")

abline(v=c(quantile(xmean_null,0.95)),col="black",lwd=2,lty=2)

pvalue_bootstrap = length(which(xmean_null>mu))/N
> pvalue_bootstrap
   [1] 0.7159 

In the loop, we are executing the “sample” function to draw a bootstrap-replicate from the original data. Then, from this bootstrap-replicate, we compute the mean. Repeating this sampling and computing the bootstrap-replicated mean many times forms the null distribution. From this null distribution, we calculate the p-value, i.e., the proportion of the null distribution that exceeds the baseline of 26 MPG.

Since the p-value (0.716) is greater than a 5% rejection rate, we cannot reject the null hypothesis.

We can also observe that the code provides a way to visualize the null distribution and the basis value of 26 MPG. If the basis value \mu=26 were so far out on the null distribution of \bar{x} that less than 5% of the bootstrap-replicated test statistics were greater than \mu, we would have rejected the null hypothesis.

Hypothesis Test on the Variance

For the hypothesis test on the variance, we learned two methods: the Chi-square test and the bootstrap-based test.

Let’s first look at the Chi-square test. The null and alternate hypotheses are as follows:

H_{0}: \sigma^{2}=1.67^{2}

H_{A}: \sigma^{2} \neq 1.67^{2}

The alternate hypothesis is two-sided. A deviation in either direction (less than or greater than) will lead us to reject the null hypothesis.

If you can recall from Lesson 88, \frac{(n-1)s^{2}}{\sigma^{2}} is our test statistic, \chi^{2}_{0}, which we will verify against a Chi-square distribution with (n-1) degrees of freedom.

# Chi-square Test
chi0 = ((n-1)*s^2)/sigma^2

pvalue_chi0 = pchisq(chi0,df=(n-1))

chilimit_right = qchisq(0.975,df=(n-1))

chilimit_left = qchisq(0.025,df=(n-1))

The “pchisq” function computes P(\chi^{2} \le \chi^{2}_{0}) from a Chi-square distribution with user-specified degrees of freedom, df = 7. The “qchisq” function computes the quantile corresponding to a specified rate of rejection. Since we are conducting a two-sided test, we calculate the limiting value on the left tail and the right tail.

 > pvalue_chi0
   [1] 0.9999914 

The p-value is 0.9999. Since it is greater than 0.975 (1-\frac{\alpha}{2}), we reject the null hypothesis based on the two-sided test.

> chi0
   [1] 35.60715

> chilimit_left
   [1] 1.689869

> chilimit_right
   [1] 16.01276 

The lower and upper bounds from the null distribution are 1.69 and 16.01, respectively, whereas the test statistic \chi^{2}_{0} is 35.61, well beyond the acceptable upper bound. Hence, we reject the null hypothesis.


Now, let’s look at how to run a bootstrap-based test for the standard deviation. For the bootstrap-based two-sided test, the null hypothesis and the alternate hypothesis are

H_{0}: P(\sigma > 1.67) = 0.5

H_{A}: P(\sigma > 1.67) \neq 0.5

Use the following lines to set up and run the test.

# Bootstrap Test for Standard Deviation
 N = 10000
 
xsd_null = matrix(NA,nrow=N,ncol=1)
 for (i in 1:N)
   {
     xboot = sample(x,n,replace=T)
     xsd_null[i,1] = sd(xboot)
   }

hist(xsd_null,col="pink",xlab="Standard Deviation of the Distribution",font=2,font.lab=2,main="Null Distribution Assuming Ho is True")

points(sigma,10,pch=15,cex=2,col="blue")
   abline(v=c(quantile(xsd_null,0.025),quantile(xsd_null,0.975)),col="black",lwd=2,lty=2)

pvalue_bootstrap = length(which(xsd_null>sigma))/N
> pvalue_bootstrap
   [1] 0.9747 

As before, in the loop, we are executing the “sample” function to draw a bootstrap-replicate from the original data. Then, from this bootstrap-replicate, we compute the standard deviation. Repeating this sampling and computing the bootstrap-replicated standard deviation many times forms the null distribution for \sigma. From this null distribution, we calculate the p-value, i.e., the proportion of the null distribution that exceeds the baseline of 1.67 MPG.

Since the p-value (0.9747) is at the upper rejection limit of 0.975 (1-\frac{\alpha}{2}), we reject the null hypothesis.

The code provides a way to visualize the null distribution and the basis value of 1.67 MPG. The basis value is far left on the null distribution. Almost 97.5% of the bootstrap-replicated test statistics are greater than \sigma = 1.67 — we reject the null hypothesis.

Hypothesis Test on the Proportion

For the hypothesis test on the proportion, we can employ the binomial distribution (parametric) as the null distribution, or use the bootstrap-based method (non-parametric) to generate the null distribution. For the parametric approach, the null and the alternative hypothesis are

H_{0}: p \le \frac{1}{8}

H_{A}: p > \frac{1}{8}

We are considering a one-sided alternative since the departure in one direction (greater than) is sufficient to reject the null hypothesis.

From a sample of eight cars, the null distribution is the probability distribution of observing any number of cars having a mileage of less than 22 MPG. In other words, out of the eight cars, 0, 1, 2, …, or all 8 could have a mileage less than 22 MPG, the lower bound specified by the EPA.

The null distribution assigns a probability to each of these outcomes: in a Binomial distribution with n = 8 and p = 1/8, what is the probability of getting 0, 1, 2, …, 8 such cars? If we observe more than an acceptable number (as judged from the null distribution), we reject the null hypothesis.

We can generate and plot that null distribution in R using the following lines.

## Hypothesis Test on the Proportion
 
ncars = length(which(x<22))

plot(0:n,dbinom(0:n,n,prob=p),type="o",xlab="Number of cars with MPG less than the range",ylab="P(X=x)",font=2,font.lab=2)

x_at_alpha = qbinom(0.95,n,p) # approx quantile for 5% rejection 

cord.x <- c(x_at_alpha,seq(x_at_alpha,n),x_at_alpha) 

cord.y <- c(0, dbinom(x_at_alpha:n,n,prob=p), 0) 

polygon(cord.x,cord.y,col="pink")

points(ncars,0.002,pch=15,col="blue",cex=2)

pval = sum(dbinom(ncars:n,n,prob=p))

In the first line, we are computing the number of cars that have a mileage of less than 22 MPG, which is the test statistic. We observe this to be two cars.

> ncars
   [1] 2 

In the next few lines, we are plotting the null distribution as computed from a binomial distribution with n = 8 and p = 1/8, and visually showing the region of rejection using the “polygon” function.

Then, we compute the p-value as the probability of observing two or more cars having a mileage less than 22 MPG.

P(Y \ge 2) = \sum_{k=2}^{k=8} {8 \choose k} p^{k}(1-p)^{8-k}=0.2636

> pval
   [1] 0.2636952 
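
Equivalently, we can get this upper-tail probability from the cumulative binomial function “pbinom”.

# equivalent upper-tail computation #
 pbinom(ncars-1, n, p, lower.tail = FALSE) # P(Y >= 2); returns 0.2637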

Since the p-value is greater than 0.05, we cannot reject the null hypothesis (based on the limited sample) that one in eight cars (2003 Ford Focus) will have a mileage of less than 22 MPG.


Let’s wrap up the day by running the bootstrap test on the proportion. For the bootstrap-based hypothesis test on the proportion, the null and the alternate hypothesis are

H_{0}: P(p > \frac{1}{8}) = 0.5

H_{A}: P(p > \frac{1}{8}) > 0.5

If we get a large proportion of the null distribution to be greater than 1/8, we will reject the null hypothesis.

Use the following lines to execute the test.

# Bootstrap Test on Proportion 
 N = 10000

xprop_null = matrix(NA,nrow=N,ncol=1)
 for (i in 1:N)
   {
     xboot = sample(x,n,replace=T)
     xprop_null[i,1] = length(which(xboot<22))/n
   }

hist(xprop_null,col="pink",xlab="Proportion of the Distribution",font=2,font.lab=2,main="Null Distribution Assuming Ho is True")

points(p,10,pch=15,cex=2,col="blue")

abline(v=c(quantile(xprop_null,0.95)),col="black",lwd=2,lty=2)

pval=length(which(xprop_null>p))/N
> pval
   [1] 0.631 

We are executing the “sample” function N = 10,000 times to draw bootstrap-replicates from the original data. Then, for each bootstrap-replicate, we compute the proportion of values less than 22 MPG. The 10,000 bootstrap-replicated proportions form the null distribution for p. From this null distribution, we calculate the p-value, i.e., the proportion of the null distribution that exceeds the baseline of \frac{1}{8}.

Since the p-value (0.631) is less than 0.95 (the 1-\alpha rejection limit), we cannot reject the null hypothesis. 63% of the bootstrap-replicated proportions are greater than the baseline. If more than 95% of them were greater than the baseline, we would have rejected the null hypothesis. That scenario would have unfolded if the original sample had several cars with a mileage of less than 22 MPG; the bootstrap replicates would reflect them, the proportion would exceed 1/8 in many replicates, and the p-value would exceed 0.95.

Take the next two weeks to digest all the one-sample hypothesis tests, including how to execute the tests in R. In two weeks, we will move on to the two-sample hypothesis tests.

Here is the full code for today’s lesson.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 90 – The One-Sample Hypothesis Tests Using the Bootstrap

Hypothesis Tests – Part V

H_{0}: P(\theta > \theta^{*}) = 0.5

H_{A}: P(\theta > \theta^{*}) > 0.5

H_{A}: P(\theta > \theta^{*}) < 0.5

H_{A}: P(\theta > \theta^{*}) \neq 0.5

Jenny and Joe meet after 18 lessons.

I heard you are neck-deep into the hypothesis testing concepts.

Yes. And I am having fun learning about how to test various hypotheses, be it on the mean, on the standard deviation, or the proportion. It is also enlightening to learn how to approximate the null distribution using the limiting distribution concepts.

True. You have seen in lesson 86 — hypothesis tests on the proportion, that the null distribution is a Binomial distribution with n, the sample size, and p, the proportion being tested.

You have seen in lesson 87 — hypothesis tests on the mean, that the null distribution is a T-distribution because \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \sim t_{df=n-1}

You have seen in lesson 88 — hypothesis tests on the variance, that the null distribution is a Chi-square distribution because \frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{df=n-1}

Did you ever wonder what if the test statistic is more complicated mathematically than the mean or the variance or the proportion and if its limiting distribution or the null distribution is hard to derive?

Or, did you ever ask what if the assumptions that go into deriving the null distribution are not met or not fully satisfied?

Can you give an example?

Suppose you want to test the hypothesis on the median or the interquartile range or the skewness of a distribution?

Or, if you are unsure about the distributional nature of the sample data? For instance, the assumption that \frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{df=n-1} is based on the premise that the sample data is normally distributed.

😕 There are non-parametric or distribution-free approaches? I remember Devine mentioning a bootstrap approach.

😎 In lesson 79, we learned about the concept of the bootstrap. Using the idea of the bootstrap, we can generate replicates of the original sample to approximate the probability distribution function of the population. Assuming that each data value in the sample is equally likely (with a probability of 1/n), we can randomly draw n values with replacement. By putting a probability of 1/n on each data point, we use the discrete empirical distribution \hat{f} as an approximation of the population distribution f.

Hmm. Since each bootstrap replicate is a possible representation of the population, we can compute the test statistics from this bootstrap sample. And, by repeating this, we can have many simulated values of the test statistics to create the null distribution against which we can test the hypothesis.

Exactly. No need to make any assumption on the distributional nature of the data at hand, or the kind of the limiting distribution for the test statistic. We can compute any test statistic from the bootstrap replicates and test the basis value using this simulated null distribution. You want to test the hypothesis on the median, go for it. On the skewness or geometric mean, no problem.

This sounds like an exciting approach that will free up the limitations. Why don’t we do it step by step and elaborate on the details? Our readers will appreciate it.

Absolutely. Do you want to use your professor’s hypothesis that the standard deviation of his class’s performance is 16.5 points, as a case in point?

Sure. In a recent conversation, he also revealed that the mean and median scores are 54 and 55 points, respectively and that 20% of his class usually get a score of less than 40.

Aha. We can test all four hypotheses then. Let’s take the sample data.

60, 41, 70, 61, 69, 95, 33, 34, 82, 82

Yes, this is a sample of ten exam scores from our most recent test with him.

Let’s first review the concept of the bootstrap. We have the following data.

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these ten data points is 1/10, we can randomly draw ten numbers from these ten values — with replacement.

Yes, I can recall from lesson 79 that this is like playing the game of Bingo where the chips are these ten numbers. Each time we get a number, we put it back and roll it again until we draw ten numbers.

Yes. For real computations, we use a computer program that has this algorithm coded in. We draw a random number from a uniform distribution (f(u)) where u is between 0 and 1. These randomly drawn u's are mapped onto the ranked data to draw a specific value from the set. For example, in a set of ten values, for a randomly drawn u of 0.1, we can draw the first value in order — 33.
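
Here is a minimal sketch of that mapping in R, using the ten exam scores from above. The explicit uniform draws are only for illustration; the “sample” function does the same thing internally.

# illustrating one bootstrap draw #
 x = c(60, 41, 70, 61, 69, 95, 33, 34, 82, 82)
 n = length(x)

 u = runif(n)                  # n uniform random numbers between 0 and 1
 xboot = sort(x)[ceiling(u*n)] # map each u onto the ranked data

# in practice, one line does it all
 xboot = sample(x, n, replace = T)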

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (60, 41, 70, 61, 69, 95, 33, 34, 82, 82), some may appear more than one time, and some may not show up at all in a random sample.

Let me create a bootstrap replicate.

See, 70 appeared two times, 82 appeared three times, and 33 did not get selected at all.

Such bootstrap replicates are representations of the empirical distribution \hat{f}. The empirical distribution \hat{f} is the proportion of times each value in the data sample x_{1}, x_{2}, x_{3}, …, x_{n} occurs. If we assume that the data sample has been generated by randomly sampling from the true distribution, then, the empirical distribution (i.e., the observed frequency) \hat{f} is a sufficient statistic for the true distribution f.

In other words, all the information contained in the true distribution can be generated by creating \hat{f}, the empirical distribution.

Yes. Since an unknown population distribution f has produced the observed data x_{1}, x_{2}, x_{3}, …, x_{n}, we can use the observed data to approximate f by its empirical distribution \hat{f} and then use \hat{f} to generate bootstrap replicates of the data.

How do we implement the hypothesis test then?

Using the same hypothesis testing framework. We first establish the null and the alternative hypothesis.

H_{0}: P(\theta > \theta^{*}) = 0.5

\theta is the test statistic computed from the bootstrap replicate and \theta^{*} is the basis value that we are testing. For example, a standard deviation of 16.5 is \theta^{*} and standard deviation computed from one bootstrap sample is \theta.

The alternate hypothesis is then,

H_{A}: P(\theta > \theta^{*}) > 0.5

or

H_{A}: P(\theta > \theta^{*}) < 0.5

or

H_{A}: P(\theta > \theta^{*}) \neq 0.5

Essentially, for each bootstrap replicate i, we check whether \theta_{i} > \theta^{*}. If yes, we register S_{i}=1. If not, we register S_{i}=0.

Now, we can repeat this process, i.e., creating a bootstrap replicate, computing the test statistic, and registering S_{i} \in \{0,1\} depending on whether \theta_{i} > \theta^{*}, a large number of times, say N = 10,000. The proportion of times S_{i} = 1 in the set of N bootstrap-replicated test statistics is the p-value.
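
Here is a minimal sketch of this counting procedure in R, using the exam scores and taking the standard deviation as the test statistic with \theta^{*} = 16.5; the variable names are mine.

# sketch: bootstrap-replicated test statistics and the p-value #
 x = c(60, 41, 70, 61, 69, 95, 33, 34, 82, 82)
 n = length(x)
 N = 10000

 S = rep(0, N)
 for (i in 1:N)
   {
     xboot = sample(x, n, replace = T) # one bootstrap replicate
     S[i] = (sd(xboot) > 16.5)         # S_i = 1 if theta_i > theta*
   }

 pvalue = sum(S)/N # proportion of times S_i = 1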

And we can apply the rule of rejection if the p-value < \alpha, the selected rate of rejection.

Correct. That is for a one-sided hypothesis test. If it is a two-sided hypothesis test, we use the rule \frac{\alpha}{2} \le p-value \le 1-\frac{\alpha}{2} for non-rejection, i.e., we cannot reject the null hypothesis if the p-value is between \frac{\alpha}{2} and 1-\frac{\alpha}{2}.

Great! For the first bootstrap sample, if we were to verify the four hypotheses, we register the following.

Since the bootstrap sample mean \bar{x}_{boot}=67.8 is greater than the basis of 54, we register S_{i}=1.

Since the bootstrap sample median \tilde{x}_{boot}=69.5 is greater than the basis of 55, we register S_{i}=1.

Since the bootstrap sample standard deviation \sigma_{boot}=14.46 is less than the basis of 16.5, we register S_{i}=0.

Finally, since the bootstrap sample proportion p_{boot}=0.1 is less than the basis of 0.2, we register S_{i}=0.

We do this for a large number of bootstrap samples. Here is an illustration of the test statistics for three bootstrap replicates.

Let me run the hypothesis test on the mean using N = 10,000. I am creating 10,000 bootstrap-replicated test statistics.

The distribution of the test statistics is the null distribution of the mean. Notice that it resembles a normal distribution. The basis value of 54 is shown using a blue square on the distribution. From the null distribution, the proportion of times S_{i}=1 is 0.91. 91% of the \bar{x} test statistics are greater than 54.

Our null hypothesis is
H_{0}: P(\bar{x} > 54) = 0.5

Our alternate hypothesis is one-sided. H_{A}: P(\bar{x} > 54) < 0.5

Since the p-value is greater than a 5% rejection rate, we cannot reject the null hypothesis.

If the basis value \mu were so far out on the null distribution of \bar{x} that less than 5% of the bootstrap-replicated test statistics were greater than \mu, we would have rejected the null hypothesis.

Shall we run the hypothesis test on the median?

H_{0}: P(\tilde{x} > 55) = 0.5

H_{A}: P(\tilde{x} > 55) < 0.5

Again, a one-sided test.

Sure. Here is the answer.

We can see the null distribution of the test statistic (median from the bootstrap samples) along with the basis value of 55.

86% of the test statistics are greater than this basis value. Hence, we cannot reject the null hypothesis.

The null distribution of the test statistic does not resemble any known distribution.

Yes. Since the bootstrap-based hypothesis test is distribution-free (non-parametric), not knowing the nature of the limiting distribution of the test statistic (median) does not restrain us.

Awesome. Let me also run the test for the standard deviation.

H_{0}: P(\sigma > 16.5) = 0.5

H_{A}: P(\sigma > 16.5) \ne 0.5

I am taking a two-sided test since a deviation in either direction, i.e., too small a standard deviation or too large of a standard deviation will disprove the hypothesis.

Here is the result.

The p-value is 0.85, i.e., 85% of the bootstrap-replicated test statistics are greater than 16.5. Since the p-value is greater than the acceptable rate of rejection, we cannot reject the null hypothesis.

If the p-value were less than 0.025 or greater than 0.975, then we would have rejected the null hypothesis.

For a p-value of 0.025, 97.5% of the bootstrap-replicated standard deviations will be less than 16.5 — strong evidence that the null distribution produces values much less than 16.5. For a p-value of 0.975, 97.5% of the bootstrap-replicated standard deviations will be greater than 16.5 — strong evidence that the null distribution produces values much greater than 16.5. In either of the sides, we reject the null hypothesis that the standard deviation is 16.5.
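
Continuing the earlier sketch (which used the standard deviation as the test statistic), this two-sided decision reduces to a one-line check on the bootstrap p-value.

# two-sided decision rule at alpha = 0.05 #
 alpha = 0.05
 reject = (pvalue < alpha/2) | (pvalue > 1 - alpha/2) # FALSE when pvalue = 0.85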

Let me complete the hypothesis test on the proportion.

H_{0}: P(p > 0.2) = 0.5

H_{A}: P(p > 0.2) \ne 0.5

Let’s take a two-sided test since deviation in either direction can disprove the null hypothesis. If we get a tiny proportion or a very high proportion compared to 0.2, we will reject the belief that the percentage of students obtaining a score of less than 40 is 0.2.

Here are the null distribution and the result from the test.

The p-value is 0.32. 3200 out of the 10000 bootstrap-replicated proportions are greater than 0.2. Since it is between 0.025 and 0.975, we cannot reject the null hypothesis.

You can see how widely the bootstrap concept can be applied for hypothesis testing and what flexibility it provides.

To summarize:

Repeatedly sample with replacement from the original sample data. 
Each time, draw a sample of size n.
Compute the desired statistic from each bootstrap sample.
(mean, median, standard deviation, interquartile range,
proportion, skewness, etc.)
Null hypothesis P(\theta > \theta^{*}) = 0.5 can now be tested as follows: 
S_{i}=1 if \theta_{i} > \theta^{*}, else, S_{i}=0

p-value = \frac{1}{N}\sum_{i=1}^{i=N} S_{i}
(average over all N bootstrap-replicated test statistics)

If p-value < \frac{\alpha}{2} or p-value > 1-\frac{\alpha}{2}, reject the null hypothesis
(for a two-sided hypothesis test at a selected rejection rate of \alpha)

If p-value < \alpha, reject the null hypothesis
(for a left-sided hypothesis test at a rejection rate of \alpha)

If p-value > 1 - \alpha, reject the null hypothesis
(for a right-sided hypothesis test at a rejection rate of \alpha)

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
