Lesson 99 – The Two-Sample Hypothesis Tests in R

Over the past seven lessons, we equipped ourselves with the necessary theory of the two-sample hypothesis tests.

Lessons 92 and 93 were about the hypothesis test on the difference in proportions. In Lesson 92, we learned Fisher’s Exact Test to verify the difference in proportions. In Lesson 93, we did the same using the normal distribution as a limiting distribution when the sample sizes are large.

Lessons 94, 95, and 96 were about the hypothesis test on the difference in means. In Lesson 94, we learned that under the proposition that the population variances of the two random variables are equal, the test-statistic, which uses the pooled variance, follows a T-distribution with n_{1}+n_{2}-2 degrees of freedom. In Lesson 95, we learned that Welch’s t-Test could be used when we cannot assume the population variances are equal. In Lesson 96, we learned Wilcoxon’s Rank-sum Test, which uses a ranking approach to assess the significance of the difference in means. This method is data-driven, needs no assumptions on the limiting distribution, and works well for small sample sizes.

Lesson 97 was about the hypothesis test on the equality of variances. Here we got a sneak peek into a new distribution called the F-distribution, which is the limiting distribution of the ratio of two Chi-square random variables, each divided by its respective degrees of freedom. The test-statistic is the ratio of the sample variances, which we can verify against an F-distribution with n_{1}-1 numerator degrees of freedom and n_{2}-1 denominator degrees of freedom.
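As a quick sanity check on this definition, we can simulate the ratio in R. The degrees of freedom below are illustrative choices, not values from any dataset in this lesson.

```r
# Simulate the ratio of two Chi-square random variables, each divided
# by its degrees of freedom, and compare it to the F-distribution
set.seed(1)
d1 = 5   # illustrative numerator degrees of freedom
d2 = 10  # illustrative denominator degrees of freedom

ratio = (rchisq(10000, df = d1) / d1) / (rchisq(10000, df = d2) / d2)

# The empirical 95th percentile should be close to the theoretical one
print(quantile(ratio, 0.95))
print(qf(0.95, df1 = d1, df2 = d2))
```

The two printed values should agree closely, and the agreement improves as the number of simulated ratios grows.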

All these hypothesis tests can also be done using the bootstrap approach, which we learned in Lesson 98. It uses the data at hand to generate the null distribution of any desired statistic as long as it is computable from the data. It is very flexible. There is no need to make any assumptions on the data’s distributional nature or the limiting distribution for the test-statistic.

It is now time to put all of this learning to practice.


The City of New York Department of Sanitation (DSNY) archives data on the monthly tonnage of trash collected from NYC residences and institutions.

To help us with today’s workout, I dug up our gold mine, the “Open Data for All New Yorkers” page, and found an interesting dataset on the monthly tonnage of trash collected from NYC residences and institutions.

Here is a preview.

The data includes details on when (year and month) DSNY collected the trash, from where (borough and community district), and how much (tonnage of refuse, source-separated recyclable paper, and source-separated metal, glass, plastic, and beverage cartons). They also have other seasonal data such as the tonnage of leaves collected in November and December and tons of Christmas trees collected from curbside in January.

For our workout today, we can use this file named “DSNY_Monthly_Tonnage_Data.csv.” It is a slightly modified version of the file you will find from the open data page.

My interest is in Manhattan’s community district 9 — Morningside Heights, Manhattanville, and Hamilton Heights.

I want to compare the tonnage of trash collected in winter to that in summer. Let’s say we compare the data in Manhattan’s community district 9 for February and August to keep it simple.

Are you ready?

The Basic Steps

Step 1: Get the data

You can get the data from here. The file is named DSNY_Monthly_Tonnage_Data.csv. It is a comma-separated values file which we can conveniently read into the R workspace. 

Step 2: Create a new folder on your computer

And call this folder “lesson99.”
Make sure that the data file “DSNY_Monthly_Tonnage_Data.csv” is in this folder.

Step 3: Create a new code in R

Create a new code file for this lesson: “File >> New >> R script”. Save the code in the same folder “lesson99” using the “Save” button or by using “Ctrl+S.” Use .R as the extension: “lesson99_code.R”.

Step 4: Choose your working directory

“lesson99” is the folder where we stored the code and the input data file. Use setwd("path") to set the path to this folder. Execute the line by clicking the “Run” button on the top right. 

setwd("path to your folder")

Or, if you opened this “lesson99_code.R” file from within your “lesson99” folder, you can also use the following line to set your working directory.

setwd(getwd())

getwd() returns the current working directory. If you opened the file from within your “lesson99” folder, getwd() will already point to the working directory you want to set.

Step 5: Read the data into the R workspace

Execute the following command to read your input .csv file.

# Read the data file
nyc_trash_data = read.csv("DSNY_Monthly_Tonnage_Data.csv",header=T)

Since the input file has a header for each column, we set header=T in the command when we read the file so that the first row is treated as column names rather than data.

Your RStudio interface will look like this when you execute this line and open the file from the Global Environment.

Step 6: Extract a subset of the data

As I said before, for today’s workout, let’s use the data from Manhattan’s community district 9 for February and August. Execute the following lines to extract this subset of the data.

# Extract February refuse data for Manhattan's community district 9 
feb_index = which((nyc_trash_data$BOROUGH=="Manhattan") & (nyc_trash_data$COMMUNITYDISTRICT==9) & (nyc_trash_data$MONTH==2))

feb_refuse_data = nyc_trash_data$REFUSETONSCOLLECTED[feb_index]

# Extract August refuse data for Manhattan's community district 9
aug_index = which((nyc_trash_data$BOROUGH=="Manhattan") & (nyc_trash_data$COMMUNITYDISTRICT==9) & (nyc_trash_data$MONTH==8))

aug_refuse_data = nyc_trash_data$REFUSETONSCOLLECTED[aug_index]

In the first line, we look up the row indices of the data corresponding to Borough = Manhattan, Community District = 9, and Month = 2.

In the second line, we extract the data on refuse tons collected for these rows.

The next two lines repeat this process for August data.

Sample 1, February data for Manhattan’s community district 9 looks like this:
2953.1, 2065.8, 2668.2, 2955.4, 2799.4, 2346.7, 2359.6, 2189.4, 2766.1, 3050.7, 2175.1, 2104.0, 2853.4, 3293.2, 2579.1, 1979.6, 2749.0, 2871.9, 2612.5, 455.9, 1951.8, 2559.7, 2753.8, 2691.0, 2953.1, 3204.7, 2211.6, 2857.9, 2872.2, 1956.4, 1991.3

Sample 2, August data for Manhattan’s community district 9 looks like this:
2419.1, 2343.3, 3032.5, 3056.0, 2800.7, 2699.9, 3322.3, 3674.0, 3112.2, 3345.1, 3194.0, 2709.5, 3662.6, 3282.9, 2577.9, 3179.1, 2460.9, 2377.1, 3077.6, 3332.8, 2233.5, 2722.9, 3087.8, 3353.2, 2992.9, 3212.3, 2590.1, 2978.8, 2486.8, 2348.2

The values are in tons per month, and there are 31 values for sample 1 and 30 values for sample 2. Hence, n_{1}=31 and n_{2}=30.

The Questions for Hypotheses

Let’s ask the following questions.

  1. Is the mean of the February refuse collected from Manhattan’s community district 9 different from August’s?
  2. Is the variance of the February refuse collected from Manhattan’s community district 9 different from August’s?
  3. Suppose we set 2500 tons as a benchmark for low trash situations. Is the proportion of February refuse below 2500 tons different from that of August? In other words, is the low trash situation in February different from that in August?

We can visualize their distributions. Execute the following lines to create a neat graphic of the boxplots of sample 1 and sample 2.

# Visualizing the distributions of the two samples
boxplot(cbind(feb_refuse_data,aug_refuse_data),horizontal=T, main="Refuse from Manhattan's Community District 9")
text(1500,1,"February Tonnage",font=2)
text(1500,2,"August Tonnage",font=2)

p_threshold = 2500 # tons of refuse
abline(v=p_threshold,lty=2,col="maroon")

You should now see your plot space changing to this.

Next, you can execute the following lines to compute the preliminaries required for our tests.

# Preliminaries
  # sample sizes
  n1 = length(feb_refuse_data)
  n2 = length(aug_refuse_data)
  
  # sample means
  x1bar = mean(feb_refuse_data)
  x2bar = mean(aug_refuse_data)

  # sample variances
  x1var = var(feb_refuse_data)
  x2var = var(aug_refuse_data)

  # sample proportions
  p1 = length(which(feb_refuse_data < p_threshold))/n1
  p2 = length(which(aug_refuse_data < p_threshold))/n2

Hypothesis Test on the Difference in Means

We learned four methods to verify the hypothesis on the difference in means:

  • The two-sample t-Test
  • Welch’s two-sample t-Test
  • Wilcoxon’s Rank-sum Test
  • The Bootstrap Test 

Let’s start with the two-sample t-Test. Assuming a two-sided alternative, the null and alternate hypotheses are:

H_{0}: \mu_{1} - \mu_{2}=0

H_{A}: \mu_{1} - \mu_{2} \neq 0

The alternate hypothesis indicates that substantial positive or negative differences will allow the rejection of the null hypothesis.

We know that for the two-sample t-Test, we assume that the population variances are equal, and the test-statistic is t_{0}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}, where s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2} is the pooled variance. t_{0} follows a T-distribution with n_{1}+n_{2}-2 degrees of freedom. We can compute the test-statistic and check how likely it is to see such a value in a T-distribution (null distribution) with so many degrees of freedom.

Execute the following lines in R.

# Pooled variance s^2
pooled_var = ((n1-1)/(n1+n2-2))*x1var + ((n2-1)/(n1+n2-2))*x2var

# Test-statistic t0
t0 = (x1bar-x2bar)/sqrt(pooled_var*((1/n1)+(1/n2)))

# Degrees of freedom
df = n1+n2-2

# One-tailed p-value (t0 is negative here), compared against alpha/2
pval = pt(t0,df=df)

print(pooled_var)
print(df)
print(t0)
print(pval)

We first computed the pooled variance s^{2} and, using it, the test-statistic t_{0}. Then, based on the degrees of freedom, we compute the p-value.

The p-value is 0.000736. For a two-sided test, we compare this with \frac{\alpha}{2}. If we opt for a rejection rate \alpha of 5%, then, since our p-value of 0.000736 is less than 0.025, we reject the null hypothesis that the means are equal.

All these steps can be implemented using a one-line command in R.

Try this:

t.test(feb_refuse_data,aug_refuse_data,alternative="two.sided",var.equal = TRUE)

You will get the following prompt in the console.

Two Sample t-test

data: feb_refuse_data and aug_refuse_data

t = -3.3366, df = 59, p-value = 0.001472

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

 -658.2825 -164.7239

sample estimates:

mean of x mean of y 

 2510.697 2922.200 

As inputs of the function, we need to provide the data (sample 1 and sample 2). We should also indicate that we evaluate a two-sided alternate hypothesis and that the population variances are equal. This will prompt the function to execute the standard two-sample t-test.

While the function provides additional information, all we need for our test is the first line. t = -3.3366 and df = 59 are the test-statistic and the degrees of freedom that we computed earlier. The p-value looks different from what we computed: it is two times the value we estimated. The function doubles the one-tailed p-value so that we can compare it to \alpha instead of \frac{\alpha}{2}. Either way, we know that we can reject the null hypothesis.
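If you want to see this doubling explicitly, you can reproduce both numbers from the reported test-statistic and degrees of freedom:

```r
# Reproducing the one-tailed and two-tailed p-values from the reported t0
t0 = -3.3366
df = 59

one_sided = pt(t0, df = df)            # what we computed by hand
two_sided = 2 * pt(-abs(t0), df = df)  # what t.test() reports

print(one_sided)
print(two_sided)
```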


Next, let’s implement Welch’s t-Test. When the population variances are not equal, i.e., when \sigma_{1}^{2} \neq \sigma_{2}^{2}, the test-statistic is t_{0}^{*}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}, and it follows an approximate T-distribution with f degrees of freedom which can be estimated using the Satterthwaite’s equation: f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1)}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)}}

Execute the following lines.

# Satterthwaite's degrees of freedom
f = (((x1var/n1)+(x2var/n2))^2)/(((x1var/n1)^2/(n1-1))+((x2var/n2)^2/(n2-1)))

# Welch's test-statistic
t0 = (x1bar-x2bar)/sqrt((x1var/n1)+(x2var/n2))

# One-tailed p-value (t0 is negative here), compared against alpha/2
pval = pt(t0,df=f)

print(f)
print(t0)
print(pval)

The Satterthwaite degrees of freedom come out to approximately 55.3, the test-statistic is -3.3528, and the p-value is 0.000724. Comparing this with 0.025, we can reject the null hypothesis.

Of course, all this can be done using one line where you indicate that the variances are not equal:

t.test(feb_refuse_data,aug_refuse_data,alternative="two.sided",var.equal = FALSE)

When you execute the above line, you will see the following on your console.

Welch Two Sample t-test
data: feb_refuse_data and aug_refuse_data
t = -3.3528, df = 55.338, p-value = 0.001448
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-657.4360 -165.5705
sample estimates:
mean of x mean of y
2510.697 2922.200

Did you notice that it runs Welch’s two-sample t-Test when we indicate that the variances are not equal? Like before, we can compare the p-value to 0.05 and reject the null hypothesis.

The mean of the February refuse collected from Manhattan’s community district 9 is different from that of August. Since the test-statistic is negative and the p-value is lower than the chosen rejection rate, we can conclude that the mean February refuse tonnage is significantly lower than the mean August refuse tonnage.


How about Wilcoxon’s Rank-sum Test? The null hypothesis is based on the proposition that if x_{1} and x_{2} are samples from the same distribution, there will be an equal likelihood of one exceeding the other.

H_{0}: P(x_{1} > x_{2}) = 0.5

H_{A}: P(x_{1} > x_{2}) \neq 0.5

We know that this ranking and coming up with the rank-sum table is tedious for larger sample sizes. For larger sample sizes, the null distribution of the test-statistic W, which is the sum of the ranks associated with the variable of smaller sample size in the pooled and ordered data, tends to a normal distribution. This allows the calculation of the p-value by comparing W to a normal distribution. We can use the following command in R to implement this.

wilcox.test(feb_refuse_data,aug_refuse_data,alternative = "two.sided")

You will get the following message in the console when you execute the above line.

Wilcoxon rank sum test with continuity correction
data: feb_refuse_data and aug_refuse_data
W = 248, p-value = 0.001788
alternative hypothesis: true location shift is not equal to 0

Since the p-value of 0.001788 is lower than 0.05, the chosen level of rejection, we reject the null hypothesis that x_{1} and x_{2} are samples from the same distribution.
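For intuition, here is a minimal sketch of the large-sample normal approximation described above, using small hypothetical samples. Note that wilcox.test() additionally applies a continuity correction and handles ties, so its p-value will differ slightly from this hand calculation.

```r
# Manual rank-sum sketch with hypothetical data (no continuity correction)
x = c(12, 15, 9, 20, 14)        # sample 1, n1 = 5 (the smaller sample)
y = c(18, 22, 17, 25, 16, 21)   # sample 2, n2 = 6

n1 = length(x)
n2 = length(y)

ranks = rank(c(x, y))   # ranks in the pooled and ordered data
W = sum(ranks[1:n1])    # rank sum of the smaller sample

mu_W = n1 * (n1 + n2 + 1) / 2         # mean of W under H0
var_W = n1 * n2 * (n1 + n2 + 1) / 12  # variance of W under H0

z = (W - mu_W) / sqrt(var_W)
pval = 2 * pnorm(-abs(z))   # two-sided p-value

print(W)
print(pval)
```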


Finally, let’s take a look at the bootstrap method. The null hypothesis is that there is no difference between the means. 

H_{0}: P(\bar{x}_{Feb}>\bar{x}_{Aug}) = 0.5

H_{A}: P(\bar{x}_{Feb}>\bar{x}_{Aug}) \neq 0.5

We first create a bootstrap replicate of X and Y by randomly drawing with replacement n_{1} values from X and n_{2} values from Y.

For each bootstrap replicate from X and Y, we compute the statistics \bar{x}_{Feb} and \bar{x}_{Aug} and check whether \bar{x}_{Feb}>\bar{x}_{Aug}. If yes, we register S_{i}=1. If not, we register S_{i}=0.

We repeat this process of creating bootstrap replicates of X and Y, computing the statistics \bar{x}_{Feb} and \bar{x}_{Aug}, verifying whether \bar{x}_{Feb}>\bar{x}_{Aug}, and registering S_{i} \in \{0,1\}, a large number of times, say N=10,000.

The proportion of times S_{i} = 1 in a set of N bootstrap-replicated statistics is the p-value.

Execute the following lines in R to implement these steps.

#4. Bootstrap
  N = 10000
  null_mean = matrix(0,nrow=N,ncol=1)
  null_mean_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
     null_mean_ratio[i] = mean(xboot)/mean(yboot) 
     if(mean(xboot)>mean(yboot)){null_mean[i]=1} 
   }
 
 pvalue_mean = sum(null_mean)/N
 hist(null_mean_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="Xbar/Ybar",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(0.95,1000,paste("p-value=",pvalue_mean),col="red")

Your RStudio interface will then look like this:

A vertical bar will show up at a ratio of 1 to indicate that the area beyond this value is the proportion of times S_{i} = 1 in 10,000 bootstrap-replicated means.

The p-value is close to 0. Since it is less than 0.025, we reject the null hypothesis. The mean February tonnage is less than the mean August tonnage in well over 97.5% of the bootstrap replicates, which is sufficient evidence that the two means are not equal.

In summary, while the standard t-Test, Welch's t-Test, and Wilcoxon's Rank-sum Test can be implemented in R using single-line commands,

t.test(x1,x2,alternative="two.sided",var.equal = TRUE)
t.test(x1,x2,alternative="two.sided",var.equal = FALSE)
wilcox.test(x1,x2,alternative = "two.sided")

the bootstrap test needs a few lines of code.

Hypothesis Test on the Equality of Variances

We learned two methods to verify the hypothesis on the equality of variances: using the F-distribution and using the bootstrap method.

For the F-distribution method, the null and alternate hypotheses are

H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2}

H_{A}:\sigma_{1}^{2} \neq \sigma_{2}^{2}

The test-statistic is the ratio of the sample variances, f_{0}=\frac{s_{1}^{2}}{s_{2}^{2}}

We evaluate the hypothesis based on where this test-statistic lies in the null distribution or how likely it is to find a value as large as this test-statistic f_{0} in the null distribution.

Execute the following lines.

# 1. F-Test
  f0 = x1var/x2var
  
  df_numerator = n1-1
  
  df_denominator = n2-1
  
  pval = 1-pf(f0,df1=df_numerator,df2=df_denominator)
  
  print(f0)
  print(df_numerator)
  print(df_denominator)
  print(pval)

The test-statistic, i.e., the ratio of the sample variances, is 1.814.

The p-value, which, in this case, is the probability of finding a value greater than the test-statistic (P(F > f_{0})), is 0.0562. When compared to a one-sided rejection rate (\frac{\alpha}{2}) of 0.025, we cannot reject the null hypothesis.

We do have a one-liner command.

var.test(feb_refuse_data,aug_refuse_data,alternative = "two.sided")

You will find the following message when you execute the var.test() command.

F test to compare two variances
data: feb_refuse_data and aug_refuse_data
F = 1.814, num df = 30, denom df = 29, p-value = 0.1125
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.8669621 3.7778616
sample estimates:
ratio of variances
1.813959

As usual, with the standard R function, we need to compare the p-value (which is two times what we computed based on the right-tail probability) with \alpha=0.05. Since 0.1125 > 0.05, we cannot reject the null hypothesis that the variances are equal.


Can we come to the same conclusion using the bootstrap test? Let’s try. Execute the following lines in R.

# 2. Bootstrap
 N = 10000
   null_var = matrix(0,nrow=N,ncol=1)
   null_var_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
     null_var_ratio[i] = var(xboot)/var(yboot) 
     if(var(xboot)>var(yboot)){null_var[i]=1} 
   }
 
 pvalue_var = sum(null_var)/N
 hist(null_var_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="XVar/YVar",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(2,500,paste("p-value=",pvalue_var),col="red")

Your RStudio interface should look like this if you correctly coded up and executed the lines. The code provides a way to visualize the null distribution.

The p-value is 0.7945; in 7,945 out of the 10,000 bootstrap replicates, s^{2}_{Feb}>s^{2}_{Aug}. For a 5% rate of error (\alpha=5\%), we cannot reject the null hypothesis since the p-value lies between 0.025 and 0.975. The evidence that the February tonnage variance is less than the August tonnage variance (20.55% of the replicates) is not sufficient to reject equality.

In summary, use var.test(x1,x2,alternative = "two.sided") if you want to verify based on F-distribution, or code up a few lines if you want to verify using the bootstrap method.

Hypothesis Test on the Difference in Proportions

For the hypothesis test on the difference in proportions, we can employ Fisher’s Exact Test, use the normal approximation under the large-sample assumption, or use the bootstrap method.

For Fisher’s Exact Test, the test-statistic follows a hypergeometric distribution when H_{0} is true. We could assume that the number of successes is fixed at t=x_{1}+x_{2}, and, for a fixed value of t, we reject H_{0}:p_{1}=p_{2} for the alternate hypothesis H_{A}:p_{1}>p_{2} if there are more successes in random variable X_{1} compared to X_{2}.

The p-value can be derived under the assumption that the number of successes X=k in the first sample X_{1} has a hypergeometric distribution when H_{0} is true and conditional on a total number of t successes that can come from any of the two random variables X_{1} and X_{2}.

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}

Execute the following lines to implement this process, plot the null distribution and compute the p-value.

#Fisher's Exact Test
  x1 = length(which(feb_refuse_data < p_threshold))
  p1 = x1/n1
  
  x2 = length(which(aug_refuse_data < p_threshold))
  p2 = x2/n2
  
  N = n1+n2
  t = x1+x2
  
  k = seq(from=0,to=n1,by=1)
  p = k
  for(i in 1:length(k)) 
   {
    p[i] = (choose(t,k[i])*choose((N-t),(n1-k[i])))/choose(N,n1)
   }

  plot(k,p,type="h",xlab="Number of successes in X1",ylab="P(X=k)",font=2,font.lab=2)
  points(k,p,type="o",lty=2,col="grey50")
  points(k[13:length(k)],p[13:length(k)],type="o",col="red",lwd=2)
  points(k[13:length(k)],p[13:length(k)],type="h",col="red",lwd=2)
  pvalue = sum(p[13:length(k)])
  print(pvalue)

Your RStudio should look like this once you run these lines.

The p-value is 0.1539. Since it is greater than the rejection rate, we cannot reject the null hypothesis that the proportions are equal.

We can also leverage an in-built R function that does the same. Try this.

fisher_data = cbind(c(x1,x2),c((n1-x1),(n2-x2)))
fisher.test(fisher_data,alternative="greater")

In the first line, we create a 2×2 contingency table to indicate the number of successes and failures in each sample. Sample 1 has 12 successes, i.e., 12 times the tonnage in February was less than 2500 tons. Sample 2 has seven successes. So the contingency table looks like this:

    [,1]  [,2]
[1,] 12    19
[2,]  7    23

Then, the fisher.test() function implements the hypergeometric distribution calculations.

You will find the following prompt in your console.

Fisher’s Exact Test for Count Data
data: fisher_data
p-value = 0.1539
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
0.712854 Inf
sample estimates:
odds ratio
2.050285

We get the same p-value as before — we cannot reject the null hypothesis.


Under the normal approximation for large sample sizes, the test-statistic is z = \frac{\hat{p_{1}}-\hat{p_{2}}}{\sqrt{p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})}} \sim N(0,1), where p is the pooled proportion, computed as p = \frac{x_{1}+x_{2}}{n_{1}+n_{2}}.

We reject the null hypothesis when the p-value P(Z \ge z) is less than the rate of rejection \alpha.

Execute the following lines to set this up in R.

# Z-approximation
  p = (x1+x2)/(n1+n2)
  z = (p1-p2)/sqrt(p*(1-p)*((1/n1)+(1/n2)))
  pval = 1-pnorm(z)

The p-value is 0.0974. Since it is greater than 0.05, we cannot reject the null hypothesis.


Finally, we can use the bootstrap method as follows.

# Bootstrap
 N = 10000
 null_prop = matrix(0,nrow=N,ncol=1)
 null_prop_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
 
     p1boot = length(which(xboot < p_threshold))/n1 
     p2boot = length(which(yboot < p_threshold))/n2 

     null_prop_ratio[i] = p1boot/p2boot 
    if(p1boot>p2boot){null_prop[i]=1} 
  }
 
 pvalue_prop = sum(null_prop)/N
 hist(null_prop_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="P1/P2",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(2,250,paste("p-value=",pvalue_prop),col="red")

We evaluated whether the low trash situation in February is different than August.

H_{0}: P(p_{Feb}>p_{Aug}) = 0.5

H_{A}: P(p_{Feb}>p_{Aug}) > 0.5

The p-value is 0.8941; in 8,941 out of the 10,000 bootstrap replicates, p_{Feb}>p_{Aug}. For a 5% rate of error (\alpha=5\%), we cannot reject the null hypothesis since the p-value is not greater than 0.95. The evidence that the low trash situation in February is greater than that in August (89.41% of the replicates) is insufficient to reject equality.

Your RStudio interface will look like this.

In summary, 
use fisher.test(contingency_table,alternative="greater") 
if you want to verify based on Fisher's Exact Test; 

use p = (x1+x2)/(n1+n2); 
    z = (p1-p2)/sqrt(p*(1-p)*((1/n1)+(1/n2))); 
    pval = 1-pnorm(z) 
if you want to verify based on the normal approximation; 

code up a few lines if you want to verify using the bootstrap method.

Back to our questions:

  1. Is the mean of the February refuse collected from Manhattan’s community district 9 different from August’s?
    • Reject the null hypothesis. They are different.
  2. Is the variance of the February refuse collected from Manhattan’s community district 9 different from August’s?
    • Cannot reject the null hypothesis. They are probably the same.
  3. Suppose we set 2500 tons as a benchmark for low trash situations. Is the proportion of February refuse below 2500 tons different from that of August? In other words, is the low trash situation in February different from that in August?
    • Cannot reject the null hypothesis. They are probably the same.

Please take the next two weeks to digest all the two-sample hypothesis tests, including how to execute them in R.

Here is the full code for today’s lesson.

In a little over four years, we reached 99 lessons 😎

I will have something different for the 100th. Stay tuned …

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 93 – The Two-Sample Hypothesis Test – Part II

On the Difference in Proportions

H_{0}: p_{1}-p_{2} = 0

H_{A}: p_{1}-p_{2} > 0

H_{A}: p_{1}-p_{2} < 0

H_{A}: p_{1}-p_{2} \neq 0

Joe and Mumble are interested in getting people’s opinion on the preference for a higher than 55 mph speed limit for New York State.
Joe spoke to ten of his rural friends, of which seven supported the idea of increasing the speed limit to 65 mph. Mumble spoke to eighteen of his urban friends, of which five favored a speed limit of 65 mph over the current limit of 55 mph.

Can we say that the sentiment for increasing the speed limit is stronger among rural than among urban residents?

We can use a hypothesis testing framework to address this question.

Last week, we learned how Fisher’s Exact test could be used to verify the difference in proportions. The test-statistic for the two-sample hypothesis test follows a hypergeometric distribution when H_{0} is true.

We also learned that, in more generalized cases where the number of successes is not known apriori, we could assume that the number of successes is fixed at t=x_{1}+x_{2}, and, for a fixed value of t, we reject H_{0}:p_{1}=p_{2} for the alternate hypothesis H_{A}:p_{1}>p_{2} if there are more successes in random variable X_{1} compared to X_{2}.

In short, the p-value can be derived under the assumption that the number of successes X=k in the first sample X_{1} has a hypergeometric distribution when H_{0} is true and conditional on a total number of t successes that can come from any of the two random variables X_{1} and X_{2}.

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}


Let’s apply this principle to the two samples that Joe and Mumble collected.

Let X_{1} be the random variable that denotes Joe’s rural sample. He surveyed a total of n_{1}=10 people and x_{1}=7 favored an increase in the speed limit. So the proportion p_{1} based on the number of successes is 0.7.

Let X_{2} be the random variable that denotes Mumble’s urban sample. He surveyed a total of n_{2}=18 people. x_{2}=5 out of the 18 favored an increase in the speed limit. So the proportion p_{2} based on the number of successes is 0.2778.

Let the total number of successes in both the samples be t=x_{1}+x_{2}=7+5=12.

Let’s also establish the null and alternate hypotheses.

H_{0}: p_{1}-p_{2}=0

H_{A}: p_{1}-p_{2}>0

The alternate hypothesis says that the sentiment for increasing the speed limit is stronger among rural (p_{1}) than among urban residents (p_{2}).

Larger values of x_{1} and smaller values of x_{2} support the alternate hypothesis H_{A} that p_{1}>p_{2} when t is fixed.

For a fixed value of t, we reject H_{0} if there are more successes in X_{1} compared to X_{2}.

Conditional on a total number of t successes from any of the two random variables, the number of successes X=k in the first sample has a hypergeometric distribution when H_{0} is true.

In the rural sample that Joe surveyed, seven favored an increase in the speed limit. So we can compute the p-value as the probability of obtaining more than seven successes in a rural sample of 10 when the total successes t from either urban or rural samples are twelve.

p-value=P(X \ge k) = P(X \ge 7)

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}

P(X=7) = \frac{\binom{12}{7}\binom{10+18-12}{10-7}}{\binom{10+18}{10}} =\frac{\binom{12}{7}\binom{16}{3}}{\binom{28}{10}} = 0.0338

A total of 12 successes exist, out of which the number of ways of choosing 7 is \binom{12}{7}.

A total of 28 – 12 = 16 non-successes exist, out of which the number of ways of choosing 10 – 7 = 3 non-successes is \binom{16}{3}.

A total sample of 10 + 18 = 28 exists, out of which the number of ways of choosing ten samples is \binom{28}{10}.

When we put them together, we can derive the probability P(X=7) for the hypergeometric distribution when H_{0} is true.

P(X=7) = \frac{\binom{12}{7}\binom{10+18-12}{10-7}}{\binom{10+18}{10}} =\frac{\binom{12}{7}\binom{16}{3}}{\binom{28}{10}} = 0.0338

Applying the same logic for k = 8, 9, and 10, we can derive their respective probabilities.

P(X=8) = \frac{\binom{12}{8}\binom{10+18-12}{10-8}}{\binom{10+18}{10}} =\frac{\binom{12}{8}\binom{16}{2}}{\binom{28}{10}} = 0.0045

P(X=9) = \frac{\binom{12}{9}\binom{10+18-12}{10-9}}{\binom{10+18}{10}} =\frac{\binom{12}{9}\binom{16}{1}}{\binom{28}{10}} = 0.0003

P(X=10) = \frac{\binom{12}{10}\binom{10+18-12}{10-10}}{\binom{10+18}{10}} =\frac{\binom{12}{10}\binom{16}{0}}{\binom{28}{10}} = 5.029296 \times 10^{-6}

The p-value can be computed as the sum of these probabilities.

p-value=P(X \ge k) = P(X = 7)+P(X = 8)+P(X = 9)+P(X = 10)=0.0386
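These hand calculations can be verified in R with the built-in hypergeometric functions; the numbers below come straight from Joe’s and Mumble’s samples.

```r
# Verify the hypergeometric p-value with dhyper()
n1 = 10; x1 = 7    # Joe's rural sample: 7 of 10 favored the increase
n2 = 18; x2 = 5    # Mumble's urban sample: 5 of 18 favored the increase
t = x1 + x2        # total successes, fixed under H0

# dhyper(k, #successes, #non-successes, sample size drawn)
pvalue = sum(dhyper(x1:n1, t, n1 + n2 - t, n1))
print(round(pvalue, 4))   # 0.0386
```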

Visually, the null distribution will look like this.

The x-axis shows the number of possible successes in X_{1}. They range from k = 0 to k = 10. The vertical bars are showing P(X=k) as derived from the hypergeometric distribution. The area highlighted in red is the p-value, the probability of finding \ge seven successes in a rural sample of 10 people.

The p-value is the probability of obtaining the computed test statistic under the null hypothesis. 

The smaller the p-value, the less likely the observed statistic is under the null hypothesis, and the stronger the evidence for rejecting the null.

Suppose we select a rate of error \alpha of 5%.

Since the p-value (0.0386) is smaller than our selected rate of error (0.05), we reject the null hypothesis for the alternate view that the sentiment for increasing the speed limit is stronger among rural (p_{1}) than among urban residents (p_{2}).

Let me remind you that this decision is based on the assumption that the null hypothesis is correct. Under this assumption, since we selected \alpha = 5\%, we will reject the true null hypothesis 5% of the time. At the same time, we will fail to reject the null hypothesis 95% of the time. In other words, 95% of the time, our decision to not reject the null hypothesis will be correct.

What if Joe and Mumble surveyed many more people?

You may be thinking that Joe and Mumble surveyed just a few people, hardly enough to draw a decent conclusion for a question like this. Perhaps they just called up their friends!

Let’s do a thought experiment. What would the null distribution look like if Joe and Mumble doubled their sample sizes and the successes increased in the same proportion? Would the p-value change?

Say Joe had surveyed 20 people, and 14 had favored an increase in the speed limit. n_{1} = 20; x_{1} = 14; p_{1} = 0.7.

Say Mumble had surveyed 36 people, and 10 had favored an increase in the speed limit. n_{2} = 36; x_{2} = 10; p_{2} = 0.2778.

p-value will then be P(X \ge 14) when there are 24 total successes.

The null distribution will look like this.

Notice that the null distribution is much more symmetric and looks like a bell curve (normal distribution) with an increase in the sample size. The p-value is 0.0026. More substantial evidence for rejecting the null hypothesis.
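The same hypergeometric computation verifies this p-value. A Python sketch with the doubled sample sizes, n_{1} = 20, n_{2} = 36, and t = 24 total successes:

```python
from math import comb

n1, n2, t = 20, 36, 24  # doubled rural and urban samples, 24 total successes

def hyper_pmf(k):
    # P(X = k): k of the t successes fall in the rural sample of n1
    return comb(t, k) * comb(n1 + n2 - t, n1 - k) / comb(n1 + n2, n1)

# p-value = P(X >= 14)
p_value = sum(hyper_pmf(k) for k in range(14, n1 + 1))
print(round(p_value, 4))  # 0.0026
```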

Is there a limiting distribution for the difference in proportions? If there is one, can we use it as the null distribution for the hypothesis test on the difference in proportions when the sample sizes are large?

While we embark on this derivation, let’s ask Joe and Mumble to survey many more people. When they are back, we will use new data to test the hypothesis.

But first, what is the limiting distribution for the difference in proportion?

We have two samples X_{1} and X_{2} of sizes n_{1} and n_{2}.

We might observe x_{1} and x_{2} successes in each of these samples. Hence, the proportions p_{1}, p_{2} can be estimated using \hat{p_{1}} = \frac{x_{1}}{n_{1}} and \hat{p_{2}} = \frac{x_{2}}{n_{2}}.

See, we are using \hat{p_{1}}, \hat{p_{2}} as the estimates of the true proportions p_{1}, p_{2}.

Take X_{1}. If the probability of success (proportion) is p_{1}, in a sample of n_{1}, we could observe x_{1}=0, 1, 2, 3, \cdots, n_{1} successes with a probability P(X=x_{1}) that is governed by a binomial distribution. In other words,

x_{1} \sim Bin(n_{1},p_{1})

Same logic applies to X_{2}.

x_{2} \sim Bin(n_{2},p_{2})

A binomial distribution tends to a normal distribution for large sample sizes; it can be estimated very accurately using the normal density function. We learned this in Lesson 48.

If you are curious as to how a binomial distribution function f(x)=\frac{n!}{(n-x)!x!}p^{x}(1-p)^{n-x} can be approximated by a normal density function f(x)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-1}{2}(\frac{x-\mu}{\sigma})^{2}}, look at this link.
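To see the approximation numerically, we can evaluate both functions at a single point. The sketch below uses illustrative values n = 100 and p = 0.3 (not data from the lesson), with \mu = np and \sigma^{2} = np(1-p):

```python
from math import comb, sqrt, pi, exp

n, p, x = 100, 0.3, 30  # illustrative values, evaluated near the mean np = 30

# exact binomial probability P(X = 30)
binom = comb(n, x) * p**x * (1 - p)**(n - x)

# normal density with matching mean and variance
mu, var = n * p, n * p * (1 - p)
normal = exp(-0.5 * (x - mu)**2 / var) / sqrt(2 * pi * var)

print(round(binom, 4), round(normal, 4))  # both close to 0.087
```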

But what is the limiting distribution for \hat{p_{1}} and \hat{p_{2}}?

x_{1} is the sum of n_{1} independent Bernoulli random variables (yes or no responses from the people). For a large enough sample size n_{1}, the distribution function of x_{1}, which is a binomial distribution, can be well-approximated by the normal distribution. Since \hat{p_{1}} is a linear function of x_{1}, the random variable \hat{p_{1}} can also be assumed to be normally distributed.

When both \hat{p_{1}} and \hat{p_{2}} are normally distributed, and when they are independent of each other, their sum or difference will also be normally distributed. We can derive it using the convolution of \hat{p_{1}} and \hat{p_{2}}.

Let Y = \hat{p_{1}}-\hat{p_{2}}

Y \sim N(E[Y], V[Y]) since both \hat{p_{1}}, \hat{p_{2}} \sim N()

If Y \sim N(E[Y], V[Y]), we can standardize it to a standard normal variable as

Z = \frac{Y-E[Y]}{\sqrt{V[Y]}} \sim N(0, 1)

We should now derive the expected value E[Y] and the variance V[Y] of Y.

Y = \hat{p_{1}}-\hat{p_{2}}

E[Y] = E[\hat{p_{1}}-\hat{p_{2}}] = E[\hat{p_{1}}] - E[\hat{p_{2}}]

V[Y] = V[\hat{p_{1}}-\hat{p_{2}}] = V[\hat{p_{1}}] + V[\hat{p_{2}}]

Since they are independent, the covariance term, which carries the negative sign, is zero.

We know that E[\hat{p_{1}}] = p_{1} and V[\hat{p_{1}}]=\frac{p_{1}(1-p_{1})}{n_{1}}. Recall Lesson 76.

When we put them together,

E[Y] = p_{1} - p_{2}

V[Y] = \frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}

and finally since Z = \frac{Y-E[Y]}{\sqrt{V[Y]}} \sim N(0, 1),

Z = \frac{\hat{p_{1}} - \hat{p_{2}} - (p_{1} - p_{2})}{\sqrt{\frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}}} \sim N(0, 1)

A few more steps and we are done. Joe and Mumble must be waiting for us.

The null hypothesis is H_{0}: p_{1}-p_{2}=0. Or, p_{1}=p_{2}.

We need the distribution under the null hypothesis — the null distribution.

Under the null hypothesis, let’s assume that p_{1}=p_{2} is p, a common value for the two population proportions.

Then, the expected value of Y is E[Y]=p_{1}-p_{2}=p-p = 0, and the variance is V[Y] = \frac{p(1-p)}{n_{1}} + \frac{p(1-p)}{n_{2}}

V[Y] = p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})

This shared value p for the two population proportions can be estimated by pooling the two samples into one sample of size n_{1}+n_{2} that contains x_{1}+x_{2} total successes.

p = \frac{x_{1}+x_{2}}{n_{1}+n_{2}}

Look at this estimate carefully. Can you see that the pooled estimate p is a weighted average of the two proportions (p_{1} and p_{2})?

.
.
.
Okay, what are x_{1} and x_{2}? Aren’t they n_{1}\hat{p_{1}} and n_{2}\hat{p_{2}} for the given two samples?

So p = \frac{n_{1}\hat{p_{1}}+n_{2}\hat{p_{2}}}{n_{1}+n_{2}}=\frac{n_{1}}{n_{1}+n_{2}}\hat{p_{1}}+ \frac{n_{2}}{n_{1}+n_{2}}\hat{p_{2}}

or, p = w_{1}\hat{p_{1}}+ w_{2}\hat{p_{2}}

At any rate,

E[Y]= 0

V[Y] = p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})

p=\frac{x_{1}+x_{2}}{n_{1}+n_{2}}

To summarize, when the null hypothesis is

H_{0}:p_{1}-p_{2}=0

for large sample sizes, the test-statistic z = \frac{\hat{p_{1}}-\hat{p_{2}}}{\sqrt{p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})}} \sim N(0,1)

If the alternate hypothesis H_{A} is p_{1}-p_{2}>0, we reject the null hypothesis when the p-value P(Z \ge z) is less than the rate of rejection \alpha. We can also say that when z > z_{\alpha}, we reject the null hypothesis.

If the alternate hypothesis H_{A} is p_{1}-p_{2}<0, we reject the null hypothesis when the p-value P(Z \le z) is less than the rate of rejection \alpha. Or when z < -z_{\alpha}, we reject the null hypothesis.

If the alternate hypothesis H_{A} is p_{1}-p_{2} \neq 0, we reject the null hypothesis when the p-value P(Z \le z) or P(Z \ge z) is less than the rate of rejection \frac{\alpha}{2}. Or when z < -z_{\frac{\alpha}{2}} or z > z_{\frac{\alpha}{2}}, we reject the null hypothesis.
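The three decision rules can be collected into a small helper. This is a sketch in Python; the function names are illustrative, and the standard normal CDF is built from the error function in the standard library:

```python
from math import erf, sqrt

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

def p_value(z, alternative):
    # p-value of the two-sample proportion z-test for a given alternative
    if alternative == "greater":        # H_A: p1 - p2 > 0
        return 1 - norm_cdf(z)
    if alternative == "less":           # H_A: p1 - p2 < 0
        return norm_cdf(z)
    return 2 * (1 - norm_cdf(abs(z)))   # H_A: p1 - p2 != 0

print(round(p_value(1.645, "greater"), 3))  # close to 0.05
```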

Okay, we are done. Let’s see what Joe and Mumble have.

The rural sample X_{1} has n_{1}=190 and x_{1}=70.

The urban sample X_{2} has n_{2}=310 and x_{2}=65.

Let’s first compute the estimates for the respective proportions — p_{1} and p_{2}.

\hat{p_{1}}=\frac{x_{1}}{n_{1}}=\frac{70}{190} = 0.3684

\hat{p_{2}}=\frac{x_{2}}{n_{2}}=\frac{65}{310} = 0.2097

Then, let’s compute the pooled estimate p for the population proportions.

p = \frac{x_{1}+x_{2}}{n_{1}+n_{2}}=\frac{70+65}{190+310}=\frac{135}{500}=0.27

Next, let’s compute the test-statistic under the large-sample assumption. 190 and 310 are pretty large samples.

z = \frac{\hat{p_{1}}-\hat{p_{2}}}{\sqrt{p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

z = \frac{0.3684-0.2097}{\sqrt{0.27(0.73)*(\frac{1}{190}+\frac{1}{310})}}=3.8798

Since our alternate hypothesis H_{A} is p_{1}-p_{2}>0, we compute the p-value as,
p-value=P(Z \ge 3.8798) = 5.227119*10^{-5} \approx 0
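The computation is easy to reproduce. A Python sketch with the standard library (exact arithmetic gives z \approx 3.88; the 3.8798 above comes from the rounded proportions):

```python
from math import erf, sqrt

n1, x1 = 190, 70   # rural sample
n2, x2 = 310, 65   # urban sample

p1_hat, p2_hat = x1 / n1, x2 / n2
p = (x1 + x2) / (n1 + n2)                      # pooled estimate, 0.27

z = (p1_hat - p2_hat) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
p_value = 0.5 * (1 - erf(z / sqrt(2)))          # P(Z >= z), one-sided

print(round(z, 2), p_value)  # 3.88 and roughly 5e-05
```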

Since the p-value (~0) is smaller than our selected rate of error (0.05), we reject the null hypothesis for the alternate view that the sentiment for increasing the speed limit is stronger among rural (p_{1}) than among urban residents (p_{2}).

Remember that the test-statistic is computed for the null hypothesis that p_{1}-p_{2}=0. What if the null hypothesis is not that the difference in proportions is zero but is equal to some nonzero value, say p_{1}-p_{2}=0.25?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 92 – The Two-Sample Hypothesis Test – Part I

Fisher’s Exact Test

You may remember this from Lesson 38, where we derived the hypergeometric distribution from first principles.

If there are R Pepsi cans in a total of N cans (N-R Cokes) and we are asked to identify them correctly, then in our selection of R cans, we can get k = 0, 1, 2, …, R Pepsis. The probability of correctly selecting k Pepsis is

P(X=k) = \frac{\binom{R}{k}\binom{N-R}{R-k}}{\binom{N}{R}}

X, the number of correct guesses (0, 1, 2, …, R) assumes a hypergeometric distribution. The control parameters of the hypergeometric distribution are N and R.

For example, if there are five cans in total, out of which three are Pepsi cans, picking exactly two Pepsi cans can be done in \binom{3}{2}*\binom{2}{1} ways: two Pepsi cans can be selected from three in \binom{3}{2} ways, and one Coke can from two Coke cans in \binom{2}{1} ways.

The overall possibilities of selecting three cans from a total of five cans are \binom{5}{3}.

Hence, P(X=2)=\frac{\binom{3}{2}*\binom{2}{1}}{\binom{5}{3}}=\frac{6}{10}
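A two-line check of this example in Python:

```python
from math import comb

# P(X = 2): two Pepsis from three, one Coke from two, out of C(5, 3) selections
p = comb(3, 2) * comb(2, 1) / comb(5, 3)
print(p)  # 0.6
```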


Now, suppose there are eight cans out of which four are Pepsi, and four are Coke, and we are testing John’s ability to identify Pepsi.

Since John has a better taste for Pepsi, he claims that he has a greater propensity to identify Pepsi from the hidden cans.

Of course, we don’t believe it, and we think his ability to identify Pepsi is no different than his ability to identify Coke.

Suppose his ability (probability) to identify Pepsi is p_{1} and his ability to identify Coke is p_{2}. We think p_{1}=p_{2} and John thinks p_{1} > p_{2}.

The null hypothesis that we establish is
H_{0}: p_{1} = p_{2}

John has an alternate hypothesis
H_{A}: p_{1} > p_{2}

Pepsi and Coke cans can be considered as two samples of four each.

Since there are two samples (Pepsi and Coke) and two outcomes (identifying or not identifying Pepsi), we can create a 2×2 contingency table like this.

John now identifies four cans as Pepsi out of the eight cans whose identity is hidden as in the fun experiment.

It turns out that the result of the experiment is as follows.

John correctly identified three Pepsi cans out of the four.

The probability that he will identify three correctly while sampling from a total of eight cans is

P(X=3)=\frac{\binom{4}{3}*\binom{4}{1}}{\binom{8}{4}}=\frac{\frac{4!}{1!3!}\frac{4!}{3!1!}}{\frac{8!}{4!4!}}=\frac{16}{70}=0.2286

If you recall from the prior hypothesis test lessons, you will ask for the null distribution. The null distribution is the probability distribution of observing any number of Pepsi cans while selecting from a total of eight cans (out of which four are known to be Pepsi). This will be the distribution that shows P(X=0), P(X=1), P(X=2), P(X=3), and P(X=4). Let’s compute these and present them visually.

P(X=0)=\frac{\binom{4}{0}*\binom{4}{4}}{\binom{8}{4}}=\frac{1}{70}=0.0143

P(X=1)=\frac{\binom{4}{1}*\binom{4}{3}}{\binom{8}{4}}=\frac{16}{70}=0.2286

P(X=2)=\frac{\binom{4}{2}*\binom{4}{2}}{\binom{8}{4}}=\frac{36}{70}=0.5143

P(X=3)=\frac{\binom{4}{3}*\binom{4}{1}}{\binom{8}{4}}=\frac{16}{70}=0.2286

P(X=4)=\frac{\binom{4}{4}*\binom{4}{0}}{\binom{8}{4}}=\frac{1}{70}=0.0143

In a hypergeometric null distribution with N = 8 and R = 4, what is the probability of getting a value of 3 or larger? If this probability is sufficiently low, the observed result is unlikely to have occurred by chance under the null hypothesis.

This probability is the p-value. It is the probability of obtaining the computed test statistic under the null hypothesis. The smaller the p-value, the less likely the observed statistic under the null hypothesis – and stronger evidence of rejecting the null.

P(X \ge 3)=P(X=3) + P(X=4) = 0.2286+0.0143=0.2429
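The full null distribution and the p-value can be reproduced in a few lines. A Python sketch with the standard library:

```python
from math import comb

R, N = 4, 8  # four Pepsi cans among eight total

def pmf(k):
    # P(X = k): k correct Pepsi identifications out of R picks
    return comb(R, k) * comb(N - R, R - k) / comb(N, R)

dist = {k: round(pmf(k), 4) for k in range(R + 1)}
p_value = pmf(3) + pmf(4)  # P(X >= 3)
print(dist)               # {0: 0.0143, 1: 0.2286, 2: 0.5143, 3: 0.2286, 4: 0.0143}
print(round(p_value, 4))  # 0.2429
```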

Let us select a rate of error \alpha of 10%.

Since the p-value (0.2429) is greater than our selected rate of error (0.1), we cannot reject the null hypothesis that the probability of choosing Pepsi and the probability of choosing Coke are the same.

John should have selected all four Pepsi cans for us to be able to reject the null hypothesis (H_{0}: p_{1} = p_{2}) in favor of the alternative hypothesis (H_{A}: p_{1} > p_{2}) conclusively.


The Famous Fisher Test

We just saw a variant of the famous test conducted by Ronald Fisher in 1919 when he devised an offhand test of a lady’s ability to differentiate between tea prepared in two different ways.

One afternoon, at tea-time in Rothamsted Field Station in England, a lady proclaimed that she preferred her tea with the milk poured into the cup after the tea, rather than poured into the cup before the tea. Fisher challenged the lady and presented her with eight cups of tea; four made the way she preferred, and four made the other way. She was told that there were four of each kind and asked to determine which four were prepared properly. Fisher subsequently used this experiment to illustrate the basic issues in experimentation.

sourced from Chapter 5 of “Teaching Statistics, a bag of tricks” by Andrew Gelman and Deborah Nolan

This test, now popular as Fisher’s Exact Test, is the basis for the two-sample hypothesis test to verify the difference in proportions. Just like how the proportion (p) for the one-sample test followed a binomial null distribution, the test-statistic for the two-sample test follows a hypergeometric distribution when H_{0} is true.

Here, where we know the exact number of correct Pepsi cans, the true distribution of the test-statistic (number of correct Pepsi cans) is hypergeometric. In more generalized cases where the number of successes is not known a priori, we need to make some assumptions.


Say there are two samples represented by random variables X_{1} and X_{2} with sample sizes n_{1} and n_{2}. The proportion p_{1} is based on the number of successes (x_{1}) in X_{1}, and the proportion p_{2} is based on the number of successes (x_{2}) in X_{2}. Let the total number of successes in both the samples be t=x_{1}+x_{2}.

If the null hypothesis is H_{0}: p_{1} = p_{2}, then, large values of x_{1} and small values of x_{2} support the alternate hypothesis that H_{A}: p_{1} > p_{2} when t is fixed.

In other words, for a fixed value of t=x_{1}+x_{2}, we reject H_{0}: p_{1} = p_{2}, if there are more successes in X_{1} compared to X_{2}.

So the question is: what is the probability distribution of x_{1} when the total successes are fixed at t and we have a total of n_{1}+n_{2} samples?

When the number of successes is t, and when H_{0}: p_{1} = p_{2} is true, these successes can come from either of the two random variables with equal likelihood.

A total sample of n_{1}+n_{2} exists out of which the number of ways of choosing n_{1} samples is \binom{n_{1}+n_{2}}{n_{1}}.

A total of t successes exist, out of which the number of ways of choosing k is \binom{t}{k}.

A total of n_{1}+n_{2}-t non-successes exist, out of which the number of ways of choosing n_{1}-k is \binom{n_{1}+n_{2}-t}{n_{1}-k}.

When we put them together, we can derive the probability P(X=k) for the hypergeometric distribution when H_{0} is true.

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}

Conditional on a total number of t successes that can come from any of the two random variables, the number of successes X=k in the first sample has a hypergeometric distribution when H_{0} is true.
The p-value can thus be derived.
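The general formula translates directly into code. A hedged Python sketch (the function names are illustrative), checked against John’s experiment where n_{1} = n_{2} = 4 and t = 4:

```python
from math import comb

def pmf(k, n1, n2, t):
    # P(X = k): k of the t total successes fall in the first sample, under H0
    return comb(t, k) * comb(n1 + n2 - t, n1 - k) / comb(n1 + n2, n1)

def p_value(x1, n1, n2, t):
    # P(X >= x1): one-sided p-value for H_A: p1 > p2
    return sum(pmf(k, n1, n2, t) for k in range(x1, min(t, n1) + 1))

print(round(p_value(3, 4, 4, 4), 4))  # 0.2429, matching the Pepsi example
```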

We begin diving into the two-sample tests. Fisher’s Exact Test and its generalization (with assumptions) for the two-sample hypothesis test on the proportion is the starting point. It is a direct extension of the one-sample hypothesis test on proportion — albeit with some assumptions. Assumptions are crucial for the two-sample hypothesis tests. As we study the difference in the parameters of the two populations, we will delve into them more closely.
STAY TUNED!

