Lesson 92 – The Two-Sample Hypothesis Test – Part I

Fisher’s Exact Test

You may remember this from Lesson 38, where we derived the hypergeometric distribution from first principles.

If there are R Pepsi cans in a total of N cans (so N-R Cokes) and we are asked to identify them correctly, then in our selection of R cans we can get k = 0, 1, 2, …, R Pepsis. The probability of correctly selecting k Pepsis is

P(X=k) = \frac{\binom{R}{k}\binom{N-R}{R-k}}{\binom{N}{R}}

X, the number of correct guesses (0, 1, 2, …, R) assumes a hypergeometric distribution. The control parameters of the hypergeometric distribution are N and R.

For example, if there are five cans in total, of which three are Pepsi cans, picking exactly two Pepsi cans can be done in \binom{3}{2}*\binom{2}{1} ways: two Pepsi cans can be selected from three in \binom{3}{2} ways, and one Coke can be selected from two Coke cans in \binom{2}{1} ways.

The overall possibilities of selecting three cans from a total of five cans are \binom{5}{3}.

Hence, P(X=2)=\frac{\binom{3}{2}*\binom{2}{1}}{\binom{5}{3}}=\frac{6}{10}
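The same arithmetic can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the function name hypergeom_pmf is a label chosen here for illustration and does not come from the lesson.

```python
from math import comb

def hypergeom_pmf(k, N, R):
    """P(X = k): exactly k Pepsis among the R cans we pick,
    when R of the N cans are Pepsi (hypothetical helper)."""
    return comb(R, k) * comb(N - R, R - k) / comb(N, R)

# Five cans, three of them Pepsi, exactly two Pepsis picked
p_two = hypergeom_pmf(2, N=5, R=3)
print(p_two)  # 0.6
```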


Now, suppose there are eight cans out of which four are Pepsi, and four are Coke, and we are testing John’s ability to identify Pepsi.

Since John has a better taste for Pepsi, he claims that he has a greater propensity to identify Pepsi from the hidden cans.

Of course, we don’t believe it, and we think his ability to identify Pepsi is no different than his ability to identify Coke.

Suppose his ability (probability) to identify Pepsi is p_{1} and his ability to identify Coke is p_{2}. We think p_{1}=p_{2} and John thinks p_{1} > p_{2}.

The null hypothesis that we establish is
H_{0}: p_{1} = p_{2}

John has an alternate hypothesis
H_{A}: p_{1} > p_{2}

Pepsi and Coke cans can be considered as two samples of four each.

Since there are two samples (Pepsi and Coke) and two outcomes (identifying or not identifying Pepsi), we can create a 2×2 contingency table that cross-tabulates the samples against the outcomes.

John now picks four cans as Pepsi out of the eight cans whose identities are hidden, as in the fun experiment.

It turns out that the result of the experiment is as follows.

John correctly identified three Pepsi cans out of the four.

The probability that he will identify three correctly while sampling from a total of eight cans is

P(X=3)=\frac{\binom{4}{3}*\binom{4}{1}}{\binom{8}{4}}=\frac{\frac{4!}{1!3!}\frac{4!}{3!1!}}{\frac{8!}{4!4!}}=\frac{16}{70}=0.2286

If you recall from the prior hypothesis test lessons, you will ask for the null distribution. The null distribution is the probability distribution of observing any number of Pepsi cans while selecting from a total of eight cans (out of which four are known to be Pepsi). This will be the distribution that shows P(X=0), P(X=1), P(X=2), P(X=3), and P(X=4). Let’s compute these and present them visually.

P(X=0)=\frac{\binom{4}{0}*\binom{4}{4}}{\binom{8}{4}}=\frac{1}{70}=0.0143

P(X=1)=\frac{\binom{4}{1}*\binom{4}{3}}{\binom{8}{4}}=\frac{16}{70}=0.2286

P(X=2)=\frac{\binom{4}{2}*\binom{4}{2}}{\binom{8}{4}}=\frac{36}{70}=0.5143

P(X=3)=\frac{\binom{4}{3}*\binom{4}{1}}{\binom{8}{4}}=\frac{16}{70}=0.2286

P(X=4)=\frac{\binom{4}{4}*\binom{4}{0}}{\binom{8}{4}}=\frac{1}{70}=0.0143
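The whole null distribution can also be tabulated in one go. The sketch below assumes nothing beyond the formula above and keeps the probabilities as exact fractions so that they can be checked to sum to one; the variable name null_dist is illustrative.

```python
from fractions import Fraction
from math import comb

N, R = 8, 4  # eight cans, four of them Pepsi

# Exact null probabilities P(X = k) for k = 0, 1, ..., R
null_dist = {
    k: Fraction(comb(R, k) * comb(N - R, R - k), comb(N, R))
    for k in range(R + 1)
}

for k, p in null_dist.items():
    print(k, round(float(p), 4))

# The probabilities of all possible outcomes must add up to one
print(sum(null_dist.values()) == 1)
```

The distribution is symmetric around k = 2 because the number of Pepsi cans equals the number of Coke cans.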

In a hypergeometric null distribution with N = 8 and R = 4, what is the probability of getting a value of 3 or larger? If this probability is sufficiently low, we cannot attribute the result to chance.

This probability is the p-value. It is the probability of obtaining a test statistic at least as extreme as the one observed when the null hypothesis is true. The smaller the p-value, the less likely the observed statistic is under the null hypothesis, and the stronger the evidence against the null.

P(X \ge 3)=P(X=3) + P(X=4) = 0.2286+0.0143=0.2429

Let us select a rate of error \alpha of 10%.

Since the p-value (0.2429) is greater than our selected rate of error (0.1), we cannot reject the null hypothesis that the probability of choosing Pepsi and the probability of choosing Coke are the same.
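As a quick check of this decision, the upper-tail sum and the comparison against \alpha can be sketched as follows. The variable names are chosen here for illustration only.

```python
from math import comb

N, R = 8, 4      # eight cans, four of them Pepsi
observed = 3     # Pepsi cans John identified correctly
alpha = 0.10     # selected rate of error

# p-value: P(X >= observed) under the hypergeometric null distribution
p_value = sum(
    comb(R, k) * comb(N - R, R - k) for k in range(observed, R + 1)
) / comb(N, R)

reject_null = p_value < alpha
print(round(p_value, 4), reject_null)  # 0.2429 False
```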

John would have had to identify all four Pepsi cans correctly for us to reject the null hypothesis (H_{0}: p_{1} = p_{2}) in favor of the alternative hypothesis (H_{A}: p_{1} > p_{2}). In that case, the p-value would be P(X \ge 4) = P(X=4) = 0.0143, which is less than 0.1.
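To see why nothing short of a perfect score is enough here, we can scan the possible outcomes for the smallest count whose upper-tail probability falls below \alpha. This is a sketch under the same N = 8, R = 4 setup; upper_tail and critical_k are hypothetical names.

```python
from math import comb

def upper_tail(k_obs, N, R):
    """P(X >= k_obs) under the hypergeometric null distribution."""
    favorable = sum(comb(R, k) * comb(N - R, R - k) for k in range(k_obs, R + 1))
    return favorable / comb(N, R)

alpha = 0.10

# Smallest number of correct Pepsi cans that leads to rejecting H0
critical_k = next(k for k in range(5) if upper_tail(k, 8, 4) < alpha)
print(critical_k, round(upper_tail(critical_k, 8, 4), 4))  # 4 0.0143
```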


The Famous Fisher Test

We just saw a variant of the famous test conducted by Ronald Fisher in 1919 when he devised an offhand test of a lady’s ability to differentiate between tea prepared in two different ways.

One afternoon, at tea-time in Rothamsted Field Station in England, a lady proclaimed that she preferred her tea with the milk poured into the cup after the tea, rather than poured into the cup before the tea. Fisher challenged the lady and presented her with eight cups of tea; four made the way she preferred, and four made the other way. She was told that there were four of each kind and asked to determine which four were prepared properly. Fisher subsequently used this experiment to illustrate the basic issues in experimentation.

Sourced from Chapter 5 of “Teaching Statistics: A Bag of Tricks” by Andrew Gelman and Deborah Nolan.

This test, now popular as Fisher’s Exact Test, is the basis for the two-sample hypothesis test to verify the difference in proportions. Just like how the proportion (p) for the one-sample test followed a binomial null distribution, the test-statistic for the two-sample test follows a hypergeometric distribution when H_{0} is true.

Here, where we know the exact number of correct Pepsi cans, the true distribution of the test-statistic (number of correct Pepsi cans) is hypergeometric. In more generalized cases where the number of successes is not known a priori, we need to make some assumptions.


Say there are two samples represented by random variables X_{1} and X_{2} with sample sizes n_{1} and n_{2}. The proportion p_{1} is based on the number of successes (x_{1}) in X_{1}, and the proportion p_{2} is based on the number of successes (x_{2}) in X_{2}. Let the total number of successes in both the samples be t=x_{1}+x_{2}.

If the null hypothesis is H_{0}: p_{1} = p_{2}, then, large values of x_{1} and small values of x_{2} support the alternate hypothesis that H_{A}: p_{1} > p_{2} when t is fixed.

In other words, for a fixed value of t=x_{1}+x_{2}, we reject H_{0}: p_{1} = p_{2}, if there are more successes in X_{1} compared to X_{2}.

So the question is: what is the probability distribution of x_{1} when the total successes are fixed at t, and we have a total of n_{1}+n_{2} samples?

When the number of successes is t, and when H_{0}: p_{1} = p_{2} is true, these successes can come from either of the two samples with equal likelihood.

A total sample of n_{1}+n_{2} exists out of which the number of ways of choosing n_{1} samples is \binom{n_{1}+n_{2}}{n_{1}}.

A total of t successes exist, out of which the number of ways of choosing k is \binom{t}{k}.

A total of n_{1}+n_{2}-t non-successes exist, out of which the number of ways of choosing n_{1}-k is \binom{n_{1}+n_{2}-t}{n_{1}-k}.

When we put them together, we can derive the probability P(X=k) for the hypergeometric distribution when H_{0} is true.

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}

Conditional on a total of t successes, which can come from either of the two samples, the number of successes X=k in the first sample has a hypergeometric distribution when H_{0} is true.
The p-value can thus be derived as the upper-tail probability P(X \ge x_{1}) under this distribution.
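The general formula can be sketched as a small function. This is an illustration under the one-sided alternative H_{A}: p_{1} > p_{2} only; the name fisher_one_sided is hypothetical and is not a library routine. As a check, it uses John's experiment with the Pepsi cans as the first sample (three of four called Pepsi) and the Coke cans as the second sample (one of four called Pepsi).

```python
from math import comb

def fisher_one_sided(x1, n1, x2, n2):
    """P(X >= x1) conditional on t = x1 + x2 total successes,
    for H0: p1 = p2 against HA: p1 > p2 (hypothetical helper)."""
    t = x1 + x2                # total successes, held fixed
    total = n1 + n2            # combined sample size
    k_max = min(t, n1)         # most successes the first sample can hold
    favorable = sum(
        comb(t, k) * comb(total - t, n1 - k) for k in range(x1, k_max + 1)
    )
    return favorable / comb(total, n1)

# John's experiment: x1 = 3 of n1 = 4, x2 = 1 of n2 = 4
p_value = fisher_one_sided(3, 4, 1, 4)
print(round(p_value, 4))  # 0.2429
```

The result matches the p-value computed by hand earlier, since t = 4 and n_{1} = 4 reproduce the N = 8, R = 4 case.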

With this lesson, we begin diving into the two-sample tests. Fisher’s Exact Test and its generalization (with assumptions) for the two-sample hypothesis test on the proportion are the starting point. The test is a direct extension of the one-sample hypothesis test on proportion, albeit with some assumptions. Assumptions are crucial for the two-sample hypothesis tests. As we study the difference in the parameters of the two populations, we will delve into them more closely.
STAY TUNED!

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
