Lesson 95 – The Two-Sample Hypothesis Test – Part IV

On the Difference in Means

using Welch’s t-Test

H_{0}: \mu_{1} - \mu_{2} = 0

H_{A}: \mu_{1} - \mu_{2} > 0

H_{A}: \mu_{1} - \mu_{2} < 0

H_{A}: \mu_{1} - \mu_{2} \neq 0

On the 24th day of January 2021, we examined Tom’s hypothesis on the Mohawk River’s arsenic levels.

After a lengthy exposé on the fundamentals behind hypothesis testing on the difference in means using a two-sample t-Test, we concluded that Tom could not reject the null hypothesis H_{0}: \mu_{1}-\mu_{2}=0.

He cannot count on the theory that the factory is illegally dumping their untreated waste into the west branch Mohawk River until he finds more evidence.

However, Tom now has a new theory that the factory is illegally dumping their untreated waste into both the west and the east branches of the Mohawk River. So, he took Ron with him to collect data from Utica River, a tributary of Mohawk that branches off right before the factory.

If his new theory is correct, he should find the mean arsenic concentration in either the west or the east branch to be significantly greater than the mean arsenic concentration in the Utica River.

There are now three samples whose concentrations in parts per billion are:

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7
n_{1}=10 | \bar{x_{1}} = 12.5 | s_{1}^{2} = 58.28

East Branch: 4, 6, 24, 11, 14, 7, 11, 25, 13, 5
n_{2}=10 | \bar{x_{2}} = 12 | s_{2}^{2} = 54.89

Utica River: 4, 4, 6, 4, 5, 7, 8
n_{3}=7 | \bar{x_{3}} = 5.43 | s_{3}^{2} = 2.62

Were you able to help Tom with his new hypothesis?

In his first hypothesis test, since the sample variances were close to each other (58.28 and 54.89), we assumed that the population variances are equal and proceeded with a t-Test.

Under the proposition that the population variances of the two random variables X_{1} and X_{2} are equal, i.e., \sigma_{1}^{2}=\sigma_{2}^{2}=\sigma^{2}, the test-statistic is t_{0}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}, where s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2} is the pooled variance. t_{0} follows a T-distribution with n_{1}+n_{2}-2 degrees of freedom.

But, can we make the same assumption in the new case? The Utica River sample has a sample variance s_{3}^{2} of 2.62. It would not be reasonable to assume that the population variances are equal.

How do we proceed when \sigma_{1}^{2} \neq \sigma_{2}^{2}?

Let’s go back a few steps and outline how we arrived at the test-statistic.

The hypothesis test is on the difference in means: \mu_{1} - \mu_{2}

A good estimator of the difference in population means is the difference in sample means: y = \bar{x_{1}} - \bar{x_{2}}

The expected value of y, E[y], is \mu_{1}-\mu_{2}, and the variance of y, V[y], is \frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}.

Since y \sim N(\mu_{1}-\mu_{2},\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}), its standardized version z = \frac{y-(\mu_{1}-\mu_{2})}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}} is the starting point to deduce the test-statistic.

This statistic reduces to z = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}} under the null hypothesis that \mu_{1}-\mu_{2}=0.

Last week, we entertained the idea that we are comparing the difference in the means of two populations whose variance is equal and reasoned that the test-statistic follows a T-distribution with (n_{1}+n_{2}-2) degrees of freedom.

This is because the pooled population variance \sigma^{2} can be replaced by its unbiased estimator s^{2}, which, in turn is related to a Chi-squared distribution with (n_{1}+n_{2}-2) degrees of freedom.

When the population variances are not equal, i.e., when \sigma_{1}^{2} \neq \sigma_{2}^{2}, there is no pooled variance that can be related to the Chi-square distribution.

The best estimate of V[y] is obtained by replacing the individual population variances (\sigma_{1}^{2}, \sigma_{2}^{2}) by the sample variances (s_{1}^{2}, s_{2}^{2}).

Hence,

z = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}

We should now identify the approximate distribution of this estimate of V[y].

Bernard Lewis Welch, a British statistician, in his works of 1936, 1938, and 1947, explained that, with some adjustments, this estimate of V[y] can be approximated by a Chi-square distribution, and hence the test-statistic \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}} can be approximated by a T-distribution.

Let’s digest the salient points of his work.

Assume \lambda_{1} = \frac{1}{n_{1}} and \lambda_{2} = \frac{1}{n_{2}}

Assume f_{1} = (n_{1}-1) and f_{2} = (n_{2}-1)

An estimate for the variance of y is \lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2}

Now, in Lesson 73, we learned that \frac{(n-1)s^{2}}{\sigma^{2}} is a Chi-square distribution. Based on this logic, we can write

s^{2} = \frac{1}{n-1}\sigma^{2}*\chi^{2}

So, \lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2} is of the form,

\lambda_{1}\frac{1}{n_{1}-1}\sigma_{1}^{2}\chi_{1}^{2}+\lambda_{2}\frac{1}{n_{2}-1}\sigma_{2}^{2}\chi_{2}^{2}

or,

\lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2} = a\chi_{1}^{2}+b\chi_{2}^{2}

a = \frac{\lambda_{1}\sigma_{1}^{2}}{f_{1}} | b = \frac{\lambda_{2}\sigma_{2}^{2}}{f_{2}}

Welch showed that if z = a\chi_{1}^{2}+b\chi_{2}^{2}, then, the distribution of z can be approximated using a Chi-square distribution with a random variable \chi=\frac{z}{g} and f degrees of freedom.

He found the constants f and g by equating the moments of z with the moments of this Chi-square distribution, i.e., the Chi-square distribution where the random variable \chi is \frac{z}{g}.

This is how he finds f and g

The mean (first moment) and the variance (second central moment) of a Chi-square distribution with f degrees of freedom are f and 2f, respectively.

The random variable we considered is \chi=\frac{z}{g}.

So,

E[\frac{z}{g}] = f | V[\frac{z}{g}] = 2f

Since g is a constant that needs to be estimated, we can reduce these equations as

\frac{1}{g}E[z] = f | \frac{1}{g^{2}}V[z] = 2f

Hence,

E[z] = gf | V[z] = 2g^{2}f

Now, let’s take the equation z = a\chi_{1}^{2}+b\chi_{2}^{2} and find the expected value and the variance of z.

E[z] = aE[\chi_{1}^{2}]+bE[\chi_{2}^{2}]

E[z] = af_{1}+bf_{2}

V[z] = a^{2}V[\chi_{1}^{2}]+b^{2}V[\chi_{2}^{2}]

V[z] = a^{2}2f_{1}+b^{2}2f_{2}=2a^{2}f_{1}+2b^{2}f_{2}

Now, he equates the moments derived using the equation for z to the moments derived from the Chi-square distribution of \frac{z}{g}

Equating the first moments: gf=af_{1}+bf_{2}

Equating the second moments: 2g^{2}f = 2a^{2}f_{1}+2b^{2}f_{2}

The above equation can be written as

g*(gf) = a^{2}f_{1}+b^{2}f_{2}

or,

g*(af_{1}+bf_{2}) = a^{2}f_{1}+b^{2}f_{2}

From here,

g = \frac{a^{2}f_{1}+b^{2}f_{2}}{af_{1}+bf_{2}}

Using this with gf=af_{1}+bf_{2}, we can obtain f.

f = \frac{(af_{1}+bf_{2})^{2}}{a^{2}f_{1}+b^{2}f_{2}}
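
Before we plug in the terms for a and b, here is a quick numerical check of this moment matching (a Python sketch with numpy; the values of a, b, f_{1}, and f_{2} below are hypothetical, chosen only for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical constants, only for illustration
a, b = 0.6, 0.2
f1, f2 = 9, 6

# Welch's moment-matched constants
g = (a**2 * f1 + b**2 * f2) / (a * f1 + b * f2)
f = (a * f1 + b * f2)**2 / (a**2 * f1 + b**2 * f2)

# simulate z = a*chi-square(f1) + b*chi-square(f2)
z = a * rng.chisquare(f1, 1_000_000) + b * rng.chisquare(f2, 1_000_000)

# if z/g is approximately chi-square with f degrees of freedom,
# its mean should be near f and its variance near 2f
print(np.mean(z / g), f)      # ~ f
print(np.var(z / g), 2 * f)   # ~ 2f
```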

Since we know the terms for a and b, we can say,

f = \frac{(\lambda_{1}\sigma_{1}^{2}+\lambda_{2}\sigma_{2}^{2})^{2}}{\frac{\lambda_{1}^{2}\sigma_{1}^{4}}{f_{1}}+\frac{\lambda_{2}^{2}\sigma_{2}^{4}}{f_{2}}}

Since the variance of y follows an approximate Chi-square distribution with f degrees of freedom, we can assume that the test-statistic \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}} follows an approximate T-distribution with f degrees of freedom.

Welch also showed that an unbiased estimate for f is

f = \frac{(\lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2})^{2}}{\frac{\lambda_{1}^{2}s_{1}^{4}}{f_{1}+2}+\frac{\lambda_{2}^{2}s_{2}^{4}}{f_{2}+2}} - 2

Essentially, he substitutes s_{1}^{2} for \sigma_{1}^{2} and s_{2}^{2} for \sigma_{2}^{2}, and, to correct for the bias, he adds 2 to the degrees of freedom in the denominator and subtracts an overall 2 from the fraction. He argues that this correction produces the best unbiased estimate for f.

Later authors like Franklin E. Satterthwaite showed that the bias correction might not be necessary, and it would suffice to use s_{1}^{2} for \sigma_{1}^{2}, and s_{2}^{2} for \sigma_{2}^{2} in the original equation, as in,

f = \frac{(\lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2})^{2}}{\frac{\lambda_{1}^{2}s_{1}^{4}}{f_{1}}+\frac{\lambda_{2}^{2}s_{2}^{4}}{f_{2}}}

Since we know that \lambda_{1} = \frac{1}{n_{1}}, \lambda_{2} = \frac{1}{n_{2}}, f_{1} = (n_{1}-1), and f_{2} = (n_{2}-1), we can finally say

When the population variances are not equal, i.e., when \sigma_{1}^{2} \neq \sigma_{2}^{2}, the test-statistic is t_{0}^{*}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}, and it follows an approximate T-distribution with f degrees of freedom.

The degrees of freedom can be estimated as

f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1) + 2}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)+2}} - 2

or,

f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1)}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)}}

This is now popularly known as Welch's t-Test.
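
If you want to compute these quantities directly, here is a minimal Python sketch (numpy assumed; the function name is mine). It returns the test-statistic along with both versions of the degrees of freedom, Welch’s bias-corrected one and the simpler Satterthwaite one.

```python
import numpy as np

def welch_t_test(x1, x2):
    """Welch's t-statistic and the two approximate degrees of freedom."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    v1, v2 = np.var(x1, ddof=1), np.var(x2, ddof=1)   # sample variances s1^2, s2^2

    se2 = v1 / n1 + v2 / n2                            # estimate of V[y]
    t0 = (x1.mean() - x2.mean()) / np.sqrt(se2)

    # Welch's bias-corrected degrees of freedom
    f_welch = se2**2 / ((v1 / n1)**2 / (n1 + 1) + (v2 / n2)**2 / (n2 + 1)) - 2

    # Welch-Satterthwaite degrees of freedom (no bias correction)
    f_satterthwaite = se2**2 / ((v1 / n1)**2 / (n1 - 1) + (v2 / n2)**2 / (n2 - 1))

    return t0, f_welch, f_satterthwaite
```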

Let’s now go back to Tom and help him with his new theory.

We will compare the west branch Mohawk River with the Utica River.

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7
n_{1}=10 | \bar{x_{1}} = 12.5 | s_{1}^{2} = 58.28

Utica River: 4, 4, 6, 4, 5, 7, 8
n_{3}=7 | \bar{x_{3}} = 5.43 | s_{3}^{2} = 2.62

Since we cannot assume that the population variances are equal, we will use Welch’s t-Test.

We compute the test-statistic and check how likely it is to see such a value in a T-distribution (approximate null distribution) with so many degrees of freedom.

t_{0}^{*}=\frac{\bar{x_{1}}-\bar{x_{3}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{3}^{2}}{n_{3}}}}

t_{0}^{*}=\frac{12.5-5.43}{\sqrt{\frac{58.28}{10}+\frac{2.62}{7}}}

t_{0}^{*}=2.84

Let’s compute the bias-corrected degrees of freedom suggested by Welch.

f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1) + 2}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)+2}} - 2

f = \frac{(\frac{58.28}{10}+\frac{2.62}{7})^{2}}{\frac{(58.28/10)^{2}}{(10 - 1) + 2}+\frac{(2.62/7)^{2}}{(7-1)+2}} - 2

f = 10.38

We can round it down to 10 degrees of freedom.

The test-statistic is 2.84. Since the alternate hypothesis is that the difference is greater than zero, Tom has to verify how likely it is to see a value greater than 2.84 in the approximate null distribution. Tom has to reject the null hypothesis if this probability (p-value) is smaller than the selected rate of rejection. 

Look at this visual.

The distribution is an approximate T-distribution with 10 degrees of freedom. Since he opted for a rejection level of 10%, there is a cutoff on the distribution at 1.37.

1.37 is the quantile on the right tail corresponding to a 10% probability (rate of rejection) for a T-distribution with ten degrees of freedom.

If the test statistic (t_{0}^{*}) is greater than t_{critical}, which is 1.37, he will reject the null hypothesis. At that point (i.e., at values greater than 1.37), there would be sufficient confidence to say that the difference is significantly greater than zero.

It is equivalent to rejecting the null hypothesis if P(T > t_{0}^{*}) (the p-value) is less than \alpha

We can read t_{critical} off the standard T-table, or we can compute P(T > t_{0}^{*}) from the distribution.

At ten degrees of freedom (df=10) and \alpha=0.1, t_{critical}=1.372 and P(T>t_{0}^{*})=0.009.
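
As a cross-check, a short Python sketch with scipy reproduces these numbers from Tom’s data. Note that scipy’s ttest_ind with equal_var=False uses the Satterthwaite degrees of freedom rather than the bias-corrected version, and the alternative keyword needs a reasonably recent SciPy; the numbers still agree to the precision quoted here.

```python
from scipy import stats

west  = [3, 7, 25, 10, 15, 6, 12, 25, 15, 7]
utica = [4, 4, 6, 4, 5, 7, 8]

# Welch's t-test with a one-sided alternative
t0, p_value = stats.ttest_ind(west, utica, equal_var=False, alternative="greater")
print(round(t0, 2), round(p_value, 3))     # ~ 2.84 and ~ 0.009

# using the rounded-down, bias-corrected degrees of freedom (f = 10) from the text
print(round(stats.t.sf(2.84, df=10), 3))   # p-value ~ 0.009
print(round(stats.t.ppf(0.90, df=10), 3))  # t_critical ~ 1.372
```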

Since the test-statistic t_{0}^{*} is in the rejection region, or since p-value < \alpha, Tom can reject the null hypothesis H_{0} that \mu_{1}-\mu_{3}=0.
He now has evidence beyond statistical doubt to claim that the factory is illegally dumping their untreated waste into the west branch Mohawk River.

Is it time for a lawsuit?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 94 – The Two-Sample Hypothesis Test – Part III

On the Difference in Means

using the t-Test

H_{0}: \mu_{1} - \mu_{2} = 0

H_{A}: \mu_{1} - \mu_{2} > 0

H_{A}: \mu_{1} - \mu_{2} < 0

H_{A}: \mu_{1} - \mu_{2} \neq 0

Tom grew up in the City of Mohawk, the land of natural springs, known for its pristine water. Tom’s house is alongside the west branch of Mohawk, one such pristine river. Every once in a while, Tom and his family go to the nature park on the banks of Mohawk. It is customary for Tom and his children to take a swim.

Lately, he has been reading in the local newspapers that the rivers’ arsenic levels have increased. Tom starts associating the alleged arsenic increases with this new factory in his neighborhood just upstream of Mohawk. They could be illegally dumping their untreated waste into the west branch.

He decided to test the waters of the west branch and the east branch of the Mohawk River. 

His buddy Ron, an environmental engineer, would help him with the laboratory testing and the likes.

Over the next ten days, Tom and Ron collected water samples from the west and east branches, and Ron got them tested in his lab for arsenic concentration.

In parts per billion, the sample data looked like this.

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7

East Branch: 4, 6, 24, 11, 14, 7, 11, 25, 13, 5

If Tom’s theory is correct, they should find the average arsenic concentration in the west branch to be significantly greater than the average arsenic concentration in the east branch. 

How can Tom test his theory?

.

.

.

You are right!

He can use the hypothesis testing framework and verify if there is evidence beyond a statistical doubt.

Tom establishes the null and alternate hypotheses. He assumes that the factory does not illegally dump their untreated waste into the west branch, so the average arsenic concentration in the west branch should be equal to the average arsenic concentration in the east branch Mohawk River. Or, the difference in their means is zero.

H_{0}: \mu_{1} - \mu_{2} = 0

Against this null hypothesis, he pits his theory that they are indeed illegally dumping their untreated waste. So, the average arsenic concentration in the west branch should be greater than the average arsenic concentration in the east branch Mohawk River — the difference in their means is greater than zero.

H_{A}: \mu_{1} - \mu_{2} > 0

The alternate hypothesis is one-sided. A significant positive difference needs to be seen to reject the null hypothesis.

Tom is taking a 10% risk of rejecting the null hypothesis; \alpha = 0.1. His Type I error is 10%.

Suppose the factory does not affect the water quality, but the ten samples he collected show a difference in the sample means much greater than zero. He would then reject the null hypothesis, thereby committing an error (Type I error) in his decision making.

You may already know that there is a certain level of subjectivity in the choice of \alpha.

Tom wants to prove that this factory is the leading cause of the increased arsenic levels in the west branch. So he would be willing to take a greater risk of rejecting the null hypothesis, i.e., he would be inclined to select a larger value of \alpha.

Someone who represents the factory management would be inclined to select a smaller value for \alpha, as that makes it less likely to reject the null hypothesis.

So, assuming that the null hypothesis is true, the decision to reject or not to reject is based on the value one chooses for \alpha.

Anyhow, now that the basic testing framework is set up, let’s look at what Tom needs.

He needs a test-statistic to represent the difference in the means of two samples. 
He needs the null distribution that this test-statistic converges to. In other words, he needs a probability distribution of the test-statistic to verify his null hypothesis -- how likely it is to see a value as large as (or greater than) the test-statistic in the null distribution.

Let’s take Tom on a mathematical excursion

There are two samples represented by random variables X_{1} and X_{2}.

The mean and variance of X_{1} are \mu_{1} and \sigma_{1}^{2}. We have one sample of size n_{1} from this population. Suppose the sample mean and the sample variance are \bar{x_{1}} and s_{1}^{2}.

The mean and variance of X_{2} are \mu_{2} and \sigma_{2}^{2}. We have one sample of size n_{2} from this population. Suppose the sample mean and the sample variance are \bar{x_{2}} and s_{2}^{2}.

The hypothesis test is on the difference in means; \mu_{1} - \mu_{2}

Naturally, a good estimator of the difference in population means (\mu_{1}-\mu_{2}) is the difference in sample means (\bar{x_{1}}-\bar{x_{2}}).

y = \bar{x_{1}}-\bar{x_{2}}

If we know the probability distributions of \bar{x_{1}} and \bar{x_{2}}, we could perhaps infer the probability distribution of y.

The sample mean is an unbiased estimate of the true mean, so the expected value of the sample mean is equal to the truth. E[\bar{x}] = \mu. We learned this in Lesson 67.

The variance of the sample mean (V[\bar{x}]) is \frac{\sigma^{2}}{n}. It indicates the spread around the center of the distribution. We learned this in Lesson 68.

Putting these two together, and with the central limit theorem, we can say \bar{x} \sim N(\mu,\frac{\sigma^{2}}{n})

So, for the two samples,

\bar{x_{1}} \sim N(\mu_{1},\frac{\sigma_{1}^{2}}{n_{1}}) | \bar{x_{2}} \sim N(\mu_{2},\frac{\sigma_{2}^{2}}{n_{2}})

If \bar{x_{1}} and \bar{x_{2}} are normally distributed, it is reasonable to assume that y=\bar{x_{1}}-\bar{x_{2}} will also be normally distributed.

y \sim N(E[y], V[y])

We should see what E[y] and V[y] are.

Expected Value of y

y = \bar{x_{1}} - \bar{x_{2}}

E[y] = E[\bar{x_{1}} - \bar{x_{2}}]

E[y] = E[\bar{x_{1}}] - E[\bar{x_{2}}]

Since the expected value of the sample mean is the true population mean,

E[y] = \mu_{1} - \mu_{2}

Variance of y

y = \bar{x_{1}} - \bar{x_{2}}

V[y] = V[\bar{x_{1}} - \bar{x_{2}}]

V[y] = V[\bar{x_{1}}] + V[\bar{x_{2}}] (can you tell why?)

Since the variance of the sample mean (V[\bar{x}]) is \frac{\sigma^{2}}{n},

V[y] = \frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}

Using the expected value and the variance of y, we can now say that

y \sim N((\mu_{1} - \mu_{2}), (\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}))

Or, if we standardize it,

z = \frac{y-(\mu_{1} - \mu_{2})}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}}} \sim N(0, 1)

At this point, it might be tempting to say that z is the test-statistic and the null distribution is the standard normal distribution.

Hold on!

We can certainly say that, under the null hypothesis, \mu_{1} = \mu_{2}. So, z further reduces to

z = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}}} \sim N(0, 1)

However, there are two unknowns here. The population variance \sigma_{1}^{2} and \sigma_{2}^{2}.

Think about making some assumptions about these unknowns

What could be a reasonable estimate for these unknowns?

.

.

.

You got it. The sample variance s_{1}^{2} and s_{2}^{2}.

Now, let’s take the samples that Tom and Ron collected and compute the sample mean and sample variance for each. The equations for these are anyhow at your fingertips!

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7
\bar{x_{1}}=12.5 | s_{1}^{2}=58.28

East Branch: 4, 6, 24, 11, 14, 7, 11, 25, 13, 5
\bar{x_{2}}=12 | s_{2}^{2}=54.89

Look at the values of the sample variances: 58.28 and 54.89. They seem close enough. Maybe, just maybe, could the variances of the two populations be equal?

I am asking you to entertain the proposition that the population variances of the two random variables X_{1} and X_{2} are equal, and that we are comparing the difference in the means of two populations whose variance is the same.

Say, \sigma_{1}^{2}=\sigma_{2}^{2}=\sigma^{2}

Our test-statistic will now reduce to

\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{\sigma^{2}}{n_{1}} + \frac{\sigma^{2}}{n_{2}}}}

or

\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\sigma^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

There is a common variance \sigma^{2}, and it should suffice to come up with a reasonable estimate for this combined variance.

Let’s say that s^{2} is a good estimate for \sigma^{2}.

We call this the estimate for the pooled variance

If we have a formula for s^{2} and if it is an unbiased estimate for \sigma^{2}, then we can substitute s^{2} for \sigma^{2}.

What is the right equation for s^{2}?

Since s^{2} is the estimate for the pooled variance, it will be reasonable to assume that it is some weighted average of the individual sample variances.

s^{2} = w_{1}s_{1}^{2} + w_{2}s_{2}^{2}

Let’s compute the expected value of s^{2}.

E[s^{2}] = E[w_{1}s_{1}^{2} + w_{2}s_{2}^{2}]

E[s^{2}] = w_{1}E[s_{1}^{2}] + w_{2}E[s_{2}^{2}]

In Lesson 70, we learned that sample variance s^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2} is an unbiased estimate for population variance \sigma^{2}. The factor (n-1) in the denominator is to correct for a bias of -\frac{\sigma^{2}}{n}.

An unbiased estimate means E[s^{2}]=\sigma^{2}. If we apply this understanding to the pooled variance equation, we can see that

E[s^{2}] = w_{1}\sigma_{1}^{2} + w_{2}\sigma_{2}^{2}

Since \sigma_{1}^{2}=\sigma_{2}^{2} = \sigma^{2}, we get

E[s^{2}] = w_{1}\sigma^{2} + w_{2}\sigma^{2}

E[s^{2}] = (w_{1} + w_{2})\sigma^{2}

To get an unbiased estimate for \sigma^{2}, we need the weights to add up to 1. What could those weights be? Could it relate to the sample sizes?

With that idea in mind, let’s take a little detour.

Let me ask you a question.

What is the probability distribution that relates to sample variance?

.

.

.

You might have to go down memory lane to Lesson 73.

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom. We learned that the term \frac{(n-1)s^{2}}{\sigma^{2}} is a sum of (n-1) squared standard normal distributions.

So,

\frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} \sim \chi^{2}_{n_{1}-1} | \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} \sim \chi^{2}_{n_{2}-1}

Since the two samples are independent, the sum of these two terms, \frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} and \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} will follow a Chi-square distribution with (n_{1}+n_{2}-2) degrees of freedom.

Add them and see. \frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} is a sum of (n_{1}-1) squared standard normal distributions, and \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} is a sum of (n_{2}-1) squared standard normal distributions. Together, they are a sum of (n_{1}+n_{2}-2) squared standard normal distributions.

Since \frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} + \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} \sim \chi^{2}_{n_{1}+n_{2}-2}

we can say,

\frac{(n_{1}+n_{2}-2)s^{2}}{\sigma^{2}} \sim \chi^{2}_{n_{1}+n_{2}-2}

So we can think of developing weights in terms of the degrees of freedom of the Chi-square distribution. The first sample contributes n_{1}-1 degrees of freedom, and the second sample contributes n_{2}-1 degrees of freedom towards a total of n_{1}+n_{2}-2.

So the weight of the first sample can be w_{1}=\frac{n_{1}-1}{n_{1}+n_{2}-2} and the weight of the second sample can be w_{2}=\frac{n_{2}-1}{n_{1}+n_{2}-2}, and they add up to 1.

This then means that the equation for the estimate of the pooled variance is

s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2}

Since it is an unbiased estimate of \sigma^{2}, we can use s^{2} in place of \sigma^{2} in our test-statistic, which then looks like this.

t_{0} = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

where, s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2} is the estimate for the pooled variance.

Did you notice that I use t_{0} to represent the test-statistic?

Yes, I am getting to T with it.

It is a logical extension of the idea that, for one sample, t_{0}=\frac{\bar{x}-\mu}{\sqrt{\frac{s^{2}}{n}}} follows a T-distribution with (n-1) degrees of freedom.

There, the idea was derived from the fact that when you replace the population variance with sample variance, and the sample variance is related to a Chi-square distribution with (n-1) degrees of freedom, the test-statistic t_{0}=\frac{\bar{x}-\mu}{\sqrt{\frac{s^{2}}{n}}} follows a T-distribution with (n-1) degrees of freedom. Check out Lesson 73 (Learning from “Student”) to refresh your memory.

Here, in the case of the difference in means between two samples (\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\sigma^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}), the pooled population variance \sigma^{2} is replaced by its unbiased estimator (\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2}, which, in turn, is related to a Chi-square distribution with (n_{1}+n_{2}-2) degrees of freedom.

Hence, under the proposition that the population variances of the two random variables X_{1} and X_{2} are equal, the test-statistic is t_{0} = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}, and it follows a T-distribution with (n_{1}+n_{2}-2) degrees of freedom.
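
Here is a minimal sketch of this pooled-variance (equal-variance) t-test in Python (numpy assumed; the function name is mine).

```python
import numpy as np

def pooled_t_statistic(x1, x2):
    """Two-sample t-statistic under the equal population variance assumption."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    s1_sq, s2_sq = np.var(x1, ddof=1), np.var(x2, ddof=1)

    dof = n1 + n2 - 2
    # pooled variance: weighted average of the two sample variances
    s_sq = ((n1 - 1) / dof) * s1_sq + ((n2 - 1) / dof) * s2_sq

    t0 = (x1.mean() - x2.mean()) / np.sqrt(s_sq * (1 / n1 + 1 / n2))
    return t0, dof
```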


Now let’s evaluate the hypothesis that Tom set up — Finally!!

We can compute the test-statistic and check how likely it is to see such a value in a T-distribution (null distribution) with so many degrees of freedom.

t_{0} = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

where s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2}

In Tom’s case, both n_{1}=n_{2}=10. So the weights will be equal to 0.5.

s^{2}=0.5s_{1}^{2}+0.5s_{2}^{2}

s^{2}=0.5*58.28+0.5*54.89

s^{2}=56.58

t_{0} = \frac{12.5-12}{\sqrt{56.58(\frac{1}{10}+\frac{1}{10})}}

t_{0} = \frac{0.5}{\sqrt{11.316}}

t_{0} = 0.1486

The test-statistic is 0.1486. Since the alternate hypothesis is that the difference is greater than zero (H_{A}:\mu_{1}-\mu_{2}>0), Tom has to verify how likely it is to see a value greater than 0.1486 in the null distribution. Tom has to reject the null hypothesis if this probability (p-value) is smaller than the selected rate of rejection. A smaller than \alpha probability indicates that the difference is sufficiently large that, in a T-distribution with so many degrees of freedom, the likelihood of seeing a value greater than the test-statistic is small. In other words, the difference in the means is already sufficiently greater than zero and in the region of rejection.

Look at this visual.

The distribution is a T-distribution with 18 degrees of freedom (10 + 10 – 2). Tom had collected ten samples each for this test. Since he opted for a rejection level of 10%, there is a cutoff on the distribution at 1.33.

1.33 is the quantile on the right tail corresponding to a 10% probability (rate of rejection) for a T-distribution with eighteen degrees of freedom.

If the test statistic (t_{0}) is greater than t_{critical}, which is 1.33, he will reject the null hypothesis. At that point (i.e., at values greater than 1.33), there would be sufficient confidence to say that the difference is significantly greater than zero.

This decision is equivalent to rejecting the null hypothesis if P(T>t_{0}) (the p-value) is less than \alpha.

We can read t_{critical} off the standard T-table, or P(T>t_{0}) can be computed from the distribution.

At df = 18, and \alpha=0.1, t_{critical}=1.33 and P(T>t_{0})=0.44.
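
These numbers also fall out of a quick scipy check (equal_var=True gives the pooled test; the alternative keyword needs a reasonably recent SciPy).

```python
from scipy import stats

west = [3, 7, 25, 10, 15, 6, 12, 25, 15, 7]
east = [4, 6, 24, 11, 14, 7, 11, 25, 13, 5]

t0, p_value = stats.ttest_ind(west, east, equal_var=True, alternative="greater")
print(round(t0, 4), round(p_value, 2))     # ~ 0.1486 and ~ 0.44
print(round(stats.t.ppf(0.90, df=18), 2))  # t_critical ~ 1.33
```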

Since the test-statistic t_{0} is not in the rejection region, or since p-value > \alpha, Tom cannot reject the null hypothesis H_{0} that \mu_{1} - \mu_{2}=0
He cannot count on the theory that the factory is illegally dumping their untreated waste into the west branch Mohawk River until he finds more evidence.

Tom is not convinced. The endless newspaper stories on arsenic levels are bothering him. He begins to wonder whether the factory is illegally dumping their untreated waste into both the west and east branches of the Mohawk River. That could be one reason why he saw no significant difference in the concentrations.

Over the next week, he takes Ron with him to the Utica River, a tributary of the Mohawk that branches off right before the Mohawk meets the factory. If his new theory is correct, he should find the mean arsenic concentration in either the west or the east branch to be significantly greater than the mean arsenic concentration in the Utica River.

Ron again helped him with the laboratory testing. In parts per billion, the third sample data looks like this.

Utica River: 4, 4, 6, 4, 5, 7, 8

There are 7 data points. n_{3}=7

The sample mean \bar{x_{3}}=5.43

The sample variance s_{3}^{2} = 2.62

Can you help Tom with his new hypothesis tests? Does the proposition still hold?

To be continued…

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 93 – The Two-Sample Hypothesis Test – Part II

On the Difference in Proportions

H_{0}: p_{1}-p_{2} = 0

H_{A}: p_{1}-p_{2} > 0

H_{A}: p_{1}-p_{2} < 0

H_{A}: p_{1}-p_{2} \neq 0

Joe and Mumble are interested in getting people’s opinion on the preference for a higher than 55 mph speed limit for New York State.
Joe spoke to ten of his rural friends, of which seven supported the idea of increasing the speed limit to 65 mph. Mumble spoke to eighteen of his urban friends, of which five favored a speed limit of 65 mph over the current limit of 55 mph.

Can we say that the sentiment for increasing the speed limit is stronger among rural than among urban residents?

We can use a hypothesis testing framework to address this question.

Last week, we learned how Fisher’s Exact test could be used to verify the difference in proportions. The test-statistic for the two-sample hypothesis test follows a hypergeometric distribution when H_{0} is true.

We also learned that, in more generalized cases where the number of successes is not known apriori, we could assume that the number of successes is fixed at t=x_{1}+x_{2}, and, for a fixed value of t, we reject H_{0}:p_{1}=p_{2} for the alternate hypothesis H_{A}:p_{1}>p_{2} if there are more successes in random variable X_{1} compared to X_{2}.

In short, the p-value can be derived under the assumption that the number of successes X=k in the first sample X_{1} has a hypergeometric distribution when H_{0} is true and conditional on a total number of t successes that can come from any of the two random variables X_{1} and X_{2}.

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}


Let’s apply this principle to the two samples that Joe and Mumble collected.

Let X_{1} be the random variable that denotes Joe’s rural sample. He surveyed a total of n_{1}=10 people and x_{1}=7 favored an increase in the speed limit. So the proportion p_{1} based on the number of successes is 0.7.

Let X_{2} be the random variable that denotes Mumble’s urban sample. He surveyed a total of n_{2}=18 people. x_{2}=5 out of the 18 favored an increase in the speed limit. So the proportion p_{2} based on the number of successes is 0.2778.

Let the total number of successes in both the samples be t=x_{1}+x_{2}=7+5=12.

Let’s also establish the null and alternate hypotheses.

H_{0}: p_{1}-p_{2}=0

H_{A}: p_{1}-p_{2}>0

The alternate hypothesis says that the sentiment for increasing the speed limit is stronger among rural (p_{1}) than among urban residents (p_{2}).

Larger values of x_{1} and smaller values of x_{2} support the alternate hypothesis H_{A} that p_{1}>p_{2} when t is fixed.

For a fixed value of t, we reject H_{0} if there are more successes in X_{1} than in X_{2}.

Conditional on a total number of t successes from any of the two random variables, the number of successes X=k in the first sample has a hypergeometric distribution when H_{0} is true.

In the rural sample that Joe surveyed, seven favored an increase in the speed limit. So we can compute the p-value as the probability of obtaining seven or more successes in a rural sample of 10 when the total number of successes t across the rural and urban samples is twelve.

p-value=P(X \ge k) = P(X \ge 7)

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}

P(X=7) = \frac{\binom{12}{7}\binom{10+18-12}{10-7}}{\binom{10+18}{10}} =\frac{\binom{12}{7}\binom{16}{3}}{\binom{28}{10}} = 0.0338

A total of 12 successes exist, out of which the number of ways of choosing 7 is \binom{12}{7}.

A total of 28 – 12 = 16 non-successes exist, out of which the number of ways of choosing 10 – 7 = 3 non-successes is \binom{16}{3}.

A total sample of 10 + 18 = 28 exists, out of which the number of ways of choosing ten samples is \binom{28}{10}.

When we put them together, we can derive the probability P(X=7) for the hypergeometric distribution when H_{0} is true.

P(X=7) = \frac{\binom{12}{7}\binom{10+18-12}{10-7}}{\binom{10+18}{10}} =\frac{\binom{12}{7}\binom{16}{3}}{\binom{28}{10}} = 0.0338

Applying the same logic for k = 8, 9, and 10, we can derive their respective probabilities.

P(X=8) = \frac{\binom{12}{8}\binom{10+18-12}{10-8}}{\binom{10+18}{10}} =\frac{\binom{12}{8}\binom{16}{2}}{\binom{28}{10}} = 0.0045

P(X=9) = \frac{\binom{12}{9}\binom{10+18-12}{10-9}}{\binom{10+18}{10}} =\frac{\binom{12}{9}\binom{16}{1}}{\binom{28}{10}} = 0.0003

P(X=10) = \frac{\binom{12}{10}\binom{10+18-12}{10-10}}{\binom{10+18}{10}} =\frac{\binom{12}{10}\binom{16}{0}}{\binom{28}{10}} = 5.029296*10^{-6}

The p-value can be computed as the sum of these probabilities.

p-value=P(X \ge k) = P(X = 7)+P(X = 8)+P(X = 9)+P(X = 10)=0.0386
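
In Python, this p-value comes directly from the hypergeometric tail; scipy’s fisher_exact on the 2×2 contingency table gives the same answer (a sketch, scipy assumed).

```python
from scipy import stats

n1, n2 = 10, 18   # rural and urban sample sizes
x1, x2 = 7, 5     # successes (favoring 65 mph) in each sample
t = x1 + x2       # total successes, fixed under H0

# P(X >= 7): sf(k) returns P(X > k), so pass k - 1
p_value = stats.hypergeom.sf(x1 - 1, n1 + n2, t, n1)
print(round(p_value, 4))                   # ~ 0.0386

# equivalently, Fisher's exact test on the 2x2 table
_, p_fisher = stats.fisher_exact([[x1, n1 - x1], [x2, n2 - x2]],
                                 alternative="greater")
print(round(p_fisher, 4))                  # ~ 0.0386
```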

Visually, the null distribution will look like this.

The x-axis shows the number of possible successes in X_{1}. They range from k = 0 to k = 10. The vertical bars are showing P(X=k) as derived from the hypergeometric distribution. The area highlighted in red is the p-value, the probability of finding \ge seven successes in a rural sample of 10 people.

The p-value is the probability of obtaining a value as extreme as (or more extreme than) the computed test statistic under the null hypothesis.

The smaller the p-value, the less likely the observed statistic is under the null hypothesis – and the stronger the evidence for rejecting the null.

Suppose we select a rate of error \alpha of 5%.

Since the p-value (0.0386) is smaller than our selected rate of error (0.05), we reject the null hypothesis for the alternate view that the sentiment for increasing the speed limit is stronger among rural (p_{1}) than among urban residents (p_{2}).

Let me remind you that this decision is based on the assumption that the null hypothesis is correct. Under this assumption, since we selected \alpha = 5\%, we will reject the true null hypothesis 5% of the time. At the same time, we will fail to reject the null hypothesis 95% of the time. In other words, 95% of the time, our decision to not reject the null hypothesis will be correct.

What if Joe and Mumble surveyed many more people?

You must be thinking that Joe and Mumble surveyed just a few people, which is not enough to draw any decent conclusion for a question like this. Perhaps they just called up their friends!

Let’s do a thought experiment. What would the null distribution look like if Joe and Mumble had double the sample size and the successes also increased in the same proportion? Would the p-value change?

Say Joe had surveyed 20 people, and 14 had favored an increase in the speed limit. n_{1} = 20; x_{1} = 14; p_{1} = 0.7.

Say Mumble had surveyed 36 people, and 10 had favored an increase in the speed limit. n_{2} = 36; x_{2} = 10; p_{2} = 0.2778.

p-value will then be P(X \ge 14) when there are 24 total successes.

The null distribution will look like this.

Notice that the null distribution is much more symmetric and looks like a bell curve (normal distribution) with an increase in the sample size. The p-value is 0.0026. More substantial evidence for rejecting the null hypothesis.
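
The same hypergeometric tail, with the doubled counts, confirms this (a quick scipy check).

```python
from scipy import stats

# n1 = 20, x1 = 14; n2 = 36, x2 = 10; total successes t = 24
p_value = stats.hypergeom.sf(14 - 1, 20 + 36, 24, 20)   # P(X >= 14)
print(round(p_value, 4))                                 # ~ 0.0026
```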

Is there a limiting distribution for the difference in proportion? If there is one, can we use it as the null distribution for the hypothesis test on the difference in proportion when the sample sizes are large?

While we embark on this derivation, let’s ask Joe and Mumble to survey many more people. When they are back, we will use new data to test the hypothesis.

But first, what is the limiting distribution for the difference in proportion?

We have two samples X_{1} and X_{2} of sizes n_{1} and n_{2}.

We might observe x_{1} and x_{2} successes in each of these samples. Hence, the proportions p_{1}, p_{2} can be estimated using \hat{p_{1}} = \frac{x_{1}}{n_{1}} and \hat{p_{2}} = \frac{x_{2}}{n_{2}}.

See, we are using \hat{p_{1}}, \hat{p_{2}} as the estimates of the true proportions p_{1}, p_{2}.

Take X_{1}. If the probability of success (proportion) is p_{1}, in a sample of n_{1}, we could observe x_{1}=0, 1, 2, 3, \cdots, n_{1} successes with a probability P(X=x_{1}) that is governed by a binomial distribution. In other words,

x_{1} \sim Bin(n_{1},p_{1})

Same logic applies to X_{2}.

x_{2} \sim Bin(n_{2},p_{2})

A binomial distribution tends to a normal distribution for large sample sizes; it can be estimated very accurately using the normal density function. We learned this in Lesson 48.

If you are curious as to how a binomial distribution function f(x)=\frac{n!}{(n-x)!x!}p^{x}(1-p)^{n-x} can be approximated by a normal density function f(x)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-1}{2}(\frac{x-\mu}{\sigma})^{2}}, look at this link.

But what is the limiting distribution for \hat{p_{1}} and \hat{p_{2}}?

x_{1} is the sum of n_{1} independent Bernoulli random variables (yes or no responses from the people). For a large enough sample size n_{1}, the distribution function of x_{1}, which is a binomial distribution, can be well-approximated by the normal distribution. Since \hat{p_{1}} is a linear function of x_{1}, the random variable \hat{p_{1}} can also be assumed to be normally distributed.

When both \hat{p_{1}} and \hat{p_{2}} are normally distributed, and when they are independent of each other, their sum or difference will also be normally distributed. We can derive it using the convolution of \hat{p_{1}} and \hat{p_{2}}.

Let Y = \hat{p_{1}}-\hat{p_{2}}

Y \sim N(E[Y], V[Y]) since both \hat{p_{1}}, \hat{p_{2}} \sim N()

If Y \sim N(E[Y], V[Y]), we can standardize it to a standard normal variable as

Z = \frac{Y-E[Y]}{\sqrt{V[Y]}} \sim N(0, 1)

We should now derive the expected value E[Y] and the variance V[Y] of Y.

Y = \hat{p_{1}}-\hat{p_{2}}

E[Y] = E[\hat{p_{1}}-\hat{p_{2}}] = E[\hat{p_{1}}] - E[\hat{p_{2}}]

V[Y] = V[\hat{p_{1}}-\hat{p_{2}}] = V[\hat{p_{1}}] + V[\hat{p_{2}}]

Since they are independent, the covariance term, which carries the negative sign, is zero.

We know that E[\hat{p_{1}}] = p_{1} and V[\hat{p_{1}}]=\frac{p_{1}(1-p_{1})}{n_{1}}. Recall Lesson 76.

When we put them together,

E[Y] = p_{1} - p_{2}

V[Y] = \frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}

and finally since Z = \frac{Y-E[Y]}{\sqrt{V[Y]}} \sim N(0, 1),

Z = \frac{\hat{p_{1}} - \hat{p_{2}} - (p_{1} - p_{2})}{\sqrt{\frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}}} \sim N(0, 1)

A few more steps and we are done. Joe and Mumble must be waiting for us.

The null hypothesis is H_{0}: p_{1}-p_{2}=0. Or, p_{1}=p_{2}.

We need the distribution under the null hypothesis — the null distribution.

Under the null hypothesis, let’s assume that p_{1}=p_{2} is p, a common value for the two population proportions.

Then, the expected value of Y, E[Y]=p_{1}-p_{2}=p-p = 0, and the variance V[Y] = \frac{p(1-p)}{n_{1}} + \frac{p(1-p)}{n_{2}}

V[Y] = p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})

This shared value p for the two population proportions can be estimated by pooling the samples together into one sample of size n_{1}+n_{2}, in which there are x_{1}+x_{2} successes in total.

p = \frac{x_{1}+x_{2}}{n_{1}+n_{2}}

Look at this estimate carefully. Can you see that the pooled estimate p is a weighted average of the two proportions (p_{1} and p_{2})?

.
.
.
Okay, tell me: what are x_{1} and x_{2}? Aren’t they n_{1}\hat{p_{1}} and n_{2}\hat{p_{2}} for the given two samples?

So p = \frac{n_{1}\hat{p_{1}}+n_{2}\hat{p_{2}}}{n_{1}+n_{2}}=\frac{n_{1}}{n_{1}+n_{2}}\hat{p_{1}}+ \frac{n_{2}}{n_{1}+n_{2}}\hat{p_{2}}

or, p = w_{1}\hat{p_{1}}+ w_{2}\hat{p_{2}}

At any rate,

E[Y]= 0

V[Y] = p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})

p=\frac{x_{1}+x_{2}}{n_{1}+n_{2}}

To summarize, when the null hypothesis is

H_{0}:p_{1}-p_{2}=0

for large sample sizes, the test-statistic z = \frac{\hat{p_{1}}-\hat{p_{2}}}{\sqrt{p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})}} \sim N(0,1)

If the alternate hypothesis H_{A} is p_{1}-p_{2}>0, we reject the null hypothesis when the p-value P(Z \ge z) is less than the rate of rejection \alpha. We can also say that when z > z_{\alpha}, we reject the null hypothesis.

If the alternate hypothesis H_{A} is p_{1}-p_{2}<0, we reject the null hypothesis when the p-value P(Z \le z) is less than the rate of rejection \alpha. Or when z < -z_{\alpha}, we reject the null hypothesis.

If the alternate hypothesis H_{A} is p_{1}-p_{2} \neq 0, we reject the null hypothesis when the p-value P(Z \le z) or P(Z \ge z) is less than the rate of rejection \frac{\alpha}{2}. Or when z < -z_{\frac{\alpha}{2}} or z > z_{\frac{\alpha}{2}}, we reject the null hypothesis.
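
Putting these pieces together, a minimal Python sketch of the pooled two-proportion z-test could look like this (scipy assumed; the function name is mine).

```python
import numpy as np
from scipy import stats

def two_proportion_z_test(x1, n1, x2, n2, alternative="greater"):
    """Large-sample z-test for H0: p1 - p2 = 0 using the pooled proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se

    if alternative == "greater":      # H_A: p1 - p2 > 0
        p_value = stats.norm.sf(z)
    elif alternative == "less":       # H_A: p1 - p2 < 0
        p_value = stats.norm.cdf(z)
    else:                             # H_A: p1 - p2 != 0
        p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value
```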

Okay, we are done. Let’s see what Joe and Mumble have.

The rural sample X_{1} has n_{1}=190 and x_{1}=70.

The urban sample X_{2} has n_{2}=310 and x_{2}=65.

Let’s first compute the estimates for the respective proportions — p_{1} and p_{2}.

\hat{p_{1}}=\frac{x_{1}}{n_{1}}=\frac{70}{190} = 0.3684

\hat{p_{2}}=\frac{x_{2}}{n_{2}}=\frac{65}{310} = 0.2097

Then, let’s compute the pooled estimate p for the population proportions.

p = \frac{x_{1}+x_{2}}{n_{1}+n_{2}}=\frac{70+65}{190+310}=\frac{135}{500}=0.27

Next, let’s compute the test-statistic under the large-sample assumption. 190 and 310 are pretty large samples.

z = \frac{\hat{p_{1}}-\hat{p_{2}}}{\sqrt{p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

z = \frac{0.3684-0.2097}{\sqrt{0.27(0.73)*(\frac{1}{190}+\frac{1}{310})}}=3.8798

Since our alternate hypothesis H_{A} is p_{1}-p_{2}>0, we compute the p-value as,
p-value=P(Z \ge 3.8798) = 5.227119*10^{-5} \approx 0
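
A two-line check in Python reproduces these numbers from the raw counts (small rounding differences aside).

```python
from scipy import stats

z = (70/190 - 65/310) / (0.27 * 0.73 * (1/190 + 1/310)) ** 0.5
print(round(z, 2), stats.norm.sf(z))   # ~ 3.88 and p-value ~ 5.2e-05
```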

Since the p-value (~0) is smaller than our selected rate of error (0.05), we reject the null hypothesis for the alternate view that the sentiment for increasing the speed limit is stronger among rural (p_{1}) than among urban residents (p_{2}).

Remember that the test-statistic is computed for the null hypothesis that p_{1}-p_{2}=0. What if the null hypothesis is not that the difference in proportions is zero but is equal to some value? p_{1}-p_{2}=0.25

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 92 – The Two-Sample Hypothesis Test – Part I

Fisher’s Exact Test

You may remember this from Lesson 38, where we derived the hypergeometric distribution from first principles.

If there are R Pepsi cans in a total of N cans (N-R Cokes) and we are asked to identify them correctly, then in our selection of R cans we can get k = 0, 1, 2, …, R Pepsis right. The probability of correctly selecting k Pepsis is

P(X=k) = \frac{\binom{R}{k}\binom{N-R}{R-k}}{\binom{N}{R}}

X, the number of correct guesses (0, 1, 2, …, R) assumes a hypergeometric distribution. The control parameters of the hypergeometric distribution are N and R.

For example, if there are five cans in total, out of which three are Pepsi cans, picking exactly two Pepsi cans can be done in \binom{3}{2}*\binom{2}{1} ways. Two Pepsi cans are selected from three in \binom{3}{2} ways; one Coke can is selected from two Coke cans in \binom{2}{1} ways.

The overall possibilities of selecting three cans from a total of five cans are \binom{5}{3}.

Hence, P(X=2)=\frac{\binom{3}{2}*\binom{2}{1}}{\binom{5}{3}}=\frac{6}{10}


Now, suppose there are eight cans out of which four are Pepsi, and four are Coke, and we are testing John’s ability to identify Pepsi.

Since John has a better taste for Pepsi, he claims that he has a greater propensity to identify Pepsi from the hidden cans.

Of course, we don’t believe it, and we think his ability to identify Pepsi is no different than his ability to identify Coke.

Suppose his ability (probability) to identify Pepsi is p_{1} and his ability to identify Coke is p_{2}. We think p_{1}=p_{2} and John thinks p_{1} > p_{2}.

The null hypothesis that we establish is
H_{0}: p_{1} = p_{2}

John has an alternate hypothesis
H_{A}: p_{1} > p_{2}

Pepsi and Coke cans can be considered as two samples of four each.

Since there are two samples (Pepsi and Coke) and two outcomes (identifying or not identifying Pepsi), we can create a 2×2 contingency table like this.

John now identifies four cans as Pepsi out of the eight cans whose identity is hidden as in the fun experiment.

It turns out that the result of the experiment is as follows.

John correctly identified three Pepsi cans out of the four.

The probability that he will identify three correctly while sampling from a total of eight cans is

P(X=3)=\frac{\binom{4}{3}*\binom{4}{1}}{\binom{8}{4}}=\frac{\frac{4!}{1!3!}\frac{4!}{3!1!}}{\frac{8!}{4!4!}}=\frac{16}{70}=0.2286

If you recall from the prior hypothesis test lessons, you will ask for the null distribution. The null distribution is the probability distribution of observing any number of Pepsi cans while selecting from a total of eight cans (out of which four are known to be Pepsi). This will be the distribution that shows P(X=0), P(X=1), P(X=2), P(X=3), and P(X=4). Let’s compute these and present them visually.

P(X=0)=\frac{\binom{4}{0}*\binom{4}{4}}{\binom{8}{4}}=\frac{1}{70}=0.0143

P(X=1)=\frac{\binom{4}{1}*\binom{4}{3}}{\binom{8}{4}}=\frac{16}{70}=0.2286

P(X=2)=\frac{\binom{4}{2}*\binom{4}{2}}{\binom{8}{4}}=\frac{36}{70}=0.5143

P(X=3)=\frac{\binom{4}{3}*\binom{4}{1}}{\binom{8}{4}}=\frac{16}{70}=0.2286

P(X=4)=\frac{\binom{4}{4}*\binom{4}{0}}{\binom{8}{4}}=\frac{1}{70}=0.0143

In a hypergeometric null distribution with N = 8 and R = 4, what is the probability of getting a value as large as 3 or larger? If this has a sufficiently low probability, we cannot say that it may occur by chance.

This probability is the p-value. It is the probability of obtaining a value as extreme as (or more extreme than) the computed test statistic under the null hypothesis. The smaller the p-value, the less likely the observed statistic is under the null hypothesis – and the stronger the evidence for rejecting the null.

P(X \ge 3)=P(X=3) + P(X=4) = 0.2286+0.0143=0.2429
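
This tail probability is again a one-liner with scipy’s hypergeometric distribution, or equivalently with fisher_exact on the 2×2 table (a sketch, scipy assumed).

```python
from scipy import stats

# N = 8 cans, R = 4 Pepsi, John picks 4 cans; P(X >= 3)
p_value = stats.hypergeom.sf(2, 8, 4, 4)
print(round(p_value, 4))                   # ~ 0.2429

_, p_fisher = stats.fisher_exact([[3, 1], [1, 3]], alternative="greater")
print(round(p_fisher, 4))                  # ~ 0.2429
```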

Let us select a rate of error \alpha of 10%.

Since the p-value (0.2429) is greater than our selected rate of error (0.1), we cannot reject the null hypothesis that the probability of choosing Pepsi and the probability of choosing Coke are the same.

John should have selected all four Pepsi cans for us to be able to reject the null hypothesis (H_{0}: p_{1} = p_{2}) in favor of the alternative hypothesis (H_{A}: p_{1} > p_{2}) conclusively.


The Famous Fisher Test

We just saw a variant of the famous test conducted by Ronald Fisher in 1919 when he devised an offhand test of a lady’s ability to differentiate between tea prepared in two different ways.

One afternoon, at tea-time in Rothamsted Field Station in England, a lady proclaimed that she preferred her tea with the milk poured into the cup after the tea, rather than poured into the cup before the tea. Fisher challenged the lady and presented her with eight cups of tea; four made the way she preferred, and four made the other way. She was told that there were four of each kind and asked to determine which four were prepared properly. Fisher subsequently used this experiment to illustrate the basic issues in experimentation.

sourced from Chapter 5 of “Teaching Statistics, a bag of tricks” by Andrew Gelman and Deborah Nolan

This test, now popular as Fisher’s Exact Test, is the basis for the two-sample hypothesis test to verify the difference in proportions. Just like how the proportion (p) for the one-sample test followed a binomial null distribution, the test-statistic for the two-sample test follows a hypergeometric distribution when H_{0} is true.

Here, where we know the exact number of Pepsi cans in the experiment, the true distribution of the test-statistic (the number of correctly identified Pepsi cans) is hypergeometric. In more generalized cases where the number of successes is not known apriori, we need to make some assumptions.


Say there are two samples represented by random variables X_{1} and X_{2} with sample sizes n_{1} and n_{2}. The proportion p_{1} is based on the number of successes (x_{1}) in X_{1}, and the proportion p_{2} is based on the number of successes (x_{2}) in X_{2}. Let the total number of successes in both the samples be t=x_{1}+x_{2}.

If the null hypothesis is H_{0}: p_{1} = p_{2}, then, large values of x_{1} and small values of x_{2} support the alternate hypothesis that H_{A}: p_{1} > p_{2} when t is fixed.

In other words, for a fixed value of t=x_{1}+x_{2}, we reject H_{0}: p_{1} = p_{2}, if there are more successes in X_{1} compared to X_{2}.

So the question is: what is the probability distribution of x_{1} when the total successes are fixed at t, and we have a total of n_{1}+n_{2} samples?

When the number of successes is t, and when H_{0}: p_{1} = p_{2} is true, these successes can come from any of the two random variables with equal likelihood.

A total sample of n_{1}+n_{2} exists out of which the number of ways of choosing n_{1} samples is \binom{n_{1}+n_{2}}{n_{1}}.

A total of t successes exist, out of which the number of ways of choosing k is \binom{t}{k}.

A total of n_{1}+n_{2}-t non-successes exist, out of which the number of ways of choosing n_{1}-k is \binom{n_{1}+n_{2}-t}{n_{1}-k}.

When we put them together, we can derive the probability P(X=k) for the hypergeometric distribution when H_{0} is true.

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}

Conditional on a total number of t successes that can come from any of the two random variables, the number of successes X=k in the first sample has a hypergeometric distribution when H_{0} is true.
The p-value can thus be derived.

We begin diving into the two-sample tests. Fisher’s Exact Test and its generalization (with assumptions) for the two-sample hypothesis test on the proportion is the starting point. It is a direct extension of the one-sample hypothesis test on proportion — albeit with some assumptions. Assumptions are crucial for the two-sample hypothesis tests. As we study the difference in the parameters of the two populations, we will delve into them more closely.
STAY TUNED!

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
