Lesson 94 – The Two-Sample Hypothesis Test – Part III

On the Difference in Means

using the t-Test

H_{0}: \mu_{1} - \mu_{2} = 0

H_{A}: \mu_{1} - \mu_{2} > 0

H_{A}: \mu_{1} - \mu_{2} < 0

H_{A}: \mu_{1} - \mu_{2} \neq 0

Tom grew up in the City of Mohawk, the land of natural springs, known for its pristine water. Tom’s house is alongside the west branch of Mohawk, one such pristine river. Every once in a while, Tom and his family go to the nature park on the banks of Mohawk. It is customary for Tom and his children to take a swim.

Lately, he has been reading in the local newspapers that the river's arsenic levels have increased. Tom starts attributing the alleged arsenic increase to a new factory in his neighborhood just upstream of the Mohawk. They could be illegally dumping their untreated waste into the west branch.

He decided to test the waters of the west branch and the east branch of the Mohawk River. 

His buddy Ron, an environmental engineer, would help him with the laboratory testing and the like.

Over the next ten days, Tom and Ron collected water samples from the west and east branches, and Ron got them tested in his lab for arsenic concentration.

In parts per billion, the sample data looked like this.

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7

East Branch: 4, 6, 24, 11, 14, 7, 11, 25, 13, 5

If Tom’s theory is correct, they should find the average arsenic concentration in the west branch to be significantly greater than the average arsenic concentration in the east branch. 

How can Tom test his theory?

.

.

.

You are right!

He can use the hypothesis testing framework and verify if there is evidence beyond a statistical doubt.

Tom establishes the null and alternate hypotheses. He assumes that the factory does not illegally dump its untreated waste into the west branch, so the average arsenic concentration in the west branch should be equal to the average arsenic concentration in the east branch of the Mohawk River. Or, the difference in their means is zero.

H_{0}: \mu_{1} - \mu_{2} = 0

Against this null hypothesis, he pits his theory that they are indeed illegally dumping their untreated waste. So, the average arsenic concentration in the west branch should be greater than the average arsenic concentration in the east branch of the Mohawk River — the difference in their means is greater than zero.

H_{A}: \mu_{1} - \mu_{2} > 0

The alternate hypothesis is one-sided. A significant positive difference needs to be seen to reject the null hypothesis.

Tom is taking a 10% risk of falsely rejecting the null hypothesis; \alpha = 0.1. His Type I error rate is 10%.

Suppose the factory does not affect the water quality, but the ten samples he collected show a difference in the sample means much greater than zero. He would then reject the null hypothesis, committing an error (a Type I error) in his decision making.

You may already know that there is a certain level of subjectivity in the choice of \alpha.

Tom may want to prove that this factory is the leading cause of the increased arsenic levels in the west branch. So he could accept a greater risk of rejecting the null hypothesis, i.e., he might be inclined to select a larger value of \alpha.

Someone who represents the factory management would be inclined to select a smaller value for \alpha, as that makes it less likely to reject the null hypothesis.

So, assuming that the null hypothesis is true, the decision to reject or not to reject is based on the value one chooses for \alpha.

Anyhow, now that the basic testing framework is set up, let’s look at what Tom needs.

He needs a test-statistic to represent the difference in the means of two samples. 
He needs the null distribution that this test-statistic converges to. In other words, he needs a probability distribution of the test-statistic to verify his null hypothesis -- how likely it is to see a value as large as (or larger than) the test-statistic in the null distribution.

Let’s take Tom on a mathematical excursion.

There are two samples represented by random variables X_{1} and X_{2}.

The mean and variance of X_{1} are \mu_{1} and \sigma_{1}^{2}. We have one sample of size n_{1} from this population. Suppose the sample mean and the sample variance are \bar{x_{1}} and s_{1}^{2}.

The mean and variance of X_{2} are \mu_{2} and \sigma_{2}^{2}. We have one sample of size n_{2} from this population. Suppose the sample mean and the sample variance are \bar{x_{2}} and s_{2}^{2}.

The hypothesis test is on the difference in means; \mu_{1} - \mu_{2}

Naturally, a good estimator of the difference in population means (\mu_{1}-\mu_{2}) is the difference in sample means (\bar{x_{1}}-\bar{x_{2}}).

y = \bar{x_{1}}-\bar{x_{2}}

If we know the probability distributions of \bar{x_{1}} and \bar{x_{2}}, we could perhaps infer the probability distribution of y.

The sample mean is an unbiased estimate of the true mean, so the expected value of the sample mean is equal to the truth. E[\bar{x}] = \mu. We learned this in Lesson 67.

The variance of the sample mean (V[\bar{x}]) is \frac{\sigma^{2}}{n}. It indicates the spread around the center of the distribution. We learned this in Lesson 68.

Putting these two together, and with the central limit theorem, we can say \bar{x} \sim N(\mu,\frac{\sigma^{2}}{n})

So, for the two samples,

\bar{x_{1}} \sim N(\mu_{1},\frac{\sigma_{1}^{2}}{n_{1}}) | \bar{x_{2}} \sim N(\mu_{2},\frac{\sigma_{2}^{2}}{n_{2}})

If \bar{x_{1}} and \bar{x_{2}} are normal distributions, it is reasonable to assume that y=\bar{x_{1}}-\bar{x_{2}} will be a normal distribution.

y \sim N(E[y], V[y])

We should see what E[y] and V[y] are.

Expected Value of y

y = \bar{x_{1}} - \bar{x_{2}}

E[y] = E[\bar{x_{1}} - \bar{x_{2}}]

E[y] = E[\bar{x_{1}}] - E[\bar{x_{2}}]

Since the expected value of the sample mean is the true population mean,

E[y] = \mu_{1} - \mu_{2}

Variance of y

y = \bar{x_{1}} - \bar{x_{2}}

V[y] = V[\bar{x_{1}} - \bar{x_{2}}]

V[y] = V[\bar{x_{1}}] + V[\bar{x_{2}}] (can you tell why?)

Since, the variance of the sample mean (V[\bar{x}]) is \frac{\sigma^{2}}{n},

V[y] = \frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}

Using the expected value and the variance of y, we can now say that

y \sim N((\mu_{1} - \mu_{2}), (\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}))
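If you want to see these two results in action, here is a small simulation sketch. The population parameters below are hypothetical, chosen only for illustration (they are not Tom's data): we draw many pairs of samples, compute y = \bar{x_{1}} - \bar{x_{2}} each time, and check that the mean and variance of y match the formulas.

```python
import random
import statistics

random.seed(1)

# Hypothetical population parameters, chosen only for this illustration
mu1, sigma1, n1 = 5.0, 2.0, 10
mu2, sigma2, n2 = 3.0, 3.0, 10
reps = 20000

# Simulate the difference in sample means many times
y = []
for _ in range(reps):
    xbar1 = statistics.fmean(random.gauss(mu1, sigma1) for _ in range(n1))
    xbar2 = statistics.fmean(random.gauss(mu2, sigma2) for _ in range(n2))
    y.append(xbar1 - xbar2)

# Theory: E[y] = mu1 - mu2 and V[y] = sigma1^2/n1 + sigma2^2/n2
print(statistics.fmean(y))      # close to 5 - 3 = 2
print(statistics.variance(y))   # close to 4/10 + 9/10 = 1.3
```

A histogram of y would also look bell-shaped, consistent with the normal distribution claimed above.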

Or, if we standardize it,

z = \frac{y-(\mu_{1} - \mu_{2})}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}}} \sim N(0, 1)

At this point, it might be tempting to say that z is the test-statistic and the null distribution is the standard normal distribution.

Hold on!

We can certainly say that, under the null hypothesis, \mu_{1} = \mu_{2}. So, z further reduces to

z = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}}} \sim N(0, 1)

However, there are two unknowns here. The population variance \sigma_{1}^{2} and \sigma_{2}^{2}.

Think about making some assumptions about these unknowns.

What could be a reasonable estimate for these unknowns?

.

.

.

You got it. The sample variance s_{1}^{2} and s_{2}^{2}.

Now, let’s take the samples that Tom and Ron collected and compute the sample mean and sample variance for each. The equations for these should be at your fingertips by now!

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7
\bar{x_{1}}=12.5 | s_{1}^{2}=58.28

East Branch: 4, 6, 24, 11, 14, 7, 11, 25, 13, 5
\bar{x_{2}}=12 | s_{2}^{2}=54.89
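As a quick check, these summary statistics can be reproduced with Python's standard `statistics` module (`statistics.variance` uses the n-1 denominator, matching the unbiased sample variance used here):

```python
import statistics

west = [3, 7, 25, 10, 15, 6, 12, 25, 15, 7]
east = [4, 6, 24, 11, 14, 7, 11, 25, 13, 5]

xbar1 = statistics.fmean(west)        # sample mean of the west branch
s1_sq = statistics.variance(west)     # sample variance, (n-1) in the denominator
xbar2 = statistics.fmean(east)
s2_sq = statistics.variance(east)

print(xbar1, round(s1_sq, 2))   # 12.5 58.28
print(xbar2, round(s2_sq, 2))   # 12.0 54.89
```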

Look at the values of the sample variance: 58.28 and 54.89. They seem close enough. Maybe, just maybe, could the variances of the samples be equal?

I am asking you to entertain the proposition that the population variances of the two random variables X_{1} and X_{2} are equal and that we are comparing the difference in the means of two populations whose variance is equal.

Say, \sigma_{1}^{2}=\sigma_{2}^{2}=\sigma^{2}

Our test-statistic will now reduce to

\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{\sigma^{2}}{n_{1}} + \frac{\sigma^{2}}{n_{2}}}}

or

\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\sigma^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

There is a common variance \sigma^{2}, and it should suffice to come up with a reasonable estimate for this combined variance.

Let’s say that s^{2} is a good estimate for \sigma^{2}.

We call this the estimate for the pooled variance.

If we have a formula for s^{2} and if it is an unbiased estimate for \sigma^{2}, then we can substitute s^{2} for \sigma^{2}.

What is the right equation for s^{2}?

Since s^{2} is the estimate for the pooled variance, it will be reasonable to assume that it is some weighted average of the individual sample variances.

s^{2} = w_{1}s_{1}^{2} + w_{2}s_{2}^{2}

Let’s compute the expected value of s^{2}.

E[s^{2}] = E[w_{1}s_{1}^{2} + w_{2}s_{2}^{2}]

E[s^{2}] = w_{1}E[s_{1}^{2}] + w_{2}E[s_{2}^{2}]

In Lesson 70, we learned that sample variance s^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2} is an unbiased estimate for population variance \sigma^{2}. The factor (n-1) in the denominator is to correct for a bias of -\frac{\sigma^{2}}{n}.

An unbiased estimate means E[s^{2}]=\sigma^{2}. If we apply this understanding to the pooled variance equation, we can see that

E[s^{2}] = w_{1}\sigma_{1}^{2} + w_{2}\sigma_{2}^{2}

Since \sigma_{1}^{2}=\sigma_{2}^{2} = \sigma^{2}, we get

E[s^{2}] = w_{1}\sigma^{2} + w_{2}\sigma^{2}

E[s^{2}] = (w_{1} + w_{2})\sigma^{2}

To get an unbiased estimate for \sigma^{2}, we need the weights to add up to 1. What could those weights be? Could it relate to the sample sizes?

With that idea in mind, let’s take a little detour.

Let me ask you a question.

What is the probability distribution that relates to sample variance?

.

.

.

You might have to go down memory lane to Lesson 73.

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom. We learned that the term \frac{(n-1)s^{2}}{\sigma^{2}} is a sum of (n-1) squared standard normal distributions.

So,

\frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} \sim \chi^{2}_{n_{1}-1} | \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} \sim \chi^{2}_{n_{2}-1}

Since the two samples are independent, the sum of these two terms, \frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} and \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} will follow a Chi-square distribution with (n_{1}+n_{2}-2) degrees of freedom.

Add them and see. \frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} is a sum of (n_{1}-1) squared standard normal distributions, and \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} is a sum of (n_{2}-1) squared standard normal distributions. Together, they are a sum of (n_{1}+n_{2}-2) squared standard normal distributions.
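If you'd like to convince yourself numerically, here is a small simulation sketch with hypothetical unit-variance normal populations. Since a Chi-square distribution with k degrees of freedom has mean k and variance 2k, the simulated sum should center near n_{1}+n_{2}-2 = 18.

```python
import random
import statistics

random.seed(7)

# Hypothetical setup for illustration: two samples from N(0, 1)
n1, n2, sigma_sq = 10, 10, 1.0
reps = 10000

chi = []
for _ in range(reps):
    s1_sq = statistics.variance(random.gauss(0, 1) for _ in range(n1))
    s2_sq = statistics.variance(random.gauss(0, 1) for _ in range(n2))
    chi.append((n1 - 1) * s1_sq / sigma_sq + (n2 - 1) * s2_sq / sigma_sq)

# A Chi-square with k degrees of freedom has mean k and variance 2k
print(statistics.fmean(chi))      # close to n1 + n2 - 2 = 18
print(statistics.variance(chi))   # close to 2 * 18 = 36
```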

Since \frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} + \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} \sim \chi^{2}_{n_{1}+n_{2}-2}

we can say,

\frac{(n_{1}+n_{2}-2)s^{2}}{\sigma^{2}} \sim \chi^{2}_{n_{1}+n_{2}-2}

So we can think of developing weights in terms of the degrees of freedom of the Chi-square distribution. The first sample contributes n_{1}-1 degrees of freedom, and the second sample contributes n_{2}-1 degrees of freedom towards a total of n_{1}+n_{2}-2.

So the weight of the first sample can be w_{1}=\frac{n_{1}-1}{n_{1}+n_{2}-2} and the weight of the second sample can be w_{2}=\frac{n_{2}-1}{n_{1}+n_{2}-2}, and they add up to 1.

This then means that the equation for the estimate of the pooled variance is

s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2}

Since it is an unbiased estimate of \sigma^{2}, we can use s^{2} in place of \sigma^{2} in our test-statistic, which then looks like this.

t_{0} = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

where, s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2} is the estimate for the pooled variance.
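As a sketch, the pooled variance estimate is a one-line function. The function name below is mine for illustration, not a standard library routine:

```python
def pooled_variance(s1_sq, s2_sq, n1, n2):
    """Weighted average of two sample variances, weighted by degrees of freedom."""
    df = n1 + n2 - 2
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df

# Tom's samples have equal sizes, so the weights are both 0.5
print(round(pooled_variance(58.28, 54.89, 10, 10), 3))   # 56.585
```

With unequal sample sizes, the larger sample contributes proportionally more of its degrees of freedom to the estimate, which is exactly the weighting derived above.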

Did you notice that I use t_{0} to represent the test-statistic?

Yes, I am getting to T with it.

It is a logical extension of the idea that, for one sample, t_{0}=\frac{\bar{x}-\mu}{\sqrt{\frac{s^{2}}{n}}} follows a T-distribution with (n-1) degrees of freedom.

There, the idea was derived from the fact that when you replace the population variance with sample variance, and the sample variance is related to a Chi-square distribution with (n-1) degrees of freedom, the test-statistic t_{0}=\frac{\bar{x}-\mu}{\sqrt{\frac{s^{2}}{n}}} follows a T-distribution with (n-1) degrees of freedom. Check out Lesson 73 (Learning from “Student”) to refresh your memory.

Here, in the case of the difference in means between two samples (\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\sigma^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}), the pooled population variance \sigma^{2} is replaced by its unbiased estimator (\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2}, which, in turn, is related to a Chi-square distribution with (n_{1}+n_{2}-2) degrees of freedom.

Hence, under the proposition that the population variance of the two random variables X_{1} and X_{2} are equal, the test-statistic is t_{0} = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}, and it follows a T-distribution with (n_{1}+n_{2}-2) degrees of freedom.


Now let’s evaluate the hypothesis that Tom set up — Finally!!

We can compute the test-statistic and check how likely it is to see such a value in a T-distribution (null distribution) with so many degrees of freedom.

t_{0} = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

where s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2}

In Tom’s case, both n_{1}=n_{2}=10. So the weights will be equal to 0.5.

s^{2}=0.5s_{1}^{2}+0.5s_{2}^{2}

s^{2}=0.5*58.28+0.5*54.89

s^{2}=56.58

t_{0} = \frac{12.5-12}{\sqrt{56.58(\frac{1}{10}+\frac{1}{10})}}

t_{0} = \frac{0.5}{\sqrt{11.316}}

t_{0} = 0.1486
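The whole computation, from the raw data to the test-statistic, can be reproduced in a few lines of Python using only the standard library:

```python
import math
import statistics

west = [3, 7, 25, 10, 15, 6, 12, 25, 15, 7]
east = [4, 6, 24, 11, 14, 7, 11, 25, 13, 5]
n1, n2 = len(west), len(east)

s1_sq = statistics.variance(west)
s2_sq = statistics.variance(east)

# Pooled variance: degrees-of-freedom-weighted average of the two sample variances
s_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

# Test-statistic for the difference in means
t0 = (statistics.fmean(west) - statistics.fmean(east)) / math.sqrt(s_sq * (1 / n1 + 1 / n2))
print(round(t0, 4))   # 0.1486
```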

The test-statistic is 0.1486. Since the alternate hypothesis is that the difference is greater than zero (H_{A}:\mu_{1}-\mu_{2}>0), Tom has to verify how likely it is to see a value greater than 0.1486 in the null distribution. Tom has to reject the null hypothesis if this probability (the p-value) is smaller than the selected rate of rejection. A probability smaller than \alpha indicates that the difference is sufficiently large that, in a T-distribution with these degrees of freedom, the likelihood of seeing a value greater than the test-statistic is small. In other words, the difference in the means is already sufficiently greater than zero and lies in the region of rejection.

Look at this visual.

The distribution is a T-distribution with 18 degrees of freedom (10 + 10 – 2). Tom had collected ten samples each for this test. Since he opted for a rejection level of 10%, there is a cutoff on the distribution at 1.33.

1.33 is the quantile on the right tail corresponding to a 10% probability (rate of rejection) for a T-distribution with eighteen degrees of freedom.

If the test statistic (t_{0}) is greater than t_{critical}, which is 1.33, he will reject the null hypothesis. At that point (i.e., at values greater than 1.33), there would be sufficient confidence to say that the difference is significantly greater than zero.

This decision is equivalent to rejecting the null hypothesis if P(T>t_{0}) (the p-value) is less than \alpha.

We can read t_{critical} off the standard T-table, or P(T>t_{0}) can be computed from the distribution.

At df = 18, and \alpha=0.1, t_{critical}=1.33 and P(T>t_{0})=0.44.
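If you don't have a T-table handy, both numbers can be approximated with only Python's standard library. The function names below are my own (a library such as scipy would give the same answers directly via `stats.t.sf` and `stats.t.ppf`); here the tail probability is computed by numerically integrating the T-distribution's density, and the critical value by bisection.

```python
import math

def t_pdf(x, df):
    """Density of the T-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1.0 + x * x / df) ** (-(df + 1) / 2)

def t_tail(t, df, steps=2000):
    """P(T > t): 0.5 minus the area under the pdf from 0 to t (trapezoidal rule)."""
    h = t / steps
    area = 0.5 * (t_pdf(0.0, df) + t_pdf(t, df))
    area += sum(t_pdf(i * h, df) for i in range(1, steps))
    return 0.5 - area * h

def t_critical(alpha, df):
    """Upper-tail quantile: the t with P(T > t) = alpha, found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if t_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(t_tail(0.1486, 18), 2))    # p-value: 0.44
print(round(t_critical(0.10, 18), 2))  # critical value: 1.33
```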

Since the test-statistic t_{0} is not in the rejection region, or since the p-value > \alpha, Tom cannot reject the null hypothesis H_{0} that \mu_{1} - \mu_{2}=0.
He cannot support the theory that the factory is illegally dumping its untreated waste into the west branch of the Mohawk River until he finds more evidence.

Tom is not convinced. The endless newspaper stories on arsenic levels are bothering him. He begins to wonder whether the factory is illegally dumping its untreated waste into both the west and east branches of the Mohawk River. That could be one reason why he saw no significant difference in the concentrations.

Over the next week, he takes Ron with him to the Utica River, a tributary of the Mohawk that branches off right before the Mohawk meets the factory. If his new theory is correct, he should find that the mean arsenic concentration in either the west or the east branch is significantly greater than the mean arsenic concentration in the Utica River.

Ron again helped him with the laboratory testing. In parts per billion, the third sample data looks like this.

Utica River: 4, 4, 6, 4, 5, 7, 8

There are 7 data points. n_{3}=7

The sample mean \bar{x_{3}}=5.43

The sample variance s_{3}^{2} = 2.62
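These summary statistics for the third sample check out in Python as well:

```python
import statistics

utica = [4, 4, 6, 4, 5, 7, 8]

n3 = len(utica)
xbar3 = statistics.fmean(utica)
s3_sq = statistics.variance(utica)

print(n3, round(xbar3, 2), round(s3_sq, 2))   # 7 5.43 2.62
```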

Can you help Tom with his new hypothesis tests? Does the proposition still hold?

To be continued…

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

