Lesson 90 – The One-Sample Hypothesis Tests Using the Bootstrap

Hypothesis Tests – Part V

H_{0}: P(\theta > \theta^{*}) = 0.5

H_{A}: P(\theta > \theta^{*}) > 0.5

H_{A}: P(\theta > \theta^{*}) < 0.5

H_{A}: P(\theta > \theta^{*}) \neq 0.5

Jenny and Joe meet after 18 lessons.

I heard you are neck-deep into the hypothesis testing concepts.

Yes. And I am having fun learning how to test various hypotheses, be it on the mean, the standard deviation, or the proportion. It is also enlightening to learn how to approximate the null distribution using the limiting distribution concepts.

True. You have seen in lesson 86 — hypothesis tests on the proportion, that the null distribution is a Binomial distribution with n, the sample size, and p, the proportion being tested.

You have seen in lesson 87 — hypothesis tests on the mean, that the null distribution is a T-distribution because \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \sim t_{df=n-1}

You have seen in lesson 88 — hypothesis tests on the variance, that the null distribution is a Chi-square distribution because \frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{df=n-1}

Did you ever wonder what happens if the test statistic is mathematically more complicated than the mean, the variance, or the proportion, and its limiting distribution, i.e., the null distribution, is hard to derive?

Or, did you ever ask what if the assumptions that go into deriving the null distribution are not met or not fully satisfied?

Can you give an example?

Suppose you want to test a hypothesis on the median, the interquartile range, or the skewness of a distribution?

Or, if you are unsure about the distributional nature of the sample data? For instance, the assumption that \frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{df=n-1} is based on the premise that the sample data is normally distributed.

😕 There are non-parametric or distribution-free approaches? I remember Devine mentioning a bootstrap approach.

😎 In lesson 79, we learned about the concept of the bootstrap. Using the idea of the bootstrap, we can generate replicates of the original sample to approximate the probability distribution function of the population. Assuming that each data value in the sample is equally likely (with a probability of 1/n), we can randomly draw n values with replacement. By putting a probability of 1/n on each data point, we use the discrete empirical distribution \hat{f} as an approximation of the population distribution f.

Hmm. Since each bootstrap replicate is a possible representation of the population, we can compute the test statistics from this bootstrap sample. And, by repeating this, we can have many simulated values of the test statistics to create the null distribution against which we can test the hypothesis.

Exactly. No need to make any assumption on the distributional nature of the data at hand, or the kind of the limiting distribution for the test statistic. We can compute any test statistic from the bootstrap replicates and test the basis value using this simulated null distribution. You want to test the hypothesis on the median, go for it. On the skewness or geometric mean, no problem.

This sounds like an exciting approach that will free us from those limitations. Why don’t we go through it step by step and elaborate on the details? Our readers will appreciate it.

Absolutely. Do you want to use your professor’s hypothesis that the standard deviation of his class’s performance is 16.5 points, as a case in point?

Sure. In a recent conversation, he also revealed that the mean and median scores are 54 and 55 points, respectively, and that 20% of his class usually get a score of less than 40.

Aha. We can test all four hypotheses then. Let’s take the sample data.

60, 41, 70, 61, 69, 95, 33, 34, 82, 82

Yes, this is a sample of ten exam scores from our most recent test with him.

Let’s first review the concept of the bootstrap. We have the following data.

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these ten data points is 1/10, we can randomly draw ten numbers from these ten values — with replacement.

Yes, I can recall from lesson 79 that this is like playing the game of Bingo where the chips are these ten numbers. Each time we get a number, we put it back and draw again until we have ten numbers.

Yes. For real computations, we use a computer program that has this algorithm coded in. We draw a random number from a uniform distribution (f(u)) where u is between 0 and 1. These randomly drawn u's are mapped onto the ranked data to draw a specific value from the set. For example, in a set of ten values, for a randomly drawn u of 0.1, we can draw the first value in order — 33.
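If you want to try this on a computer, here is a minimal sketch of that mapping in Python, assuming NumPy is available; the seed and the variable names are illustrative, not part of the lesson.

```python
import numpy as np

data = np.array([60, 41, 70, 61, 69, 95, 33, 34, 82, 82])
rng = np.random.default_rng(1)                 # seed is arbitrary, for reproducibility

ranked = np.sort(data)                         # 33, 34, 41, 60, 61, 69, 70, 82, 82, 95
u = rng.uniform(0, 1, size=len(data))          # ten random draws of u between 0 and 1
ranks = np.ceil(u * len(data)).astype(int)     # map each u onto a rank 1..10
ranks = np.clip(ranks, 1, len(data))           # guard the (measure-zero) u = 0 edge case
bootstrap_replicate = ranked[ranks - 1]        # e.g., u = 0.1 maps to rank 1, the value 33
print(bootstrap_replicate)

# The same thing in one call:
# rng.choice(data, size=len(data), replace=True)
```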

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (60, 41, 70, 61, 69, 95, 33, 34, 82, 82); some may appear more than once, and some may not show up at all in a random sample.

Let me create a bootstrap replicate.

See, 70 appeared two times, 82 appeared three times, and 33 did not get selected at all.

Such bootstrap replicates are representations of the empirical distribution \hat{f}. The empirical distribution \hat{f} is the proportion of times each value in the data sample x_{1}, x_{2}, x_{3}, …, x_{n} occurs. If we assume that the data sample has been generated by randomly sampling from the true distribution, then, the empirical distribution (i.e., the observed frequency) \hat{f} is a sufficient statistic for the true distribution f.

In other words, all the information contained in the true distribution can be generated by creating \hat{f}, the empirical distribution.

Yes. Since an unknown population distribution f has produced the observed data x_{1}, x_{2}, x_{3}, …, x_{n}, we can use the observed data to approximate f by its empirical distribution \hat{f} and then use \hat{f} to generate bootstrap replicates of the data.

How do we implement the hypothesis test then?

Using the same hypothesis testing framework. We first establish the null and the alternative hypothesis.

H_{0}: P(\theta > \theta^{*}) = 0.5

\theta is the test statistic computed from the bootstrap replicate and \theta^{*} is the basis value that we are testing. For example, a standard deviation of 16.5 is \theta^{*}, and the standard deviation computed from one bootstrap replicate is \theta.

The alternate hypothesis is then,

H_{A}: P(\theta > \theta^{*}) > 0.5

or

H_{A}: P(\theta > \theta^{*}) < 0.5

or

H_{A}: P(\theta > \theta^{*}) \neq 0.5

Essentially, for each bootstrap replicate i, we check whether \theta_{i} > \theta^{*}. If yes, we register S_{i}=1. If not, we register S_{i}=0.

Now, we can repeat this process, i.e., creating a bootstrap replicate, computing the test statistic, and recording S_{i} \in \{0,1\}, a large number of times, say N = 10,000. The proportion of times S_{i} = 1 in the set of N bootstrap-replicated test statistics is the p-value.
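In code, this whole procedure reduces to a short loop. Here is a minimal sketch in Python, assuming NumPy; the function name, the seed, and the defaults are illustrative.

```python
import numpy as np

def bootstrap_pvalue(data, stat_fn, basis, N=10_000, seed=1):
    """Fraction of N bootstrap-replicated statistics that exceed the basis value.

    This fraction is the p-value defined above (the mean of the indicators S_i).
    A sketch: the function name, seed, and defaults are illustrative.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    S = np.empty(N)
    for i in range(N):
        replicate = rng.choice(data, size=n, replace=True)   # draw n values with replacement
        S[i] = 1.0 if stat_fn(replicate) > basis else 0.0    # S_i = 1 if theta_i > theta*
    return S.mean()                                          # p-value = (1/N) * sum(S_i)
```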

And we can apply the rule of rejection if the p-value < \alpha, the selected rate of rejection.

Correct. That is for a one-sided hypothesis test. If it is a two-sided hypothesis test, we use the rule \frac{\alpha}{2} \le p-value \le 1-\frac{\alpha}{2} for non-rejection, i.e., we cannot reject the null hypothesis if the p-value is between \frac{\alpha}{2} and 1-\frac{\alpha}{2}.

Great! For the first bootstrap sample, if we were to verify the four hypotheses, we register the following.

Since the bootstrap sample mean \bar{x}_{boot}=67.8 is greater than the basis of 54, we register S_{i}=1.

Since the bootstrap sample median \tilde{x}_{boot}=69.5 is greater than the basis of 55, we register S_{i}=1.

Since the bootstrap sample standard deviation \sigma_{boot}=14.46 is less than the basis of 16.5, we register S_{i}=0.

Finally, since the bootstrap sample proportion p_{boot}=0.1 is less than the basis of 0.2, we register S_{i}=0.

We do this for a large number of bootstrap samples. Here is an illustration of the test statistics for three bootstrap replicates.

Let me run the hypothesis test on the mean using N = 10,000. I am creating 10,000 bootstrap-replicated test statistics.

The distribution of the test statistics is the null distribution of the mean. Notice that it resembles a normal distribution. The basis value of 54 is shown using a blue square on the distribution. From the null distribution, the proportion of times S_{i}=1 is 0.91. 91% of the \bar{x} test statistics are greater than 54.

Our null hypothesis is
H_{0}: P(\bar{x} > 54) = 0.5

Our alternate hypothesis is one-sided. H_{A}: P(\bar{x} > 54) < 0.5

Since the p-value is greater than a 5% rejection rate, we cannot reject the null hypothesis.

If the basis value \mu were so far out on the null distribution of \bar{x} that fewer than 5% of the bootstrap-replicated test statistics were greater than \mu, we would have rejected the null hypothesis.
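Reusing the bootstrap_pvalue sketch from above, the test on the mean might look like this; the exact p-value will wiggle around the 0.91 reported here depending on the random seed.

```python
scores = np.array([60, 41, 70, 61, 69, 95, 33, 34, 82, 82])

p_value = bootstrap_pvalue(scores, np.mean, basis=54)
print(p_value)               # about 0.91, as in the lesson

# One-sided alternative H_A: P(xbar > 54) < 0.5 -> reject when p-value < alpha
reject = p_value < 0.05      # False here, so we cannot reject H_0
```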

Shall we run the hypothesis test on the median?

H_{0}: P(\tilde{x} > 55) = 0.5

H_{A}: P(\tilde{x} > 55) < 0.5

Again, a one-sided test.

Sure. Here is the answer.

We can see the null distribution of the test statistic (median from the bootstrap samples) along with the basis value of 55.

86% of the test statistics are greater than this basis value. Hence, we cannot reject the null hypothesis.

The null distribution of the test statistic does not resemble any known distribution.

Yes. Since the bootstrap-based hypothesis test is distribution-free (non-parametric), not knowing the nature of the limiting distribution of the test statistic (median) does not restrain us.

Awesome. Let me also run the test for the standard deviation.

H_{0}: P(\sigma > 16.5) = 0.5

H_{A}: P(\sigma > 16.5) \ne 0.5

I am taking a two-sided test since a deviation in either direction, i.e., too small or too large a standard deviation, will disprove the hypothesis.

Here is the result.

The p-value is 0.85, i.e., 85% of the bootstrap-replicated test statistics are greater than 16.5. Since the p-value is greater than the acceptable rate of rejection, we cannot reject the null hypothesis.

If the p-value were less than 0.025 or greater than 0.975, then we would have rejected the null hypothesis.

For a p-value of 0.025, 97.5% of the bootstrap-replicated standard deviations will be less than 16.5 — strong evidence that the null distribution produces values much less than 16.5. For a p-value of 0.975, 97.5% of the bootstrap-replicated standard deviations will be greater than 16.5 — strong evidence that the null distribution produces values much greater than 16.5. On either side, we reject the null hypothesis that the standard deviation is 16.5.
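In code, the two-sided decision for the standard deviation test is just two comparisons; again a sketch, reusing the bootstrap_pvalue function from above.

```python
p_value = bootstrap_pvalue(scores, lambda x: np.std(x, ddof=1), basis=16.5)
print(p_value)               # about 0.85, as in the lesson

alpha = 0.05
# Reject only if the p-value falls in either tail
reject = (p_value < alpha / 2) or (p_value > 1 - alpha / 2)   # False here
```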

Let me complete the hypothesis test on the proportion.

H_{0}: P(p > 0.2) = 0.5

H_{A}: P(p > 0.2) \ne 0.5

Let’s take a two-sided test since deviation in either direction can disprove the null hypothesis. If we get a tiny proportion or a very high proportion compared to 0.2, we will reject the belief that the percentage of students obtaining a score of less than 40 is 0.2.

Here are the null distribution and the result from the test.

The p-value is 0.32. 3200 out of the 10000 bootstrap-replicated proportions are greater than 0.2. Since it is between 0.025 and 0.975, we cannot reject the null hypothesis.

You can see how widely the bootstrap concept can be applied for hypothesis testing and what flexibility it provides.

To summarize:

1. Repeatedly sample with replacement from the original sample data. Each time, draw a sample of size n.

2. Compute the desired statistic from each bootstrap sample (mean, median, standard deviation, interquartile range, proportion, skewness, etc.).

3. The null hypothesis P(\theta > \theta^{*}) = 0.5 can now be tested as follows: S_{i}=1 if \theta_{i} > \theta^{*}; else, S_{i}=0.

4. p-value = \frac{1}{N}\sum_{i=1}^{N} S_{i} (the average over all N bootstrap-replicated test statistics).

5. If p-value < \frac{\alpha}{2} or p-value > 1-\frac{\alpha}{2}, reject the null hypothesis (for a two-sided hypothesis test at a selected rejection rate of \alpha).

6. If p-value < \alpha, reject the null hypothesis (for a left-sided hypothesis test at a rejection rate of \alpha).

7. If p-value > 1 - \alpha, reject the null hypothesis (for a right-sided hypothesis test at a rejection rate of \alpha).
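Putting the whole recipe together, here is a compact, self-contained sketch in Python that runs all four tests from this lesson, assuming NumPy; the seed and the table of tests are illustrative. The printed p-values should land close to the 0.91, 0.86, 0.85, and 0.32 reported above, up to bootstrap randomness.

```python
import numpy as np

rng = np.random.default_rng(1)                     # illustrative seed
scores = np.array([60, 41, 70, 61, 69, 95, 33, 34, 82, 82])
N, n = 10_000, len(scores)

tests = {
    "mean > 54":            (np.mean, 54),
    "median > 55":          (np.median, 55),
    "std dev > 16.5":       (lambda x: np.std(x, ddof=1), 16.5),
    "P(score < 40) > 0.2":  (lambda x: np.mean(x < 40), 0.2),
}

for name, (stat_fn, basis) in tests.items():
    # N bootstrap replicates, each a sample of size n drawn with replacement
    replicates = rng.choice(scores, size=(N, n), replace=True)
    theta = np.apply_along_axis(stat_fn, 1, replicates)   # test statistic per replicate
    p_value = np.mean(theta > basis)                      # proportion of times S_i = 1
    print(f"{name}: p-value = {p_value:.2f}")             # compare with alpha (or alpha/2)
```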


Lesson 88 – The One-Sample Hypothesis Test – Part III

On Variance

H_{0}: \sigma^{2} = \sigma^{2}_{0}

H_{A}: \sigma^{2} > \sigma^{2}_{0}

H_{A}: \sigma^{2} < \sigma^{2}_{0}

H_{A}: \sigma^{2} \neq \sigma^{2}_{0}

Joe is cheery after an intense semester at his college. He is meeting Devine today for a casual conversation. We all know that their casual conversation always turns into something interesting. Are we in for a new concept today?

Devine: So, how did you fare in your exams?

Joe: Hmm, I did okay, but, interestingly, you are asking me about my performance in exams and not what I learned in my classes.

Devine: Well, Joe, these days, the college prepares you to be a good test taker. Learning is a thing of the past. I am glad you are still learning in your classes.

Joe: That is true to a large extent. We have exams after exams after exams, and our minds are compartmentalized to regurgitate one module after the other — no time to sit back and absorb what we see in classes.

By the way, I heard of an intriguing phenomenon from one of my professors. It might be of interest to you.

Devine: What is it?

Joe: In his eons of teaching, he has observed that the standard deviation of his class’s performance is 16.5 points. He told me that over the years, this had fed back into his ways of preparing exams. It seems that he subconsciously designs exams where the students’ grades will have a standard deviation of 16.5.

Devine: That is indeed an interesting phenomenon. Do you want to verify his hypothesis?

Joe: How can we do that?

Devine: Assuming that his test scores are normally distributed, we can conduct a hypothesis test on the variance of the distribution — H_{0}: \sigma^{2} = \sigma^{2}_{0}

Joe: Using a hypothesis testing framework?

Devine: Yes. Let’s first outline a null and alternate hypothesis. Since your professor is claiming that his exams are subconsciously designed for a standard deviation of 16.5, we will establish that this is the null hypothesis.

H_{0}: \sigma^{2} = 16.5^{2}

We can falsify this claim if the standard deviation is greater than or less than 16.5, i.e.,

H_{A}: \sigma^{2} \neq 16.5^{2}

The alternate hypothesis is two-sided. Deviation in either direction (less than or greater than) will reject the null hypothesis.

Would you be able to get some data on his recent exam scores?

Joe: I think I can ask some of my friends and get a sample of up to ten scores. Let me make some calls.

Here is a sample of ten exam scores from our most recent test with him.

60, 41, 70, 61, 69, 95, 33, 34, 82, 82

Devine: Fantastic. We can compute the standard deviation/variance from this sample and verify our hypothesis — whether this data provides evidence for the rejection of the null hypothesis.

Joe: Over the past few weeks, I was learning that we call it a parametric hypothesis test if we know the limiting form of the null distribution. I already know that we are doing a one-sample hypothesis test, but how do we know the type of the null distribution?

Devine: The sample variance (s^{2}) is a random variable that can be described using a probability distribution. Several weeks ago, in lesson 73 where we derived the T-distribution, and in lesson 75 where we derived the confidence interval of the variance, we learned that \frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

Since it was more than ten lessons ago, let’s go through the derivation once again. Ofttimes, repetition helps reinforce the ideas.

Joe: I think I remember it vaguely. Let me take a shot at the derivation 🙂

I will start with the equation of the sample variance s^{2}.

s^{2} = \frac{1}{n-1} \sum(x_{i}-\bar{x})^{2}

I will move the n-1 term over to the left-hand side and do some algebra.

(n-1)s^{2} = \sum(x_{i}-\bar{x})^{2}

(n-1)s^{2} = \sum(x_{i} - \mu -\bar{x} + \mu)^{2}

(n-1)s^{2} = \sum((x_{i} - \mu) -(\bar{x} - \mu))^{2}

(n-1)s^{2} = \sum[(x_{i} - \mu)^{2} + (\bar{x} - \mu)^{2} -2(x_{i} - \mu)(\bar{x} - \mu)]

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + \sum (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)\sum(x_{i} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(\sum x_{i} - \sum \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(n\bar{x} - n \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)(\bar{x} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)^{2}

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2}

Let me divide both sides of the equation by \sigma^{2}.

\frac{(n-1)s^{2}}{\sigma^{2}} = \frac{1}{\sigma^{2}}(\sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2})

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - \frac{n}{\sigma^{2}} (\bar{x} - \mu)^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - (\frac{\bar{x} - \mu}{\sigma/\sqrt{n}})^{2}

The right-hand side now is the sum of squared standard normal distributions — assuming x_{i} are draws from a normal distribution.

\frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + … + Z_{n}^{2} - Z^{2}

Sum of squares of (n - 1) standard normal random variables.

We learned in lesson 53 that if there are n standard normal random variables, Z_{1}, Z_{2}, …, Z_{n}, their sum of squares is a Chi-square distribution with n degrees of freedom. Its probability density function is f(\chi)=\frac{\frac{1}{2}(\frac{1}{2} \chi)^{\frac{n}{2}-1}e^{-\frac{1}{2}\chi}}{(\frac{n}{2}-1)!} for \chi > 0, and 0 otherwise.

Since we have \frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + … + Z_{n}^{2} - Z^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

\frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{df=n-1} with a probability distribution function f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}(\frac{1}{2} \chi)^{\frac{n-1}{2}-1}e^{-\frac{1}{2}\chi}}{(\frac{n-1}{2}-1)!}

Depending on the degrees of freedom, the distribution of \frac{(n-1)s^{2}}{\sigma^{2}} looks like this.

Smaller sample sizes imply lower degrees of freedom. The distribution will be highly skewed; asymmetric.

With larger sample sizes, i.e., higher degrees of freedom, the distribution tends toward symmetry.
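We can also check this result numerically. The sketch below, assuming NumPy and SciPy, simulates \frac{(n-1)s^{2}}{\sigma^{2}} from many normal samples and compares its quantiles with those of a Chi-square distribution with (n-1) degrees of freedom; the population parameters and the seed are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, mu, sigma = 10, 54, 16.5                    # illustrative normal population

# Simulate (n-1)s^2 / sigma^2 from many normal samples of size n
samples = rng.normal(loc=mu, scale=sigma, size=(50_000, n))
s2 = samples.var(axis=1, ddof=1)
statistic = (n - 1) * s2 / sigma**2

# Its quantiles should match a Chi-square distribution with n-1 = 9 degrees of freedom
for q in (0.025, 0.5, 0.975):
    print(q, np.quantile(statistic, q).round(2), stats.chi2.ppf(q, df=n - 1).round(2))
```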

Devine: Excellent job, Joe. As you have shown, \frac{(n-1)s^{2}}{\sigma^{2}} is our test statistic, \chi^{2}_{0}, which we will verify against a Chi-square distribution with (n-1) degrees of freedom.

Have you already decided on a rejection rate \alpha?

Joe: I will go with a 5% Type I error. If my professor’s assumption is indeed true, I am willing to commit a 5% error in my decision-making as I may get a sample from my friends that drives me to reject his null hypothesis.

Devine: Okay. Let’s then compute the test statistic.

s^{2} = \frac{1}{n-1} \sum(x_{i}-\bar{x})^{2}=452.01

\chi^{2}_{0} = \frac{(n-1)s^{2}}{\sigma^{2}} = \frac{9 \times 452.01}{16.5^{2}} = 14.94

Since we have a sample of ten exam scores, we should use a Chi-square distribution with nine degrees of freedom as the null distribution.

Under the null hypothesis H_{0}: \sigma^{2} = 16.5^{2}, for a two-sided hypothesis test at the 5% rejection level, \frac{(n-1)s^{2}}{\sigma^{2}} can vary between \chi^{2}_{0.025} and \chi^{2}_{0.975}, the lower and the upper percentiles of the Chi-square distribution.

If our test statistic \chi^{2}_{0} is either less than, or greater than the lower and the upper percentiles respectively, we reject the null hypothesis.

The lower and upper critical values at the 5% rejection rate (or a 95% confidence interval) are 2.70 and 19.02.

In lesson 75, we learned how to read this off the standard Chi-square table.
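If you prefer computing these values to reading them off the table, a few lines of Python, assuming SciPy, reproduce the test; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

scores = np.array([60, 41, 70, 61, 69, 95, 33, 34, 82, 82])
n = len(scores)

s2 = scores.var(ddof=1)                    # sample variance, 452.01
chi2_0 = (n - 1) * s2 / 16.5**2            # test statistic, about 14.94

lower = stats.chi2.ppf(0.025, df=n - 1)    # about 2.70
upper = stats.chi2.ppf(0.975, df=n - 1)    # about 19.02
print(chi2_0, lower, upper)
print(not (lower <= chi2_0 <= upper))      # reject? False -> cannot reject H_0
```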

Joe: Aha. Since our test statistic \chi^{2}_{0} = 14.94 lies between these critical values, we cannot reject the null hypothesis.

Devine: You are right. Look at this visual.

The rejection region based on the lower and the upper critical values (percentiles \chi^{2}_{0.025} and \chi^{2}_{0.975}) is shown in red triangles. The test statistic lies inside.

It is now easy to say that the p-value, i.e., P(\chi^{2}>\chi^{2}_{0}) or P(\chi^{2} \le \chi^{2}_{0}) is greater than \frac{\alpha}{2}.

Since we have a two-sided test, we compare the p-value with \frac{\alpha}{2}.

Hence we cannot reject the null hypothesis.

Joe: Looks like I cannot refute my professor’s observation that the standard deviation of his test scores is 16.5 points.

Devine: Yes, at the 5% rejection level, and assuming that his test scores are normally distributed.

Joe: Got it. If the test scores are not normally distributed, our assumption that \frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution is questionable. How then can we test the hypothesis?

Devine: We can use a non-parametric test using a bootstrap approach.

Joe: How is that done?

Devine: You will have to wait until the non-parametric hypothesis test lessons for that. But let me ask you a question based on today’s lesson. What is the main difference between the hypothesis test on the mean, which you learned in lesson 87, and the hypothesis test on the variance which you learned here?

Joe: 😕 😕 😕

For the hypothesis test on the mean, we looked at the difference between \bar{x} and \mu. For the hypothesis on the variance, we examine the ratio of s^{2} to \sigma^{2} and reject the null hypothesis if this ratio differs too much from what we expect under the null hypothesis, i.e., when H_{0} is true.


Lesson 87 – The One-Sample Hypothesis Test – Part II

On Mean

H_{0}: \mu = \mu_{0}

H_{A}: \mu > \mu_{0}

H_{A}: \mu < \mu_{0}

H_{A}: \mu \neq \mu_{0}

Tom grew up in the City of Ruritania. He went to high school there, met his wife there, and has a happy home. Ruritania, the land of natural springs, is known for its pristine water. Tom’s house is alongside the west branch of Mohawk, one such pristine river. Every once in a while, Tom and his family go to the nature park on the banks of Mohawk. It is customary for Tom and his little one to take a swim.

Lately, he has been sensing a decline in the quality of the water. It is a scary feeling as the consequences are slow. Tom starts attributing the poor water quality to this new factory in his neighborhood, constructed just upstream on the Mohawk.

Whether or not the addition of this new factory in Tom’s neighborhood reduced the quality of water compared to EPA standards is to be seen.

He immediately checked the EPA specifications for dissolved oxygen concentration in the river, and it is required by the EPA to have a minimum average concentration of 2 mg/L. Over the next ten days, Tom collected ten water samples from the west branch and got his friend Ron to measure the dissolved oxygen in his lab. In mg/L, the data reads like this.

1.8, 2, 2.1, 1.7, 1.2, 2.3, 2.5, 2.9, 1.9, 2.2

Tom wants to test if the average dissolved oxygen he sees from the samples significantly deviates from the one specified by EPA.

Does \bar{x} deviate from 2 mg/L? Is that deviation large enough to prompt caution?

He does this investigation using the hypothesis testing framework.

Since the investigation is regarding \bar{x}, the sample mean, and whether it is different from a selected value, it is reasonable to say that Tom is conducting a one-sample hypothesis test.

Tom knows this, and so do you and I — the sample mean (\bar{x}) is a random variable, and it can be described using a probability distribution. If Tom gets more data samples, he will get a slightly different sample mean. The value of the estimate changes with the change of sample, and this uncertainty can be represented using a normal distribution by the Central Limit Theorem.

The sample mean is an unbiased estimate of the true mean, so the expected value of the sample mean is equal to the truth: E[\bar{x}]=\mu. Go down memory lane to Lesson 67 to find out why.

The variance of the sample mean is V[\bar{x}]=\frac{\sigma^{2}}{n} . Variance tells us how widely the estimate is distributed around the center of the distribution. We know this from Lesson 68.

When we put these two together,

\bar{x} \sim N(\mu, \frac{\sigma^{2}}{n})

or,

\frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0,1)

Now, if the sample size (n) is large enough, it would be reasonable to substitute sample standard deviation (s) in place of \sigma.

When we substitute s for \sigma, we cannot just assume that \bar{x} will tend to a normal distribution.

W. S. Gosset (aka “Student”) taught us that \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} follows a T-distribution with (n-1) degrees of freedom. 

\frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \sim t_{df=n-1}

Anyhow, all this is to confirm that Tom is conducting a parametric hypothesis test.

CHOOSE THE APPROPRIATE TEST; ONE-SAMPLE OR TWO-SAMPLE AND PARAMETRIC OR NONPARAMETRIC — check.

Tom establishes the null and alternate hypotheses. He assumes that the inception of the factory does not affect the water quality downstream of the Mohawk River. Hence,

H_{0}: \mu \ge 2 mg/L

H_{A}: \mu < 2 mg/L

The alternate hypothesis is one-sided. A significant deviation in one direction (less than) needs to be seen to reject the null hypothesis.

Notice that his null hypothesis is \mu \ge 2 mg/L since it is required by the EPA to have a minimum average concentration of 2 mg/L.

ESTABLISH THE NULL AND ALTERNATE HYPOTHESIS — check.

Tom is taking a 5% risk of rejecting the null hypothesis; \alpha =0.05 . His Type I error is 5%.

Suppose the factory does not affect the water quality, but the ten samples he collected show a sample mean much smaller than the EPA prescription of 2 mg/L. He would then reject the null hypothesis, committing an error (a Type I error) in his decision making.

There is a certain level of subjectivity in the choice of \alpha . If Tom wants to see that the water quality is lower than 2 mg/L, he would perhaps choose to commit a greater error, i.e., select a larger value for \alpha .

If he wants to see that the water quality has not deteriorated, he will choose a smaller value for \alpha .

So, the decision to reject or not to reject the null hypothesis is based on \alpha .

DECIDE ON AN ACCEPTABLE RATE OF ERROR OR REJECTION RATE — check.

Since \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \sim t_{df=n-1}, the null distribution is a T-distribution with (n-1) degrees of freedom.

The test statistics is then t_{0} = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}, and Tom has to verify how likely it is to see a value as large as t_{0} in the null distribution.

Look at this visual.

The distribution is a T-distribution with 9 degrees of freedom. Tom had collected ten samples for this test. 

Since he opted for a rejection level of 5%, there is a cutoff on the distribution at -1.83.

-1.83 is the quantile corresponding to a 5% probability (rate of rejection) for a T-distribution with nine degrees of freedom.

If the test statistic (t_{0}) is less than t_{critical} which is -1.83, he will reject the null hypothesis. 

This decision is equivalent to rejecting the null hypothesis if P(T \le t_{0}) (the p-value) is less than \alpha.

From his data, the sample mean (\bar{x}) is 2.06 and the sample standard deviation (s) is 0.46. 

t_{0} = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}=\frac{2.06-2}{\frac{0.46}{\sqrt{10}}}=0.41.

t_{critical} can be read off from the standard T-Table, or P(T \le 0.41) can be computed from the distribution. 

At df = 9, and \alpha=5\%, t_{critical}=-1.83 and P(T \le 0.41)=0.66.
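Here is a minimal sketch of Tom's computation in Python, assuming NumPy and a recent SciPy; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

do = np.array([1.8, 2, 2.1, 1.7, 1.2, 2.3, 2.5, 2.9, 1.9, 2.2])   # dissolved oxygen, mg/L
n = len(do)
xbar, s = do.mean(), do.std(ddof=1)          # 2.06 and 0.46

t0 = (xbar - 2) / (s / np.sqrt(n))           # test statistic, about 0.41
t_critical = stats.t.ppf(0.05, df=n - 1)     # about -1.83
p_value = stats.t.cdf(t0, df=n - 1)          # P(T <= t0), about 0.66
print(t0, t_critical, p_value)

# The same test in one call:
print(stats.ttest_1samp(do, popmean=2, alternative="less"))

# The 90% confidence interval discussed below, roughly 1.8 to 2.33 mg/L:
print(stats.t.interval(0.90, df=n - 1, loc=xbar, scale=s / np.sqrt(n)))
```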

COMPUTE THE TEST STATISTIC AND ITS CORRESPONDING P-VALUE FROM THE OBSERVED DATA — check.

Since the test statistic t_{0} is not in the rejection region, or since p-value > \alpha, Tom cannot reject the null hypothesis that \mu \ge 2 mg/L.

MAKE THE DECISION; REJECT THE NULL HYPOTHESIS IF THE P-VALUE IS LESS THAN THE ACCEPTABLE RATE OF ERROR — check.

Tom could easily have checked the confidence interval of the true mean to make this decision. Recall that the confidence interval is the range within which the true value is expected to lie. So, based on the T-distribution with df = 9, Tom could develop the 90% confidence interval (why 90%?) and check if \mu = 2 mg/L is within that confidence interval.

Look at this visual. Tom just did that.

The confidence interval is from 1.8 mg/L to 2.32 mg/L and the null hypothesis that \mu = 2 mg/L is within the interval.

Hence, he cannot reject the null hypothesis.

While looking at the confidence interval gives us a visual intuition on what decision to make, it is always better to compute the p-value and compare it to the rejection rate.

Together, the p-value and \alpha provide the risk levels associated with decisions.

In this journey through the hypothesis framework, the next time we meet, we will unfold the knots of the test on the variance. Till then, meditate on this.

For a hypothesis test, reporting the p-value is in itself more informative than reporting only the decision. Once the p-value is known, any person who understands the context of the problem can decide for themselves whether or not to reject the null hypothesis. In other words, they can set their own level of rejection and compare the p-value to it.


Lesson 86 – The One-Sample Hypothesis Test – Part I

On Proportion

H_{0}: p = p_{0}

H_{A}: p > p_{0}

H_{A}: p < p_{0}

H_{A}: p \neq p_{0}

Our journey to the abyss of hypothesis tests begins today. In lesson 85, Joe and Devine, in their casual conversation about the success rate of a memory-boosting mocha, introduced us to the elements of hypothesis testing. Their conversation presented a statistical hypothesis test on proportion — whether the percentage of people who would benefit from the memory-booster coffee is higher than the percentage who would claim benefit randomly.

In this lesson, using a similar example on proportion, we will dig deeper into the elements of hypothesis testing.

To reiterate the central concept, we wish to test our assumption (null hypothesis H_{0}) against an alternate assumption (alternate hypothesis H_{A}). The purpose of a hypothesis test, then, is to verify whether empirical data supports the rejection of the null hypothesis.


Let’s assume that there is a vacation planner company in Palm Beach, Florida. They are finishing up their new Paradise Resorts and advertised that Paradise Resorts’ main attraction is its five out of seven bright and sunny days!

.

.

.

I know what you are thinking. It’s Florida, and five out of seven bright and sunny? What about the muggy thunderstorms and summer hurricanes?

Let’s keep that skepticism and consider their claim as a proposition, a hypothesis.

Since they claim that their resorts will have five out of seven bright and sunny days, we can assume a null hypothesis (H_{0}) that p = \frac{5}{7}. We can pit this against an alternate hypothesis (H_{A}) that p < \frac{5}{7} and use observational (experimental or empirical) data to verify whether H_{0} can be rejected.

We can go down to Palm Beach and observe the weather for a few days. Or, we may have been to Palm Beach enough times that we can bring that empirical data out of our memory. Suppose we observe or remember that seven out of 15 days, we had bright and sunny days.

With this information, we are ready to investigate Paradise Resorts’ claim.

Let’s refresh our memory on the essential steps for any hypothesis test.

1. Choose the appropriate test; one-sample or two-sample and parametric or nonparametric.

2. Establish the null and alternate hypothesis.

3. Decide on an acceptable rate of error or rejection rate (\alpha).

4. Compute the test statistic and its corresponding p-value from the observed data.

5. Make the decision; reject the null hypothesis if the p-value is less than the acceptable rate of error, \alpha.

Choose the appropriate test; one-sample or two-sample and parametric or nonparametric

We are verifying a statement about the parameter (proportion, p) of the population — whether or not p = \frac{5}{7}. So it is a one-sample hypothesis test. Since we are testing for proportion, we can assume a binomial distribution to derive the probabilities. So it is a parametric hypothesis test.

Establish the null and alternate hypothesis

Paradise Resorts’ claim is the null hypothesis — five out of seven bright and sunny days. The alternate hypothesis is that the proportion is less than what they claim.

H_{0}: p = \frac{5}{7}

H_{A}: p < \frac{5}{7}

We are considering a one-sided alternative since departures in one direction (less than) are sufficient to reject the null hypothesis.

Decide on an acceptable rate of error or rejection rate \alpha

Our decision on the acceptable rate of rejection is the risk we take for rejecting the truth. If we select 10% for \alpha, it implies that we are rejecting the null hypothesis 10% of the time. If the null hypothesis is true, by rejecting it, we are committing an error — Type I error.

A simple thought exercise will make this concept more clear. Suppose Paradise Resorts’ claim is true — the proportion of bright and sunny days is \frac{5}{7}. But, our observation provided a sample out of the population where we ended up seeing very few bright and sunny days. In this case, we have to reject the null hypothesis. We committed an error in our decision. By selecting \alpha, we are choosing the acceptable rate of error. We are accepting that we might reject the null hypothesis (when it is true), \alpha\% of the time.

The next step is to create the null distribution.

At the beginning of the test, we agreed that we observed seven out of 15 days to be bright and sunny. We collected a sample of 15 days out of which seven days were bright and sunny. The null distribution is the probability distribution of observing any number of days being bright and sunny, i.e., out of the 15 days, we could have had 0, 1, 2, 3, …, 14, 15 days to be bright and sunny. The null distribution is the distribution of the probability of observing these outcomes. In a Binomial null distribution with n=15 and p = 5/7, what is the probability of getting 0, 1, 2, …, 15?

P(X=0)={15 \choose 0} p^{0}(1-p)^{15-0}

P(X=1)={15 \choose 1} p^{1}(1-p)^{15-1}

…

P(X=15)={15 \choose 15} p^{15}(1-p)^{15-15}

It will look like this.

On this null distribution, you also see the region of rejection as defined by the selected rejection rate \alpha. Here, \alpha=10\%. In this null distribution, the quantile corresponding to \alpha=10\% is 8 days. Hence, if we observe more than eight bright and sunny days, we are not in the rejection region, and if we observe eight or fewer bright and sunny days, we are in the rejection region.

Compute the test statistic and its corresponding p-value from the observed data

Next, the question we ask is this.

In a Binomial null distribution with n = 15 and p = 5/7, what is the probability of getting a value as small as 7, i.e., seven or fewer bright and sunny days? If that outcome has a sufficiently low probability under the null hypothesis, it is unlikely to have occurred merely by chance, which casts doubt on the null hypothesis.

This probability is called the p-value. It is the probability of obtaining a value as extreme as the computed test statistic under the null hypothesis. The smaller the p-value, the less likely the observed statistic is under the null hypothesis, and the stronger the evidence for rejecting the null.

P(X \le 7)=\sum_{x=0}^{x=7}{15 \choose x} p^{x}(1-p)^{15-x}=0.04

You can see this probability in the figure below. The grey shade within the pink shade is the p-value.
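This computation is a couple of lines with SciPy's Binomial distribution; a sketch, with illustrative variable names.

```python
from scipy import stats

n, p0 = 15, 5 / 7
p_value = stats.binom.cdf(7, n, p0)     # P(X <= 7), about 0.04
cutoff = stats.binom.ppf(0.10, n, p0)   # quantile at alpha = 10%: 8 days
print(p_value, cutoff)
```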

Make the decision; Reject the null hypothesis if the p-value is less than the acceptable rate of error

It is evident at this point. Since the p-value (0.04) is less than our selected rate of error (0.1), we reject the null hypothesis, i.e., we reject Paradise Resorts’ claim that there will be five out of seven bright and sunny days.



This decision is based on the assumption that the null hypothesis is correct. Under this assumption, since we selected \alpha=10\%, we will reject the true null hypothesis 10% of the time. At the same time, we will fail to reject the null hypothesis 90% of the time. In other words, 90% of the time, our decision to not reject the null hypothesis will be correct.

Now, suppose Paradise Resorts’ hypothesis is false, i.e., they mistakenly think that there are five out of the seven bright and sunny days. However, it is not five in seven, but four in seven. What would be the consequence of their false null hypothesis?

.

.

.

Let’s think this through again.
Our testing framework is based on the assumption that

H_{0}: p = \frac{5}{7}

H_{A}: p < \frac{5}{7}

For this test, we select \alpha=10\% and make decisions based on the observed outcomes.

Accordingly, if we observe eight or fewer bright and sunny days, we will reject the hypothesis, and if we see more than eight bright and sunny days, we will fail to reject the null hypothesis. Based on \alpha=10\% and the assumed hypothesis that p = \frac{5}{7}, we fix eight as our cutoff point.

Paradise also thinks that p = \frac{5}{7}. If they are under a false assumption and we tested it based on this assumption, we might also commit an error — not rejecting the null hypothesis when it is false. This error is called Type II error or the lack of power in the test.

Look at this image. It shows the null distribution under our original assumption, the selected \alpha=10\%, and its corresponding quantile — 8 days. In the same image, we also see the null distribution if p = \frac{4}{7}. On this null distribution, there is a grey shaded region, which is the probability of not rejecting the null hypothesis based on \alpha=10\% and the quantile of 8 days. We assign the symbol \beta to this probability.

What is more interesting is its complement, 1-\beta, which is the probability of rejecting the null hypothesis when it is false. Based on our original assumption (which is false), we selected eight days or fewer as our rejection region. At this cutoff, if the data actually come from another null distribution, 1-\beta is the probability of rejecting the original hypothesis. The key is the choice of \alpha or its corresponding quantile. At a chosen \alpha, 1-\beta measures the ability of the test to reject a false hypothesis. 1-\beta is called the power of the test.

In this example, if the original hypothesis is true, i.e., if H_{0}: p = \frac{5}{7} is true, we will reject it 10% of the time and will not reject it 90% of the time. However, if the hypothesis is false (and p = \frac{4}{7}), we will reject it 48% of the time and will not reject it 52% of the time.
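These percentages can be reproduced directly from the two Binomial distributions; a sketch assuming SciPy, with the cutoff of 8 days fixed above.

```python
from scipy import stats

n, cutoff = 15, 8
alpha = stats.binom.cdf(cutoff, n, 5 / 7)       # P(X <= 8 | p = 5/7), Type I error near 10%
beta = 1 - stats.binom.cdf(cutoff, n, 4 / 7)    # P(not rejecting | p = 4/7), about 0.52
power = 1 - beta                                # probability of rejecting, about 0.48
print(alpha, beta, power)
```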

For smaller p, the power of the test increases. In other words, if the proportion of bright and sunny days is smaller compared to the original assumption of 5/7, the probability of rejecting it increases.

Keep in mind that we will not know the actual value of p.

The idea is that as the difference becomes larger, the original hypothesis becomes more and more false, and the power (1-\beta) measures the probability of rejecting this false hypothesis at our choice of \alpha.

Look at this summary table. It provides a summary of our discussion of the error framework.

Type I and Type II errors are inversely related.

If we decrease \alpha, and if the null hypothesis is false, the probability of not rejecting it (\beta) will increase.

You can see this intuitively from the image that has the original (false) null distribution and the possible true null distribution. If we move the quantile to the left (lower the rejection rate \alpha), the grey shaded region (the probability of not rejecting a false null hypothesis, \beta) increases.


At this point, you must be wondering whether all of this holds only for a sample of 15 days. What if we had more or fewer samples from the population?

The easiest way to understand the effect of sample size is to run the analysis for different n and different falsities (i.e., the difference from original p) and visualize it.

Here is one such analysis for three different sample sizes. The \alpha level that will be fixed based on the original hypothesis also varies by the sample size.

What we are seeing is the power function on the y-axis and the degree of falsity on the x-axis.

A higher degree of falsity implies that the null hypothesis is false by a greater magnitude. The first point on the x-axis is the fact that the null hypothesis is true. You can see that at this point, the power, i.e., the probability of rejecting the hypothesis, is 10%. At this point, we are just looking at \alpha, Type I error. As the degree of falsity increases, for that \alpha level, the power, 1-\beta (i.e., the probability of rejecting a false hypothesis) increases.

For a smaller sample size, the power increases slowly. For larger sample sizes, the power increases rapidly.
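Here is a sketch of that analysis in Python, assuming NumPy and SciPy; the three sample sizes and the grid of true proportions are illustrative choices, not necessarily the ones behind the figure.

```python
import numpy as np
from scipy import stats

p0 = 5 / 7                                     # the original (null) proportion
for n in (15, 30, 50):                         # illustrative sample sizes
    cutoff = stats.binom.ppf(0.10, n, p0)      # rejection quantile fixed under H_0
    for p_true in np.linspace(p0, 0.30, 7):    # increasing degree of falsity
        power = stats.binom.cdf(cutoff, n, p_true)
        print(n, round(float(p_true), 2), round(float(power), 2))
```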

Of course, selecting the optimal sample size for the experiment based on low Type I and Type II errors is doable.

I am sure there are plenty of concepts here that will need some time to process, especially Type I and Type II errors. This week, we focused our energy on the hypothesis test for proportion. The next time we meet, we will unfold the knots of the hypothesis test on the mean.

Till then, happy learning.

If you are still unsure about Type I and Type II errors, this analogy will help.

If the null hypothesis for a judicial system is that the defendant is innocent, Type I error occurs when the jury convicts an innocent person; Type II error occurs when the jury sets a guilty person free.


