Lesson 90 – The One-Sample Hypothesis Tests Using the Bootstrap

Hypothesis Tests – Part V

$H_{0}: P(\theta > \theta^{*}) = 0.5$

$H_{A}: P(\theta > \theta^{*}) > 0.5$

$H_{A}: P(\theta > \theta^{*}) < 0.5$

$H_{A}: P(\theta > \theta^{*}) \neq 0.5$

Jenny and Joe meet after 18 lessons.

I heard you are neck-deep into the hypothesis testing concepts.

Yes. And I am having fun learning about how to test various hypotheses, be it on the mean, on the standard deviation, or the proportion. It is also enlightening to learn how to approximate the null distribution using the limiting distribution concepts.

True. You have seen in lesson 86 — hypothesis tests on the proportion, that the null distribution is a Binomial distribution with n, the sample size, and p, the proportion being tested.

You have seen in lesson 87 — hypothesis tests on the mean, that the null distribution is a T-distribution because $\frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \sim t_{df=n-1}$

You have seen in lesson 88 — hypothesis tests on the variance, that the null distribution is a Chi-square distribution because $\frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{df=n-1}$

Did you ever wonder what happens if the test statistic is mathematically more complicated than the mean, the variance, or the proportion, and its limiting distribution (the null distribution) is hard to derive?

Or, did you ever ask what if the assumptions that go into deriving the null distribution are not met or not fully satisfied?

Can you give an example?

Suppose you want to test the hypothesis on the median or the interquartile range or the skewness of a distribution?

Or, if you are unsure about the distributional nature of the sample data? For instance, the assumption that $\frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{df=n-1}$ is based on the premise that the sample data is normally distributed.

😕 Are there non-parametric or distribution-free approaches? I remember Devine mentioning a bootstrap approach.

😎 In lesson 79, we learned about the concept of the bootstrap. Using the idea of the bootstrap, we can generate replicates of the original sample to approximate the probability distribution function of the population. Assuming that each data value in the sample is equally likely (with a probability of 1/n), we can randomly draw n values with replacement. By putting a probability of 1/n on each data point, we use the discrete empirical distribution $\hat{f}$ as an approximation of the population distribution $f$ .

Hmm. Since each bootstrap replicate is a possible representation of the population, we can compute the test statistics from this bootstrap sample. And, by repeating this, we can have many simulated values of the test statistics to create the null distribution against which we can test the hypothesis.

Exactly. No need to make any assumption on the distributional nature of the data at hand, or the kind of the limiting distribution for the test statistic. We can compute any test statistic from the bootstrap replicates and test the basis value using this simulated null distribution. You want to test the hypothesis on the median, go for it. On the skewness or geometric mean, no problem.

This sounds like an exciting approach that will free up the limitations. Why don’t we do it step by step and elaborate on the details. Our readers will appreciate it.

Absolutely. Do you want to use your professor’s hypothesis that the standard deviation of his class’s performance is 16.5 points, as a case in point?

Sure. In a recent conversation, he also revealed that the mean and median scores are 54 and 55 points, respectively and that 20% of his class usually get a score of less than 40.

Aha. We can test all four hypotheses then. Let’s take the sample data.

60, 41, 70, 61, 69, 95, 33, 34, 82, 82

Yes, this is a sample of ten exam scores from our most recent test with him.

Let’s first review the concept of the bootstrap. We have the following data: 60, 41, 70, 61, 69, 95, 33, 34, 82, 82.

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these ten data points is 1/10, we can randomly draw ten numbers from these ten values — with replacement.

Yes, I can recall from lesson 79 that this is like playing the game of Bingo where the chips are these ten numbers. Each time we get a number, we put it back and roll it again until we draw ten numbers.

Yes. For real computations, we use a computer program that has this algorithm coded in. We draw a random number $u$ from a uniform distribution $f(u)$ , where $u$ is between 0 and 1. These randomly drawn values of $u$ are mapped onto the ranked data to draw a specific value from the set. For example, in a set of ten values, a randomly drawn $u$ of 0.1 maps to the first value in rank order, 33.

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (60, 41, 70, 61, 69, 95, 33, 34, 82, 82), some may appear more than one time, and some may not show up at all in a random sample.
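The mapping from a uniform random number onto the ranked data can be sketched in a few lines of Python. This is only an illustration of the idea, not the program used for this lesson; the function names `value_for_u` and `draw_replicate` and the seed are assumptions made for the example.

```python
import math
import random

scores = [60, 41, 70, 61, 69, 95, 33, 34, 82, 82]

def value_for_u(ranked, u):
    """Map a uniform number u in [0, 1) onto one of the ranked values."""
    n = len(ranked)
    rank = max(1, math.ceil(u * n))  # ranks 1..n, each with probability 1/n
    return ranked[rank - 1]

def draw_replicate(data, rng):
    """Draw len(data) values from data, with replacement."""
    ranked = sorted(data)
    return [value_for_u(ranked, rng.random()) for _ in range(len(ranked))]

rng = random.Random(79)  # arbitrary seed so the sketch is reproducible
print(value_for_u(sorted(scores), 0.1))  # u = 0.1 picks the first ranked value
print(draw_replicate(scores, rng))       # one bootstrap replicate of size 10
```

In practice, a built-in such as `random.choices(scores, k=10)` does the same job in one call.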

Let me create a bootstrap replicate.

See, 70 appeared two times, 82 appeared three times, and 33 did not get selected at all.

Such bootstrap replicates are representations of the empirical distribution $\hat{f}$ . The empirical distribution $\hat{f}$ is the proportion of times each value in the data sample $x_{1}, x_{2}, x_{3}, …, x_{n}$ occurs. If we assume that the data sample has been generated by randomly sampling from the true distribution, then, the empirical distribution (i.e., the observed frequency) $\hat{f}$ is a sufficient statistic for the true distribution $f$ .

In other words, all the information about the true distribution $f$ that the sample carries is contained in $\hat{f}$ , the empirical distribution.

Yes. Since an unknown population distribution $f$ has produced the observed data $x_{1}, x_{2}, x_{3}, …, x_{n}$ , we can use the observed data to approximate $f$ by its empirical distribution $\hat{f}$ and then use $\hat{f}$ to generate bootstrap replicates of the data.

How do we implement the hypothesis test then?

Using the same hypothesis testing framework. We first establish the null and the alternative hypothesis.

$H_{0}: P(\theta > \theta^{*}) = 0.5$

$\theta$ is the test statistic computed from the bootstrap replicate and $\theta^{*}$ is the basis value that we are testing. For example, a standard deviation of 16.5 is $\theta^{*}$ and standard deviation computed from one bootstrap sample is $\theta$ .

The alternate hypothesis is then,

$H_{A}: P(\theta > \theta^{*}) > 0.5$

$H_{A}: P(\theta > \theta^{*}) < 0.5$

$H_{A}: P(\theta > \theta^{*}) \neq 0.5$

Essentially, for each bootstrap replicate i, we check whether $\theta_{i} > \theta^{*}$ . If yes, we register $S_{i}=1$ . If not, we register $S_{i}=0$ .

Now, we can repeat this process (creating a bootstrap replicate, computing the test statistic, and recording whether $\theta_{i} > \theta^{*}$ , i.e., $S_{i} \in \{0,1\}$ ) a large number of times, say N = 10,000. The proportion of times $S_{i} = 1$ in the set of N bootstrap-replicated test statistics is the p-value.

And we reject the null hypothesis if the $p-value < \alpha$ , the selected rate of rejection.

Correct. That is for a one-sided (left-sided) hypothesis test; for a right-sided test, the mirror-image rule $p-value > 1-\alpha$ applies. If it is a two-sided hypothesis test, we use the rule $\frac{\alpha}{2} \le p-value \le 1- \frac{\alpha}{2}$ for non-rejection, i.e., we cannot reject the null hypothesis if the p-value is between $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ .
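The indicator $S_{i}$ , the p-value, and the rejection rules can be written down as a short Python sketch. The function names `p_value` and `reject_null` are assumptions for illustration. The list `thetas` holds hypothetical statistics from four replicates, not results from this lesson. The right-sided rule is taken as the mirror image of the left-sided one.

```python
def p_value(thetas, basis):
    """Proportion of bootstrap statistics theta_i that exceed the basis value."""
    indicators = [1 if theta > basis else 0 for theta in thetas]  # S_i
    return sum(indicators) / len(indicators)

def reject_null(p, alpha=0.05, alternative="two-sided"):
    """Apply the rejection rule for the chosen alternative hypothesis."""
    if alternative == "left":       # H_A: P(theta > theta*) < 0.5
        return p < alpha
    if alternative == "right":      # H_A: P(theta > theta*) > 0.5
        return p > 1 - alpha
    if alternative == "two-sided":  # H_A: P(theta > theta*) != 0.5
        return p < alpha / 2 or p > 1 - alpha / 2
    raise ValueError("alternative must be 'left', 'right' or 'two-sided'")

# Hypothetical statistics from four replicates, tested against a basis of 54.
thetas = [67.8, 52.0, 60.1, 49.5]
p = p_value(thetas, 54)
print(p, reject_null(p, alternative="left"))
```

With only four replicates the p-value is very coarse; in a real test the list would hold N = 10,000 values.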

Great! For the first bootstrap sample, if we were to verify the four hypotheses, we register the following.

Since the bootstrap sample mean $\bar{x}_{boot}=67.8$ is greater than the basis of 54, we register $S_{i}=1$ .

Since the bootstrap sample median $\tilde{x}_{boot}=69.5$ is greater than the basis of 55, we register $S_{i}=1$ .

Since the bootstrap sample standard deviation $\sigma_{boot}=14.46$ is less than the basis of 16.5, we register $S_{i}=0$ .

Finally, since the bootstrap sample proportion $p_{boot}=0.1$ is less than the basis of 0.2, we register $S_{i}=0$ .
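Registering the four indicators for a single replicate might look like the following Python sketch. The replicate here is drawn with an arbitrary seed, so it will not be the same replicate discussed above. The dictionary keys and function names are assumptions for illustration, and the standard deviation is assumed to use the n-1 denominator (`statistics.stdev`).

```python
import random
import statistics

scores = [60, 41, 70, 61, 69, 95, 33, 34, 82, 82]

# Basis values stated by the professor.
basis = {"mean": 54, "median": 55, "sd": 16.5, "proportion_below_40": 0.2}

def replicate_statistics(replicate):
    """Compute the four test statistics from one bootstrap replicate."""
    return {
        "mean": statistics.mean(replicate),
        "median": statistics.median(replicate),
        "sd": statistics.stdev(replicate),  # sample standard deviation (n - 1)
        "proportion_below_40": sum(x < 40 for x in replicate) / len(replicate),
    }

def indicators(replicate):
    """S_i = 1 if the statistic exceeds its basis value, else 0."""
    stats = replicate_statistics(replicate)
    return {name: int(stats[name] > basis[name]) for name in basis}

rng = random.Random(1)  # arbitrary seed
replicate = rng.choices(scores, k=len(scores))
print(replicate)
print(replicate_statistics(replicate))
print(indicators(replicate))
```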

We do this for a large number of bootstrap samples. Here is an illustration of the test statistics for three bootstrap replicates.

Let me run the hypothesis test on the mean using N = 10,000. I am creating 10,000 bootstrap-replicated test statistics.

The distribution of the test statistics is the null distribution of the mean. Notice that it resembles a normal distribution. The basis value of 54 is shown using a blue square on the distribution. From the null distribution, the proportion of times $S_{i}=1$ is 0.91. 91% of the $\bar{x}$ test statistics are greater than 54.

Our null hypothesis is
$H_{0}: P(\bar{x} > 54) = 0.5$

Our alternate hypothesis is one-sided. $H_{A}: P(\bar{x} > 54) < 0.5$

Since the p-value of 0.91 is greater than the 5% rejection rate, we cannot reject the null hypothesis.

If the basis value $\mu$ were so far out on the null distribution of $\bar{x}$ that less than 5% of the bootstrap-replicated test statistics were greater than $\mu$ , we would have rejected the null hypothesis.
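A minimal Python sketch of the test on the mean with N = 10,000 follows. The seed is arbitrary, so the p-value will differ slightly from run to run and from the 0.91 reported here, but it should land close to it.

```python
import random

scores = [60, 41, 70, 61, 69, 95, 33, 34, 82, 82]
n = len(scores)
N = 10_000               # number of bootstrap replicates
rng = random.Random(90)  # arbitrary seed

# Mean of each bootstrap replicate (n draws with replacement).
boot_means = [sum(rng.choices(scores, k=n)) / n for _ in range(N)]

# Proportion of replicated means greater than the basis value of 54.
p_value = sum(m > 54 for m in boot_means) / N

# Left-sided test at a 5% rejection rate.
reject = p_value < 0.05
print(f"p-value = {p_value:.2f}, reject H0: {reject}")
```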

Shall we run the hypothesis test on the median?

$H_{0}: P(\tilde{x} > 55) = 0.5$

$H_{A}: P(\tilde{x} > 55) < 0.5$

Again, a one-sided test.

Sure. Here is the answer.

We can see the null distribution of the test statistic (median from the bootstrap samples) along with the basis value of 55.

86% of the test statistics are greater than this basis value. Hence, we cannot reject the null hypothesis.

The null distribution of the test statistic does not resemble any known distribution.

Yes. Since the bootstrap-based hypothesis test is distribution-free (non-parametric), not knowing the nature of the limiting distribution of the test statistic (median) does not restrain us.
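The same loop works for the median by swapping the statistic, as in this Python sketch (the seed is arbitrary). The tally of the most frequent values is added here only to show how lumpy the bootstrap distribution of the median is for a sample of ten.

```python
import random
import statistics
from collections import Counter

scores = [60, 41, 70, 61, 69, 95, 33, 34, 82, 82]
n = len(scores)
N = 10_000
rng = random.Random(90)  # arbitrary seed

boot_medians = [statistics.median(rng.choices(scores, k=n)) for _ in range(N)]

# Proportion of replicated medians greater than the basis value of 55.
p_value = sum(m > 55 for m in boot_medians) / N
reject = p_value < 0.05  # left-sided test at a 5% rejection rate

print(f"p-value = {p_value:.2f}, reject H0: {reject}")
print(Counter(boot_medians).most_common(5))  # a few values dominate
```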

Awesome. Let me also run the test for the standard deviation.

$H_{0}: P(\sigma > 16.5) = 0.5$

$H_{A}: P(\sigma > 16.5) \ne 0.5$

I am taking a two-sided test since a deviation in either direction, i.e., too small or too large a standard deviation, will disprove the hypothesis.

Here is the result.

The p-value is 0.85, i.e., 85% of the bootstrap-replicated test statistics are greater than 16.5. Since the p-value lies between 0.025 and 0.975, the non-rejection range for a two-sided test at a 5% rejection rate, we cannot reject the null hypothesis.

If the p-value were less than 0.025 or greater than 0.975, then we would have rejected the null hypothesis.

For a p-value of 0.025, 97.5% of the bootstrap-replicated standard deviations will be less than 16.5, which is strong evidence that the null distribution produces values much less than 16.5. For a p-value of 0.975, 97.5% of the bootstrap-replicated standard deviations will be greater than 16.5, which is strong evidence that the null distribution produces values much greater than 16.5. In either case, we reject the null hypothesis that the standard deviation is 16.5.
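The two-sided test on the standard deviation can be sketched in Python as follows. It assumes the sample standard deviation with the n-1 denominator (`statistics.stdev`) and an arbitrary seed, so the p-value should come out near, but not exactly at, the 0.85 reported above.

```python
import random
import statistics

scores = [60, 41, 70, 61, 69, 95, 33, 34, 82, 82]
n = len(scores)
N = 10_000
alpha = 0.05
rng = random.Random(90)  # arbitrary seed

boot_sds = [statistics.stdev(rng.choices(scores, k=n)) for _ in range(N)]

# Proportion of replicated standard deviations greater than 16.5.
p_value = sum(s > 16.5 for s in boot_sds) / N

# Two-sided rule: reject when the p-value falls in either tail.
reject = p_value < alpha / 2 or p_value > 1 - alpha / 2
print(f"p-value = {p_value:.2f}, reject H0: {reject}")
```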

Let me complete the hypothesis test on the proportion.

$H_{0}: P(p > 0.2) = 0.5$

$H_{A}: P(p > 0.2) \ne 0.5$

Let’s take a two-sided test since deviation in either direction can disprove the null hypothesis. If we get a tiny proportion or a very high proportion compared to 0.2, we will reject the belief that the proportion of students obtaining a score of less than 40 is 0.2.

Here are the null distribution and the result from the test.

The p-value is 0.32. 3,200 out of the 10,000 bootstrap-replicated proportions are greater than 0.2. Since it is between 0.025 and 0.975, we cannot reject the null hypothesis.
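For the proportion, the simulated p-value can be checked against an exact calculation. Two of the ten scores (33 and 34) are below 40, so the number of scores below 40 in a replicate follows a Binomial distribution with n = 10 and p = 0.2. The Python sketch below compares the two; the seed and variable names are assumptions for illustration.

```python
import math
import random

scores = [60, 41, 70, 61, 69, 95, 33, 34, 82, 82]
n = len(scores)
N = 10_000
alpha = 0.05
rng = random.Random(90)  # arbitrary seed

# Proportion of scores below 40 in each bootstrap replicate.
boot_props = [sum(x < 40 for x in rng.choices(scores, k=n)) / n for _ in range(N)]
p_value = sum(p > 0.2 for p in boot_props) / N

# Exact check: a proportion above 0.2 means at least 3 of the 10 draws are below 40.
p_hat = sum(x < 40 for x in scores) / n  # 0.2 in the original sample
exact = sum(math.comb(n, k) * p_hat**k * (1 - p_hat)**(n - k) for k in range(3, n + 1))

reject = p_value < alpha / 2 or p_value > 1 - alpha / 2
print(f"simulated = {p_value:.3f}, exact = {exact:.3f}, reject H0: {reject}")
```

The exact value is about 0.32, in line with the simulated result. Such a closed form is not available for most statistics, which is where the simulation earns its keep.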

You can see how widely the bootstrap concept can be applied for hypothesis testing and what flexibility it provides.

To summarize:

Repeatedly sample with replacement from the original sample data. 
Each time, draw a sample of size n.

Compute the desired statistic from each bootstrap sample.
(mean, median, standard deviation, interquartile range, 
 proportion, skewness, etc.)

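The null hypothesis $H_{0}: P(\theta > \theta^{*}) = 0.5$ can now be tested as follows:

$S_{i}=1$ if $\theta_{i} > \theta^{*}$ , else, $S_{i}=0$

$p-value = \frac{1}{N}\sum_{i=1}^{N} S_{i}$
(average over all N bootstrap-replicated test statistics)

If $p-value < \frac{\alpha}{2}$ or $p-value > 1-\frac{\alpha}{2}$ , reject the null hypothesis
(for a two-sided hypothesis test at a selected rejection rate of $\alpha$ )

If $p-value < \alpha$ , reject the null hypothesis
(for a left-sided hypothesis test at a rejection rate of $\alpha$ )

If $p-value > 1-\alpha$ , reject the null hypothesis
(for a right-sided hypothesis test at a rejection rate of $\alpha$ )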