Lesson 98 – The Two-Sample Hypothesis Tests Using the Bootstrap

Two-Sample Hypothesis Tests – Part VII

H_{0}: P(\theta_{x}>\theta_{y}) = 0.5

H_{A}: P(\theta_{x}>\theta_{y}) > 0.5

H_{A}: P(\theta_{x}>\theta_{y}) < 0.5

H_{A}: P(\theta_{x}>\theta_{y}) \neq 0.5

These days, a peek out of the window is greeted by chilling rain or warm snow. On days when it is not raining or snowing, there is biting cold. So we gaze at our bicycles, waiting for that pleasant month of April when we can joyfully bike — to work, or for pleasure.

Speaking of bikes, since I have nothing much to do today except watch the snow, I decided to explore some data from our favorite “Open Data for All New Yorkers” page.

Interestingly, I found data on the bicycle counts for East River Bridges. New York City DOT keeps track of the daily total of bike counts on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge.

I could find the data for April to October during 2016 and 2017. Here is how the data for April 2017 looks.

They highlight all non-holiday weekdays with no precipitation in yellow.

Being a frequent biker on the Manhattan Bridge, my curiosity got kindled. I wanted to verify how different the total bike counts on the Manhattan Bridge are from the Williamsburg Bridge.

At the same time, I also wanted to share the benefits of the bootstrap method for two-sample hypothesis tests.

To keep it simple and easy for you to follow the bootstrap method’s logical development, I will test how different the total bike counts data on Manhattan Bridge are from that of the Williamsburg Bridge during all the non-holiday weekdays with no precipitation.

Here is the data of the total bike counts on Manhattan Bridge during all the non-holiday weekdays with no precipitation in April of 2017 — essentially, the data from the yellow-highlighted rows in the table for Manhattan Bridge.

5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774

And the data of the total bike counts on Williamsburg Bridge during all the non-holiday weekdays with no precipitation in April of 2017.

5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196

Their distributions look like this.

We are looking at the boxplots that present a nice visual of the data range and its percentiles. And we can compare one sample to another. Remember Lesson 14? There is a vertical line at 6352 bikes, the maximum number of bikes on Manhattan Bridge during weekends, holidays, or rainy days — i.e., the non-highlighted days.

I want answers to the following questions.

  • Is the mean of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge, i.e., is \bar{x}_{M}=\bar{x}_{W}?
  • Is the median of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge, i.e., is \tilde{x}_{M}=\tilde{x}_{W}?
  • Is the variance of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge, i.e., is s^{2}_{M}=s^{2}_{W}?
  • Is the interquartile range of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge, i.e., is IQR_{M}=IQR_{W}?
  • Is the proportion of the total bike counts less than 6352 on Manhattan Bridge different from that on Williamsburg Bridge, i.e., is P(M<6352)=P(W<6352), or p_{M}=p_{W}?

What do we know so far? 

We know how to test the difference in means using the t-Test under the proposition that the population variances are equal (Lesson 94) or using Welch’s t-Test when we cannot assume equality of population variances (Lesson 95). We also know how to do this using Wilcoxon’s Rank-sum Test that uses the ranking method to approximate the significance of the differences in means (Lesson 96).

We know how to test the equality of variances using the F-distribution (Lesson 97).

We know how to test the difference in proportions using either Fisher’s Exact test (Lesson 92) or using the normal distribution as the null distribution under the large-sample approximation (Lesson 93).

In all these tests, we made critical assumptions on the limiting distributions of the test-statistics.

  • What is the limiting distribution of the test-statistic that computes the difference in medians?
  • What is the limiting distribution of the test-statistic that compares interquartile ranges of two populations?
  • What if we do not want to make any assumptions on data distributions or the limiting forms of the test-statistics?

Enter the Bootstrap

I would urge you to go back to Lesson 79 to get a quick refresher on the bootstrap, and Lesson 90 to recollect how we used it for the one-sample hypothesis tests.

The idea of the bootstrap is that we can generate replicates of the original sample to approximate the probability distribution function of the population. Assuming that each data value in the sample is equally likely with a probability of 1/n, we can randomly draw n values with replacement. By putting a probability of 1/n on each data point, we use the discrete empirical distribution \hat{f} as an approximation of the population distribution f.

Take the data for Manhattan Bridge.
5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these eight data points is 1/8, we can randomly draw eight numbers from these eight values — with replacement.

It is like playing the game of Bingo where the chips are these eight numbers. Each time we get a number, we put it back and roll it again until we draw eight numbers.

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774); some may appear more than once, and some may not show up at all in a random sample.

Here is one such bootstrap replicate.
6359, 6359, 6359, 6052, 6774, 6359, 5276, 6359

The value 6359 appeared five times. Some values like 7247, 5054, 6691, and 5311 did not appear at all in this replicate.

Here is another replicate.
6359, 5276, 5276, 5276, 7247, 5311, 6052, 5311

Such bootstrap replicates are representations of the empirical distribution \hat{f}, i.e., the proportion of times each value in the data sample occurs. We can generate all the information contained in the true distribution by creating \hat{f}, the empirical distribution.
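If you would like to generate such replicates yourself, here is a minimal sketch in Python, assuming numpy is available; the seed and the variable names are mine, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, just to make the example reproducible

# Total bike counts on Manhattan Bridge (non-holiday, no-precipitation weekdays, April 2017)
manhattan = np.array([5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774])

# One bootstrap replicate: draw n values with replacement,
# putting a probability of 1/n on each observed value
boot_replicate = rng.choice(manhattan, size=manhattan.size, replace=True)
print(boot_replicate)
```

Each run of the draw produces a different replicate, which is exactly what we exploit below.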

Using the Bootstrap for Two-Sample Hypothesis Tests

Since each bootstrap replicate is a possible representation of the population, we can compute the relevant test-statistics from this bootstrap sample. By repeating this, we can have many simulated values of the test-statistics that form the null distribution to test the hypothesis. There is no need to make any assumptions on the distributional nature of the data or the limiting distribution for the test-statistic. As long as we can compute a test-statistic from the bootstrap sample, we can test the hypothesis on any statistic — mean, median, variance, interquartile range, proportion, etc.

Let’s now use the bootstrap method for two-sample hypothesis tests. Suppose there are two random variables, X and Y, and let \theta_{x} and \theta_{y} be the statistics computed from them. We may have a sample of n_{1} values representing X and a sample of n_{2} values representing Y.

\theta_{x},\theta_{y} can be the mean, median, variance, proportion, etc. Any statistic computable from the original data can take the role of \theta_{x},\theta_{y}.

The null hypothesis is that there is no difference between the statistic of X and that of Y.

H_{0}: P(\theta_{x}>\theta_{y}) = 0.5

The alternate hypothesis is

H_{A}: P(\theta_{x}>\theta_{y}) > 0.5

or

H_{A}: P(\theta_{x}>\theta_{y}) < 0.5

or

H_{A}: P(\theta_{x}>\theta_{y}) \neq 0.5

We first create a bootstrap replicate of X and Y by randomly drawing with replacement n_{1} values from X and n_{2} values from Y.

For each bootstrap replicate i from X and Y, we compute the statistics \theta_{x} and \theta_{y} and check whether \theta_{x}>\theta_{y}. If yes, we register S_{i}=1. If not, we register S_{i}=0.

For example, one bootstrap replicate for X (Manhattan Bridge) and Y (Williamsburg Bridge) may look like this:

xboot: 6691 5311 6774 5311 6359 5311 5311 6052
yboot: 6775 6881 7341 7196 6775 7341 6775 7196

The means of this bootstrap replicate for X and Y are 5890 and 7035, respectively. Since \bar{x}^{X}_{boot}<\bar{x}^{Y}_{boot}, we register S_{i}=0.

Another bootstrap replicate for X and Y may look like this:

xboot: 6774 6359 6359 6359 6052 5054 6052 6691
yboot: 6775 7196 7196 6026 6881 7341 6881 5711

The means of this bootstrap replicate for X and Y are 6212.5 and 6750.875, respectively. Since \bar{x}^{X}_{boot}<\bar{x}^{Y}_{boot}, we register S_{i}=0.

We repeat this process of creating bootstrap replicates of X and Y, computing the statistics \theta_{x} and \theta_{y}, verifying whether \theta_{x}>\theta_{y}, and registering S_{i} \in \{0,1\}, a large number of times, say N=10,000.

The proportion of times S_{i} = 1 in a set of N bootstrap-replicated statistics is the p-value.

p-value=\frac{1}{N}\sum_{i=1}^{i=N}S_{i}

Based on the p-value, we can use the rule of rejection at the selected rate of rejection \alpha.

For a one-sided alternate hypothesis, we reject the null hypothesis if the p-value < \alpha (left-sided, H_{A}: P(\theta_{x}>\theta_{y}) < 0.5) or if the p-value > 1-\alpha (right-sided, H_{A}: P(\theta_{x}>\theta_{y}) > 0.5).

For a two-sided alternate hypothesis, we reject the null hypothesis if p-value < \frac{\alpha}{2} or p-value > 1-\frac{\alpha}{2}.
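Before we apply this to the bridges, here is a minimal sketch of the full recipe for the difference in means, assuming Python with numpy; N, the seed, and the array names are my own illustrative choices rather than anything prescribed by the method.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility

# April 2017 counts on non-holiday, no-precipitation weekdays
manhattan    = np.array([5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774])
williamsburg = np.array([5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196])

N = 10_000              # number of bootstrap replicates (illustrative choice)
S = np.zeros(N)         # S[i] = 1 if the Manhattan statistic exceeds the Williamsburg statistic

for i in range(N):
    # Bootstrap replicate of each sample: draw n1 and n2 values with replacement
    x_boot = rng.choice(manhattan, size=manhattan.size, replace=True)
    y_boot = rng.choice(williamsburg, size=williamsburg.size, replace=True)
    S[i] = 1.0 if x_boot.mean() > y_boot.mean() else 0.0

p_value = S.mean()      # proportion of replicates with theta_x > theta_y
alpha = 0.10

# Two-sided test: reject if p-value < alpha/2 or p-value > 1 - alpha/2
reject = (p_value < alpha / 2) or (p_value > 1 - alpha / 2)
print(p_value, "reject H0" if reject else "cannot reject H0")
```

Swapping the mean in the comparison line for any other computable statistic gives the corresponding test.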

Manhattan Bridge vs. Williamsburg Bridge

Is the mean of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(\bar{x}_{M}>\bar{x}_{W}) = 0.5

H_{A}: P(\bar{x}_{M}> \bar{x}_{W}) \neq 0.5

Let’s take a two-sided alternate hypothesis.

Here is the null distribution of \frac{\bar{x}_{M}}{\bar{x}_{W}} for N = 10,000.

A vertical bar is shown at a ratio of 1 to indicate that the area beyond this value is the proportion of times S_{i} = 1 in a set of 10,000 bootstrap-replicated means.

The p-value is 0.0466. 466 out of the 10,000 bootstrap replicates had \bar{x}_{M}>\bar{x}_{W}. For a 10% rate of error (\alpha=10\%), we reject the null hypothesis since the p-value is less than \frac{\alpha}{2}=0.05. More than 95% of the time, the mean of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge, so there is sufficient evidence that they are not equal. We reject the null hypothesis.

Can we reject the null hypothesis if we select a 5% rate of error?



Is the median of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(\tilde{x}_{M}>\tilde{x}_{W}) = 0.5

H_{A}: P(\tilde{x}_{M}> \tilde{x}_{W}) \neq 0.5

The null distribution of \frac{\tilde{x}_{M}}{\tilde{x}_{W}} for N = 10,000.

The p-value is 0.1549. 1549 out of the 10,000 bootstrap replicates had \tilde{x}_{M}>\tilde{x}_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value lies between 0.05 and 0.95. The evidence (84.51% of the time) that the median of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.



Is the variance of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(s^{2}_{M}>s^{2}_{W}) = 0.5

H_{A}: P(s^{2}_{M}>s^{2}_{W}) \neq 0.5

The null distribution of \sqrt{\frac{s^{2}_{M}}{s^{2}_{W}}} for N = 10,000. We are looking at the null distribution of the ratio of the standard deviations.

The p-value is 0.4839. 4839 out of the 10,000 bootstrap replicates had s^{2}_{M}>s^{2}_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value lies between 0.05 and 0.95. The evidence (51.61% of the time) that the variance of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.



Is the interquartile range of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(IQR_{M}>IQR_{W}) = 0.5

H_{A}: P(IQR_{M}>IQR_{W}) \neq 0.5

The null distribution of \frac{IQR_{M}}{IQR_{W}} for N = 10,000. It does not resemble any known distribution, but that does not restrain us since the bootstrap-based hypothesis test is distribution-free.

The p-value is 0.5453. 5453 out of the 10,000 bootstrap replicates had IQR_{M}>IQR_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value lies between 0.05 and 0.95. The evidence (45.47% of the time) that the interquartile range of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.



Finally, is the proportion of the total bike counts on Manhattan Bridge less than 6352, different from that on Williamsburg Bridge?

H_{0}: P(p_{M}>p_{W}) = 0.5

H_{A}: P(p_{M}>p_{W}) \neq 0.5

The null distribution of \frac{p_{M}}{p_{W}} for N = 10,000.

The p-value is 0.5991. 5991 out of the 10,000 bootstrap replicates had p_{M}>p_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value lies between 0.05 and 0.95. The evidence (40.09% of the time) that the proportion of the total bike counts less than 6352 on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.



Can you see the bootstrap concept’s flexibility and how widely we can apply it for hypothesis testing? Just remember that the underlying assumption is that the data are independent. 

To summarize (a code sketch of this recipe follows the list),

Repeatedly sample with replacement from the original samples of X and Y, N times.
Each time, draw a sample of size n_{1} from X and a sample of size n_{2} from Y.
Compute the desired statistic (mean, median, skew, etc.) from each bootstrap sample.
The null hypothesis P(\theta_{x}>\theta_{y})=0.5 can now be tested as follows:
  • S_{i}=1 if \theta_{x}>\theta_{y}, else, S_{i}=0
  • p-value=\frac{1}{N}\sum_{i=1}^{N}S_{i} (average over all N bootstrap-replicated statistics)
  • If p-value < \frac{\alpha}{2} or p-value > 1-\frac{\alpha}{2}, reject the null hypothesis for a two-sided hypothesis test at a selected rejection rate of \alpha.
  • If p-value < \alpha, reject the null hypothesis for a left-sided hypothesis test at a selected rejection rate of \alpha.
  • If p-value > 1-\alpha, reject the null hypothesis for a right-sided hypothesis test at a selected rejection rate of \alpha.
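One possible way to wrap this recipe into a reusable helper is sketched below, assuming Python with numpy; the function name, arguments, and lambda helpers are illustrative choices, not part of the lesson.

```python
import numpy as np

def bootstrap_two_sample_test(x, y, statistic, N=10_000, alpha=0.10,
                              alternative="two-sided", seed=None):
    """Bootstrap test of H0: P(theta_x > theta_y) = 0.5 for any computable statistic."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    S = np.empty(N)
    for i in range(N):
        x_boot = rng.choice(x, size=x.size, replace=True)
        y_boot = rng.choice(y, size=y.size, replace=True)
        S[i] = 1.0 if statistic(x_boot) > statistic(y_boot) else 0.0
    p_value = S.mean()
    if alternative == "two-sided":
        reject = (p_value < alpha / 2) or (p_value > 1 - alpha / 2)
    elif alternative == "left":        # H_A: P(theta_x > theta_y) < 0.5
        reject = p_value < alpha
    else:                              # "right": H_A: P(theta_x > theta_y) > 0.5
        reject = p_value > 1 - alpha
    return p_value, reject

manhattan    = [5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774]
williamsburg = [5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196]

# The same function handles every statistic from the lesson
sample_var = lambda v: np.var(v, ddof=1)                        # s^2
iqr        = lambda v: np.percentile(v, 75) - np.percentile(v, 25)
prop_below = lambda v: np.mean(np.asarray(v) < 6352)            # proportion of days below 6352

for name, stat in [("mean", np.mean), ("median", np.median),
                   ("variance", sample_var), ("IQR", iqr),
                   ("proportion < 6352", prop_below)]:
    p, rej = bootstrap_two_sample_test(manhattan, williamsburg, stat, seed=1)
    print(name, round(p, 4), "reject H0" if rej else "cannot reject H0")
```

Because the test only asks which replicate statistic is larger, the choice of statistic is completely open; that is the flexibility the lesson highlights.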

After seven lessons, we are now equipped with all the theory of the two-sample hypothesis tests. It is time to put them to practice. Dust off your programming machines and get set.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 90 – The One-Sample Hypothesis Tests Using the Bootstrap

Hypothesis Tests – Part V

H_{0}: P(\theta > \theta^{*}) = 0.5

H_{A}: P(\theta > \theta^{*}) > 0.5

H_{A}: P(\theta > \theta^{*}) < 0.5

H_{A}: P(\theta > \theta^{*}) \neq 0.5

Jenny and Joe meet after 18 lessons.

I heard you are neck-deep into the hypothesis testing concepts.

Yes. And I am having fun learning about how to test various hypotheses, be it on the mean, on the standard deviation, or the proportion. It is also enlightening to learn how to approximate the null distribution using the limiting distribution concepts.

True. You have seen in lesson 86 — hypothesis tests on the proportion, that the null distribution is a Binomial distribution with n, the sample size, and p, the proportion being tested.

You have seen in lesson 87 — hypothesis tests on the mean, that the null distribution is a T-distribution because \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \sim t_{df=n-1}

You have seen in lesson 88 — hypothesis tests on the variance, that the null distribution is a Chi-square distribution because \frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{df=n-1}

Did you ever wonder what if the test statistic is more complicated mathematically than the mean or the variance or the proportion and if its limiting distribution or the null distribution is hard to derive?

Or, did you ever ask what if the assumptions that go into deriving the null distribution are not met or not fully satisfied?

Can you give an example?

Suppose you want to test the hypothesis on the median or the interquartile range or the skewness of a distribution?

Or, if you are unsure about the distributional nature of the sample data? For instance, the assumption that \frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{df=n-1} is based on the premise that the sample data is normally distributed.

😕 There are non-parametric or distribution-free approaches? I remember Devine mentioning a bootstrap approach.

😎 In lesson 79, we learned about the concept of the bootstrap. Using the idea of the bootstrap, we can generate replicates of the original sample to approximate the probability distribution function of the population. Assuming that each data value in the sample is equally likely (with a probability of 1/n), we can randomly draw n values with replacement. By putting a probability of 1/n on each data point, we use the discrete empirical distribution \hat{f} as an approximation of the population distribution f.

Hmm. Since each bootstrap replicate is a possible representation of the population, we can compute the test statistics from this bootstrap sample. And, by repeating this, we can have many simulated values of the test statistics to create the null distribution against which we can test the hypothesis.

Exactly. No need to make any assumption on the distributional nature of the data at hand, or the kind of the limiting distribution for the test statistic. We can compute any test statistic from the bootstrap replicates and test the basis value using this simulated null distribution. You want to test the hypothesis on the median, go for it. On the skewness or geometric mean, no problem.

This sounds like an exciting approach that will free up the limitations. Why don’t we do it step by step and elaborate on the details? Our readers will appreciate it.

Absolutely. Do you want to use your professor’s hypothesis that the standard deviation of his class’s performance is 16.5 points, as a case in point?

Sure. In a recent conversation, he also revealed that the mean and median scores are 54 and 55 points, respectively, and that 20% of his class usually get a score of less than 40.

Aha. We can test all four hypotheses then. Let’s take the sample data.

60, 41, 70, 61, 69, 95, 33, 34, 82, 82

Yes, this is a sample of ten exam scores from our most recent test with him.

Let’s first review the concept of the bootstrap. We have the following data.

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these ten data points is 1/10, we can randomly draw ten numbers from these ten values — with replacement.

Yes, I can recall from lesson 79 that this is like playing the game of Bingo where the chips are these ten numbers. Each time we get a number, we put it back and roll it again until we draw ten numbers.

Yes. For real computations, we use a computer program that has this algorithm coded in. We draw a random number from a uniform distribution (f(u)) where u is between 0 and 1. These randomly drawn u's are mapped onto the ranked data to draw a specific value from the set. For example, in a set of ten values, for a randomly drawn u of 0.1, we can draw the first value in order — 33.
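In code, that mapping might look like the following minimal Python/numpy sketch (the seed is arbitrary); in practice, a single call to a sampling routine such as numpy’s rng.choice does the same job.

```python
import numpy as np

rng = np.random.default_rng(seed=1)    # arbitrary seed, for reproducibility

scores = np.array([60, 41, 70, 61, 69, 95, 33, 34, 82, 82])
ranked = np.sort(scores)               # 33, 34, 41, 60, 61, 69, 70, 82, 82, 95
n = ranked.size

u = rng.uniform(0, 1, size=n)          # n random numbers between 0 and 1
# Map each u onto the ranked data: u in (0, 1/n] -> 1st value, (1/n, 2/n] -> 2nd value, ...
idx = np.ceil(u * n).astype(int)       # e.g., u = 0.1 gives index 1, the first ranked value (33)
idx = np.clip(idx, 1, n) - 1           # guard the u = 0 corner case; shift to 0-based indexing
boot_replicate = ranked[idx]
print(boot_replicate)
```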

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (60, 41, 70, 61, 69, 95, 33, 34, 82, 82); some may appear more than once, and some may not show up at all in a random sample.

Let me create a bootstrap replicate.

See, 70 appeared two times, 82 appeared three times, and 33 did not get selected at all.

Such bootstrap replicates are representations of the empirical distribution \hat{f}. The empirical distribution \hat{f} is the proportion of times each value in the data sample x_{1}, x_{2}, x_{3}, …, x_{n} occurs. If we assume that the data sample has been generated by randomly sampling from the true distribution, then, the empirical distribution (i.e., the observed frequency) \hat{f} is a sufficient statistic for the true distribution f.

In other words, all the information contained in the true distribution can be generated by creating \hat{f}, the empirical distribution.

Yes. Since an unknown population distribution f has produced the observed data x_{1}, x_{2}, x_{3}, …, x_{n}, we can use the observed data to approximate f by its empirical distribution \hat{f} and then use \hat{f} to generate bootstrap replicates of the data.

How do we implement the hypothesis test then?

Using the same hypothesis testing framework. We first establish the null and the alternative hypothesis.

H_{0}: P(\theta > \theta^{*}) = 0.5

\theta is the test statistic computed from the bootstrap replicate and \theta^{*} is the basis value that we are testing. For example, a standard deviation of 16.5 is \theta^{*} and standard deviation computed from one bootstrap sample is \theta.

The alternate hypothesis is then,

H_{A}: P(\theta > \theta^{*}) > 0.5

or

H_{A}: P(\theta > \theta^{*}) < 0.5

or

H_{A}: P(\theta > \theta^{*}) \neq 0.5

Essentially, for each bootstrap replicate i, we check whether \theta_{i} > \theta^{*}. If yes, we register S_{i}=1. If not, we register S_{i}=0.

Now, we can repeat this process, i.e., creating a bootstrap replicate, computing the test statistic, verifying whether \theta_{i} > \theta^{*}, and registering S_{i} \in \{0,1\}, a large number of times, say N = 10,000. The proportion of times S_{i} = 1 in a set of N bootstrap-replicated test statistics is the p-value.

And we can apply the rule of rejection if the p-value < \alpha, the selected rate of rejection.

Correct. That is for a one-sided hypothesis test. If it is a two-sided hypothesis test, we use the rule \frac{\alpha}{2} \le p-value \le 1-\frac{\alpha}{2} for non-rejection, i.e., we cannot reject the null hypothesis if the p-value is between \frac{\alpha}{2} and 1-\frac{\alpha}{2}.

Great! For the first bootstrap sample, if we were to verify the four hypotheses, we register the following.

Since the bootstrap sample mean \bar{x}_{boot}=67.8 is greater than the basis of 54, we register S_{i}=1.

Since the bootstrap sample median \tilde{x}_{boot}=69.5 is greater than the basis of 55, we register S_{i}=1.

Since the bootstrap sample standard deviation \sigma_{boot}=14.46 is less than the basis of 16.5, we register S_{i}=0.

Finally, since the bootstrap sample proportion p_{boot}=0.1 is less than the basis of 0.2, we register S_{i}=0.

We do this for a large number of bootstrap samples. Here is an illustration of the test statistics for three bootstrap replicates.

Let me run the hypothesis test on the mean using N = 10,000. I am creating 10,000 bootstrap-replicated test statistics.
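Something like this minimal Python/numpy sketch would do the job (the seed, N, and variable names here are just for illustration).

```python
import numpy as np

rng = np.random.default_rng(seed=1)    # arbitrary seed, for reproducibility

scores = np.array([60, 41, 70, 61, 69, 95, 33, 34, 82, 82])
basis = 54                             # the professor's claimed mean score
N = 10_000                             # number of bootstrap replicates

S = np.zeros(N)
for i in range(N):
    boot = rng.choice(scores, size=scores.size, replace=True)   # one bootstrap replicate
    S[i] = 1.0 if boot.mean() > basis else 0.0                  # register S_i

p_value = S.mean()                     # proportion of replicates with theta > theta*
alpha = 0.05
# Left-sided alternative H_A: P(xbar > 54) < 0.5 -> reject if the p-value < alpha
print(p_value, "reject H0" if p_value < alpha else "cannot reject H0")
```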

The distribution of the test statistics is the null distribution of the mean. Notice that it resembles a normal distribution. The basis value of 54 is shown using a blue square on the distribution. From the null distribution, the proportion of times S_{i}=1 is 0.91. 91% of the \bar{x} test statistics are greater than 54.

Our null hypothesis is
H_{0}: P(\bar{x} > 54) = 0.5

Our alternate hypothesis is one-sided. H_{A}: P(\bar{x} > 54) < 0.5

Since the p-value is greater than a 5% rejection rate, we cannot reject the null hypothesis.

If the basis value \mu were so far out on the null distribution of \bar{x} that less than 5% of the bootstrap-replicated test statistics are greater than \mu, we would have rejected the null hypothesis.

Shall we run the hypothesis test on the median?

H_{0}: P(\tilde{x} > 55) = 0.5

H_{A}: P(\tilde{x} > 55) < 0.5

Again, a one-sided test.

Sure. Here is the answer.

We can see the null distribution of the test statistic (median from the bootstrap samples) along with the basis value of 55.

86% of the test statistics are greater than this basis value. Hence, we cannot reject the null hypothesis.

The null distribution of the test statistic does not resemble any known distribution.

Yes. Since the bootstrap-based hypothesis test is distribution-free (non-parametric), not knowing the nature of the limiting distribution of the test statistic (median) does not restrain us.

Awesome. Let me also run the test for the standard deviation.

H_{0}: P(\sigma > 16.5) = 0.5

H_{A}: P(\sigma > 16.5) \ne 0.5

I am taking a two-sided test since a deviation in either direction, i.e., too small a standard deviation or too large of a standard deviation will disprove the hypothesis.

Here is the result.

The p-value is 0.85, i.e., 85% of the bootstrap-replicated test statistics are greater than 16.5. Since the p-value is greater than the acceptable rate of rejection, we cannot reject the null hypothesis.

If the p-value were less than 0.025 or greater than 0.975, then we would have rejected the null hypothesis.

For a p-value of 0.025, 97.5% of the bootstrap-replicated standard deviations will be less than 16.5 — strong evidence that the null distribution produces values much less than 16.5. For a p-value of 0.975, 97.5% of the bootstrap-replicated standard deviations will be greater than 16.5 — strong evidence that the null distribution produces values much greater than 16.5. In either of the sides, we reject the null hypothesis that the standard deviation is 16.5.

Let me complete the hypothesis test on the proportion.

H_{0}: P(p > 0.2) = 0.5

H_{A}: P(p > 0.2) \ne 0.5

Let’s take a two-sided test since deviation in either direction can disprove the null hypothesis. If we get a tiny proportion or a very high proportion compared to 0.2, we will reject the belief that the percentage of students obtaining a score of less than 40 is 0.2.

Here are the null distribution and the result from the test.

The p-value is 0.32. 3200 out of the 10000 bootstrap-replicated proportions are greater than 0.2. Since it is between 0.025 and 0.975, we cannot reject the null hypothesis.

You can see how widely the bootstrap concept can be applied for hypothesis testing and what flexibility it provides.

To summarize (a code sketch of this recipe follows the list):

Repeatedly sample with replacement from the original sample data.
Each time, draw a sample of size n.
Compute the desired statistic (mean, median, standard deviation, interquartile range, proportion, skewness, etc.) from each bootstrap sample.
The null hypothesis P(\theta > \theta^{*}) = 0.5 can now be tested as follows:
  • S_{i}=1 if \theta_{i} > \theta^{*}, else, S_{i}=0
  • p-value = \frac{1}{N}\sum_{i=1}^{N} S_{i} (average over all N bootstrap-replicated test statistics)
  • If p-value < \frac{\alpha}{2} or p-value > 1-\frac{\alpha}{2}, reject the null hypothesis for a two-sided hypothesis test at a selected rejection rate of \alpha.
  • If p-value < \alpha, reject the null hypothesis for a left-sided hypothesis test at a rejection rate of \alpha.
  • If p-value > 1-\alpha, reject the null hypothesis for a right-sided hypothesis test at a rejection rate of \alpha.
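Mirroring the two-sample helper sketched in Lesson 98, here is one possible reusable version of this one-sample recipe, again assuming Python with numpy; the function name, arguments, and helper lambdas are illustrative choices, not from the original lesson.

```python
import numpy as np

def bootstrap_one_sample_test(data, statistic, basis, N=10_000, alpha=0.05,
                              alternative="two-sided", seed=None):
    """Bootstrap test of H0: P(theta > theta*) = 0.5 for any computable statistic."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    S = np.empty(N)
    for i in range(N):
        boot = rng.choice(data, size=data.size, replace=True)
        S[i] = 1.0 if statistic(boot) > basis else 0.0
    p_value = S.mean()
    if alternative == "two-sided":
        reject = (p_value < alpha / 2) or (p_value > 1 - alpha / 2)
    elif alternative == "left":        # H_A: P(theta > theta*) < 0.5
        reject = p_value < alpha
    else:                              # "right": H_A: P(theta > theta*) > 0.5
        reject = p_value > 1 - alpha
    return p_value, reject

scores = [60, 41, 70, 61, 69, 95, 33, 34, 82, 82]
sd = lambda v: np.std(v, ddof=1)                       # sample standard deviation
prop_below_40 = lambda v: np.mean(np.asarray(v) < 40)  # proportion scoring below 40

print(bootstrap_one_sample_test(scores, np.mean,       54,   alternative="left",      seed=1))
print(bootstrap_one_sample_test(scores, np.median,     55,   alternative="left",      seed=1))
print(bootstrap_one_sample_test(scores, sd,            16.5, alternative="two-sided", seed=1))
print(bootstrap_one_sample_test(scores, prop_below_40, 0.2,  alternative="two-sided", seed=1))
```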

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
