Lesson 98 – The Two-Sample Hypothesis Tests using the Bootstrap

Two-Sample Hypothesis Tests – Part VII

H_{0}: P(\theta_{x}>\theta_{y}) = 0.5

H_{A}: P(\theta_{x}>\theta_{y}) > 0.5

H_{A}: P(\theta_{x}>\theta_{y}) < 0.5

H_{A}: P(\theta_{x}>\theta_{y}) \neq 0.5

These days, a peek out of the window is greeted by chilling rain or warm snow. On days when it is not raining or snowing, there is biting cold. So we gaze at our bicycles, waiting for that pleasant month of April when we can joyfully bike — to work, or for pleasure.

Speaking of bikes, since I have nothing much to do today except watch the snow, I decided to explore some data from our favorite “Open Data for All New Yorkers” page.

Interestingly, I found data on the bicycle counts for East River Bridges. New York City DOT keeps track of the daily total of bike counts on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge.

I could find the data for April to October during 2016 and 2017. Here is how the data for April 2017 looks.

They highlight all non-holiday weekdays with no precipitation in yellow.

Being a frequent biker on the Manhattan Bridge, my curiosity got kindled. I wanted to verify how different the total bike counts on the Manhattan Bridge are from the Williamsburg Bridge.

At the same time, I also wanted to share the benefits of the bootstrap method for two-sample hypothesis tests.

To keep it simple and easy for you to follow the bootstrap method’s logical development, I will test how different the total bike counts data on Manhattan Bridge are from that of the Williamsburg Bridge during all the non-holiday weekdays with no precipitation.

Here is the data of the total bike counts on Manhattan Bridge during all the non-holiday weekdays with no precipitation in April of 2017 — essentially, the data from the yellow-highlighted rows in the table for Manhattan Bridge.

5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774

And the data of the total bike counts on Williamsburg Bridge during all the non-holiday weekdays with no precipitation in April of 2017.

5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196

Their distributions look like this.

We are looking at the boxplots that present a nice visual of the data range and its percentiles. And we can compare one sample to another. Remember Lesson 14? There is a vertical line at 6352 bikes, the maximum number of bikes on Manhattan Bridge during weekends, holidays, or rainy days — i.e., the non-highlighted days.

I want answers to the following questions.

Is the mean of the total bike counts on Manhattan Bridge different from that on Williamsburg Bridge? Is \bar{x}_{M}=\bar{x}_{W}?
Is the median of the total bike counts on Manhattan Bridge different from that on Williamsburg Bridge? Is \tilde{x}_{M}=\tilde{x}_{W}?
Is the variance of the total bike counts on Manhattan Bridge different from that on Williamsburg Bridge? Is s^{2}_{M}=s^{2}_{W}?
Is the interquartile range of the total bike counts on Manhattan Bridge different from that on Williamsburg Bridge? Is IQR_{M}=IQR_{W}?
Is the proportion of the total bike counts less than 6352 on Manhattan Bridge different from that on Williamsburg Bridge? Is P(M<6352)=P(W<6352), or p_{M}=p_{W}?
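
Before testing anything, it helps to see the five sample statistics side by side. Here is a minimal sketch in Python with NumPy; the language, the helper name `summarize`, and the quantile convention used for the interquartile range (NumPy's default linear interpolation) are my assumptions, not something the lesson prescribes.

```python
import numpy as np

# April 2017 counts on non-holiday weekdays with no precipitation (from the lesson)
manhattan = np.array([5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774])
williamsburg = np.array([5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196])
THRESHOLD = 6352  # the vertical line drawn on the boxplots


def summarize(counts):
    """Return the five sample statistics asked about in the questions."""
    # Assumption: NumPy's default (linear interpolation) quantile definition.
    q25, q75 = np.percentile(counts, [25, 75])
    return {
        "mean": float(np.mean(counts)),
        "median": float(np.median(counts)),
        "variance": float(np.var(counts, ddof=1)),  # sample variance, n - 1
        "iqr": float(q75 - q25),
        "prop_below": float(np.mean(counts < THRESHOLD)),
    }


summary_m = summarize(manhattan)
summary_w = summarize(williamsburg)
for name in summary_m:
    print(f"{name:>10}  Manhattan: {summary_m[name]:12.3f}  "
          f"Williamsburg: {summary_w[name]:12.3f}")
```

The interquartile range is the one number here that can change with the quantile definition, so treat it as indicative.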

What do we know so far? 

We know how to test the difference in means using the t-Test under the proposition that the population variances are equal (Lesson 94) or using Welch’s t-Test when we cannot assume equality of population variances (Lesson 95). We also know how to do this using Wilcoxon’s Rank-sum Test that uses the ranking method to approximate the significance of the differences in means (Lesson 96).

We know how to test the equality of variances using F-distribution (Lesson 97).

We know how to test the difference in proportions using either Fisher’s Exact test (Lesson 92) or using the normal distribution as the null distribution under the large-sample approximation (Lesson 93).

In all these tests, we made critical assumptions on the limiting distributions of the test-statistics.

  • What is the limiting distribution of the test-statistic that computes the difference in medians?
  • What is the limiting distribution of the test-statistic that compares interquartile ranges of two populations?
  • What if we do not want to make any assumptions on data distributions or the limiting forms of the test-statistics?

Enter the Bootstrap

I would urge you to go back to Lesson 79 to get a quick refresher on the bootstrap, and Lesson 90 to recollect how we used it for the one-sample hypothesis tests.

The idea of the bootstrap is that we can generate replicates of the original sample to approximate the probability distribution function of the population. Assuming that each data value in the sample is equally likely with a probability of 1/n, we can randomly draw n values with replacement. By putting a probability of 1/n on each data point, we use the discrete empirical distribution \hat{f} as an approximation of the population distribution f.

Take the data for Manhattan Bridge.
5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these eight data points is 1/8, we can randomly draw eight numbers from these eight values — with replacement.

It is like playing the game of Bingo where the chips are these eight numbers. Each time we get a number, we put it back and roll it again until we draw eight numbers.

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774). Some may appear more than once, and some may not show up at all in a given random sample.

Here is one such bootstrap replicate.
6359, 6359, 6359, 6052, 6774, 6359, 5276, 6359

The value 6359 appeared five times. Some values like 7247, 5054, 6691, and 5311 did not appear at all in this replicate.

Here is another replicate.
6359, 5276, 5276, 5276, 7247, 5311, 6052, 5311

Such bootstrap replicates are draws from the empirical distribution \hat{f}, i.e., the distribution that gives each value a probability equal to the proportion of times it occurs in the data sample. By resampling from \hat{f}, we approximate the information contained in the true distribution f.
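
Drawing a replicate like the two above takes one line of code. The following is a minimal sketch in Python with NumPy; the helper name `bootstrap_replicate` and the seed are hypothetical choices of mine, and your replicate will differ from the ones printed in the lesson.

```python
import numpy as np

manhattan = np.array([5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774])


def bootstrap_replicate(sample, rng):
    """Draw len(sample) values with replacement, each with probability 1/n."""
    return rng.choice(sample, size=sample.size, replace=True)


rng = np.random.default_rng(98)  # arbitrary seed, only for reproducibility
replicate = bootstrap_replicate(manhattan, rng)

# Which values showed up, how often, and which were left out of this replicate
values, counts = np.unique(replicate, return_counts=True)
missing = np.setdiff1d(manhattan, replicate)

print("replicate:", replicate)
print("counts   :", dict(zip(values.tolist(), counts.tolist())))
print("missing  :", missing)
```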

Using the Bootstrap for Two-Sample Hypothesis Tests

Since each bootstrap replicate is a possible representation of the population, we can compute the relevant test-statistics from this bootstrap sample. By repeating this, we can have many simulated values of the test-statistics that form the null distribution to test the hypothesis. There is no need to make any assumptions on the distributional nature of the data or the limiting distribution for the test-statistic. As long as we can compute a test-statistic from the bootstrap sample, we can test the hypothesis on any statistic — mean, median, variance, interquartile range, proportion, etc.

Let’s now use the bootstrap method for two-sample hypothesis tests. Suppose there are two random variables, X and Y, and the statistics computed from them are \theta_{x} and \theta_{y}. We have a sample of n_{1} values representing X and a sample of n_{2} values representing Y.

\theta_{x},\theta_{y} can be mean, median, variance, proportion, etc. Any computable statistic from the original data is of the form \theta_{x},\theta_{y}.

The null hypothesis is that there is no difference between the statistics of X and Y.

H_{0}: P(\theta_{x}>\theta_{y}) = 0.5

The alternate hypothesis is

H_{A}: P(\theta_{x}>\theta_{y}) > 0.5

or

H_{A}: P(\theta_{x}>\theta_{y}) < 0.5

or

H_{A}: P(\theta_{x}>\theta_{y}) \neq 0.5

We first create a bootstrap replicate of X and Y by randomly drawing with replacement n_{1} values from X and n_{2} values from Y.

For each bootstrap replicate i from X and Y, we compute the statistics \theta_{x} and \theta_{y} and check whether \theta_{x}>\theta_{y}. If yes, we register S_{i}=1. If not, we register S_{i}=0.

For example, one bootstrap replicate for X (Manhattan Bridge) and Y (Williamsburg Bridge) may look like this:

xboot: 6691 5311 6774 5311 6359 5311 5311 6052
yboot: 6775 6881 7341 7196 6775 7341 6775 7196

The means of this bootstrap replicate for X and Y are 5890 and 7035. Since \bar{x}^{X}_{boot}<\bar{x}^{Y}_{boot}, we register S_{i}=0.

Another bootstrap replicate for X and Y may look like this:

xboot: 6774 6359 6359 6359 6052 5054 6052 6691
yboot: 6775 7196 7196 6026 6881 7341 6881 5711

The means of this bootstrap replicate for X and Y are 6212.5 and 6750.875. Since \bar{x}^{X}_{boot}<\bar{x}^{Y}_{boot}, we register S_{i}=0.
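
The arithmetic for these two replicates is easy to verify. A small sketch in Python with NumPy (my choice of tool; the variable names are illustrative) recomputes the means and the indicator S_{i} for each pair:

```python
import numpy as np

# The two bootstrap replicates shown in the lesson, as (xboot, yboot) pairs
replicates = [
    (np.array([6691, 5311, 6774, 5311, 6359, 5311, 5311, 6052]),
     np.array([6775, 6881, 7341, 7196, 6775, 7341, 6775, 7196])),
    (np.array([6774, 6359, 6359, 6359, 6052, 5054, 6052, 6691]),
     np.array([6775, 7196, 7196, 6026, 6881, 7341, 6881, 5711])),
]

means = [(float(xb.mean()), float(yb.mean())) for xb, yb in replicates]
# S_i = 1 when the statistic for X exceeds the statistic for Y, otherwise 0
S = [int(mx > my) for mx, my in means]

for (mx, my), s in zip(means, S):
    print(f"mean(xboot) = {mx}, mean(yboot) = {my}, S_i = {s}")
```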

We repeat this process (creating bootstrap replicates of X and Y, computing the statistics \theta_{x} and \theta_{y}, checking whether \theta_{x}>\theta_{y}, and registering S_{i} \in \{0,1\}) a large number of times, say N=10,000.

The proportion of times S_{i} = 1 in a set of N bootstrap-replicated statistics is the p-value.

p-value=\frac{1}{N}\sum_{i=1}^{N}S_{i}

Based on the p-value, we can use the rule of rejection at the selected rate of rejection \alpha.

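For a one-sided alternate hypothesis, we reject the null hypothesis if p-value < \alpha when the alternative is left-sided, H_{A}: P(\theta_{x}>\theta_{y}) < 0.5, or if p-value > 1-\alpha when the alternative is right-sided, H_{A}: P(\theta_{x}>\theta_{y}) > 0.5.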

For a two-sided alternate hypothesis, we reject the null hypothesis if p-value < \frac{\alpha}{2} or p-value > 1-\frac{\alpha}{2}.
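
The procedure and the rejection rule can be sketched in a few lines. What follows is a minimal Python illustration under my own assumptions: the function names `bootstrap_p_value` and `reject_null`, the `alternative` labels, and the toy data are all hypothetical, and the loop is written for readability rather than speed.

```python
import numpy as np


def bootstrap_p_value(x, y, statistic, n_boot=10_000, seed=None):
    """Proportion of bootstrap replicates in which statistic(x*) > statistic(y*)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    s = np.empty(n_boot, dtype=int)
    for i in range(n_boot):
        xb = rng.choice(x, size=x.size, replace=True)  # n1 draws from X
        yb = rng.choice(y, size=y.size, replace=True)  # n2 draws from Y
        s[i] = int(statistic(xb) > statistic(yb))      # S_i
    return float(s.mean())


def reject_null(p_value, alpha, alternative="two-sided"):
    """Apply the rejection rule at rejection rate alpha."""
    if alternative == "two-sided":
        return bool(p_value < alpha / 2 or p_value > 1 - alpha / 2)
    if alternative == "less":      # H_A: P(theta_x > theta_y) < 0.5
        return bool(p_value < alpha)
    if alternative == "greater":   # H_A: P(theta_x > theta_y) > 0.5
        return bool(p_value > 1 - alpha)
    raise ValueError(f"unknown alternative: {alternative!r}")


# Toy data (not from the lesson): every x is smaller than every y,
# so no replicate can have mean(x*) > mean(y*).
p_toy = bootstrap_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15],
                          np.mean, n_boot=2_000, seed=0)
print(p_toy, reject_null(p_toy, alpha=0.10))
```

Because `statistic` is just a function of one sample, the same sketch works for `np.median` or any other computable statistic.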

Manhattan Bridge vs. Williamsburg Bridge

Is the mean of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(\bar{x}_{M}>\bar{x}_{W}) = 0.5

H_{A}: P(\bar{x}_{M}> \bar{x}_{W}) \neq 0.5

Let’s take a two-sided alternate hypothesis.

Here is the null distribution of \frac{\bar{x}_{M}}{\bar{x}_{W}} for N = 10,000.

A vertical bar is shown at a ratio of 1 to indicate that the area beyond this value is the proportion of times S_{i} = 1 in a set of 10,000 bootstrap-replicated means.

The p-value is 0.0466: 466 out of the 10,000 bootstrap replicates had \bar{x}_{M}>\bar{x}_{W}. For a 10% rate of error (\alpha=10\%), the two-sided rule rejects when the p-value is less than \frac{\alpha}{2}=0.05 or greater than 1-\frac{\alpha}{2}=0.95. Since 0.0466 is less than 0.05, we reject the null hypothesis. More than 95% of the time, the mean of the total bike counts on Manhattan Bridge is less than that on Williamsburg Bridge, so there is sufficient evidence that they are not equal.
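
To reproduce this kind of result, here is a minimal vectorized sketch in Python with NumPy. The seed is an arbitrary assumption, so the p-value will come out close to, but not exactly, 0.0466. Since that value is not far from the 0.05 cutoff, the decision at \alpha=10\% could be sensitive to the random draws.

```python
import numpy as np

manhattan = np.array([5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774])
williamsburg = np.array([5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196])

N = 10_000
rng = np.random.default_rng(2017)  # arbitrary seed

# Each row is one bootstrap replicate of size n1 (or n2)
xb = rng.choice(manhattan, size=(N, manhattan.size), replace=True)
yb = rng.choice(williamsburg, size=(N, williamsburg.size), replace=True)

# Ratio of bootstrap means; the area beyond 1 is the proportion with S_i = 1
ratio = xb.mean(axis=1) / yb.mean(axis=1)
p_value = float(np.mean(ratio > 1))

alpha = 0.10
reject = p_value < alpha / 2 or p_value > 1 - alpha / 2
print(f"p-value = {p_value:.4f}, reject at alpha = {alpha}: {reject}")
```

A histogram of `ratio` gives the kind of null distribution described above.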

Can we reject the null hypothesis if we select a 5% rate of error?



Is the median of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(\tilde{x}_{M}>\tilde{x}_{W}) = 0.5

H_{A}: P(\tilde{x}_{M}> \tilde{x}_{W}) \neq 0.5

The null distribution of \frac{\tilde{x}_{M}}{\tilde{x}_{W}} for N = 10,000.

The p-value is 0.1549: 1549 out of the 10,000 bootstrap replicates had \tilde{x}_{M}>\tilde{x}_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value lies between \frac{\alpha}{2}=0.05 and 1-\frac{\alpha}{2}=0.95. The evidence (84.51% of the time) that the median of the total bike counts on Manhattan Bridge is less than that on Williamsburg Bridge is not sufficient to reject equality.



Is the variance of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(s^{2}_{M}>s^{2}_{W}) = 0.5

H_{A}: P(s^{2}_{M}>s^{2}_{W}) \neq 0.5

The null distribution of \sqrt{\frac{s^{2}_{M}}{s^{2}_{W}}} for N = 10,000. We are looking at the null distribution of the ratio of the standard deviations.

The p-value is 0.4839: 4839 out of the 10,000 bootstrap replicates had s^{2}_{M}>s^{2}_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value lies between \frac{\alpha}{2}=0.05 and 1-\frac{\alpha}{2}=0.95. The evidence (51.61% of the time) that the variance of the total bike counts on Manhattan Bridge is less than that on Williamsburg Bridge is not sufficient to reject equality.
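
Plotting the ratio of standard deviations while counting S_{i} on the variances is consistent, because the square root is an increasing function: the standard-deviation ratio exceeds 1 exactly when s^{2}_{M}>s^{2}_{W}. A minimal Python sketch (NumPy, arbitrary seed, both my assumptions) shows the two counts agreeing:

```python
import numpy as np

manhattan = np.array([5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774])
williamsburg = np.array([5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196])

N = 10_000
rng = np.random.default_rng(97)  # arbitrary seed
xb = rng.choice(manhattan, size=(N, manhattan.size), replace=True)
yb = rng.choice(williamsburg, size=(N, williamsburg.size), replace=True)

# Sample variances (n - 1 in the denominator) of every replicate
var_x = xb.var(axis=1, ddof=1)
var_y = yb.var(axis=1, ddof=1)

# S_i counted on the variances ...
p_from_var = float(np.mean(var_x > var_y))
# ... and on the ratio of standard deviations, as in the plotted distribution
sd_ratio = xb.std(axis=1, ddof=1) / yb.std(axis=1, ddof=1)
p_from_ratio = float(np.mean(sd_ratio > 1))

print(f"p-value from variances: {p_from_var:.4f}")
print(f"p-value from sd ratio : {p_from_ratio:.4f}")
```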



Is the interquartile range of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?

H_{0}: P(IQR_{M}>IQR_{W}) = 0.5

H_{A}: P(IQR_{M}>IQR_{W}) \neq 0.5

The null distribution of \frac{IQR_{M}}{IQR_{W}} for N = 10,000. It does not resemble any known distribution, but that does not restrain us since the bootstrap-based hypothesis test is distribution-free.

The p-value is 0.5453: 5453 out of the 10,000 bootstrap replicates had IQR_{M}>IQR_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value lies between \frac{\alpha}{2}=0.05 and 1-\frac{\alpha}{2}=0.95. The evidence (45.47% of the time) that the interquartile range of the total bike counts on Manhattan Bridge is less than that on Williamsburg Bridge is not sufficient to reject equality.
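
Here is a minimal Python sketch of the same test for the interquartile range. The quantile definition is an assumption on my part (NumPy's default linear interpolation), as are the helper name `iqr_rows` and the seed, so the p-value may differ somewhat from 0.5453.

```python
import numpy as np

manhattan = np.array([5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774])
williamsburg = np.array([5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196])


def iqr_rows(samples):
    """Interquartile range of each row (one row per bootstrap replicate)."""
    q25, q75 = np.percentile(samples, [25, 75], axis=1)
    return q75 - q25


N = 10_000
rng = np.random.default_rng(14)  # arbitrary seed
xb = rng.choice(manhattan, size=(N, manhattan.size), replace=True)
yb = rng.choice(williamsburg, size=(N, williamsburg.size), replace=True)

iqr_x = iqr_rows(xb)
iqr_y = iqr_rows(yb)

# Compare directly rather than through the ratio: a replicate made mostly
# of one repeated value can have an IQR of zero, which would break a division.
p_value = float(np.mean(iqr_x > iqr_y))
print(f"p-value = {p_value:.4f}")
```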



Finally, is the proportion of the total bike counts on Manhattan Bridge less than 6352, different from that on Williamsburg Bridge?

H_{0}: P(p_{M}>p_{W}) = 0.5

H_{A}: P(p_{M}>p_{W}) \neq 0.5

The null distribution of \frac{p_{M}}{p_{W}} for N = 10,000.

The p-value is 0.5991: 5991 out of the 10,000 bootstrap replicates had p_{M}>p_{W}. For a 10% rate of error (\alpha=10\%), we cannot reject the null hypothesis since the p-value lies between \frac{\alpha}{2}=0.05 and 1-\frac{\alpha}{2}=0.95. The evidence (40.09% of the time) that the proportion of the total bike counts less than 6352 on Manhattan Bridge is no greater than that on Williamsburg Bridge is not sufficient to reject equality.
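
With eight values per sample, each bootstrap proportion can only be a multiple of 1/8, so ties p_{M}=p_{W} are common. Under the strict inequality used to define S_{i}, a tie registers as 0. A minimal Python sketch (NumPy and the seed are my assumptions) computes the p-value and reports how often ties occur:

```python
import numpy as np

manhattan = np.array([5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774])
williamsburg = np.array([5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196])
THRESHOLD = 6352

N = 10_000
rng = np.random.default_rng(93)  # arbitrary seed
xb = rng.choice(manhattan, size=(N, manhattan.size), replace=True)
yb = rng.choice(williamsburg, size=(N, williamsburg.size), replace=True)

# Proportion of counts below the threshold in every replicate
p_m = np.mean(xb < THRESHOLD, axis=1)
p_w = np.mean(yb < THRESHOLD, axis=1)

p_value = float(np.mean(p_m > p_w))    # S_i = 1 only for a strict inequality
tie_rate = float(np.mean(p_m == p_w))  # ties register as S_i = 0

print(f"p-value = {p_value:.4f}, share of ties = {tie_rate:.4f}")
```

Note that p_{W} can be exactly zero in some replicates, so comparing the proportions directly is safer than dividing one by the other.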



Can you see the bootstrap concept’s flexibility and how widely we can apply it for hypothesis testing? Just remember that the underlying assumption is that the data are independent. 

To summarize,

Repeatedly sample with replacement from original samples of X and Y -- N times.
Each time draw a sample of size n_{1} from X and a sample of size n_{2} from Y.
Compute the desired statistic (mean, median, skew, etc.) from each bootstrap sample.
The null hypothesis P(\theta_{x}>\theta_{y})=0.5 can now be tested as follows:
  • S_{i}=1 if \theta_{x}>\theta_{y}, else, S_{i}=0
  • p-value=\frac{1}{N}\sum_{i=1}^{N}S_{i} (average over all N bootstrap-replicated statistics)
  • If p-value < \frac{\alpha}{2} or p-value > 1-\frac{\alpha}{2}, reject the null hypothesis for a two-sided hypothesis test at a selected rejection rate of \alpha.
  • If p-value < \alpha, reject the null hypothesis for a left-sided hypothesis test at a selected rejection rate of \alpha.
  • If p-value > 1-\alpha, reject the null hypothesis for a right-sided hypothesis test at a selected rejection rate of \alpha.

After seven lessons, we are now equipped with all the theory of the two-sample hypothesis tests. It is time to put them to practice. Dust off your programming machines and get set.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
