Lesson 99 – The Two-Sample Hypothesis Tests in R

Over the past seven lessons, we equipped ourselves with the necessary theory of the two-sample hypothesis tests.

Lessons 92 and 93 were about the hypothesis test on the difference in proportions. In Lesson 92, we learned Fisher’s Exact Test to verify the difference in proportions. In Lesson 93, we did the same using the normal distribution as a limiting distribution when the sample sizes are large.

Lessons 94, 95, and 96 were about the hypothesis test on the difference in means. In Lesson 94, we learned that under the proposition that the population variances of two random variables are equal, the test-statistic, which uses the pooled variance, follows a T-distribution with n_{1}+n_{2}-2 degrees of freedom. In Lesson 95, we learned that Welch’s t-Test could be used when we cannot make the assumption of equal population variances. In Lesson 96, we learned Wilcoxon’s Rank-sum Test, which uses a ranking approach to assess the significance of the difference in means. This method is data-driven, needs no assumptions on the limiting distribution, and works well for small sample sizes.

Lesson 97 was about the hypothesis test on the equality of variances. Here we got a sneak peek at a new distribution called the F-distribution, which is the limiting distribution of the ratio of two Chi-square distributions, each divided by its respective degrees of freedom. The test-statistic is the ratio of the sample variances, which we can verify against an F-distribution with n_{1}-1 numerator degrees of freedom and n_{2}-1 denominator degrees of freedom.

All these hypothesis tests can also be done using the bootstrap approach, which we learned in Lesson 98. It uses the data at hand to generate the null distribution of any desired statistic as long as it is computable from the data. It is very flexible. There is no need to make any assumptions on the data’s distributional nature or the limiting distribution for the test-statistic.

It is now time to put all of this learning to practice.


The City of New York Department of Sanitation (DSNY) archives data on the monthly tonnage of trash collected from NYC residences and institutions.

To help us with today’s workout, I dug up our gold mine, the “Open Data for All New Yorkers” page, and found an interesting dataset on the monthly tonnage of trash collected from NYC residences and institutions.

Here is a preview.

The data includes details on when (year and month) DSNY collected the trash, from where (borough and community district), and how much (tonnage of refuse, source-separated recyclable paper, and source-separated metal, glass, plastic, and beverage cartons). They also have other seasonal data such as the tonnage of leaves collected in November and December and tons of Christmas trees collected from curbside in January.

For our workout today, we can use this file named “DSNY_Monthly_Tonnage_Data.csv.” It is a slightly modified version of the file you will find from the open data page.

My interest is in Manhattan’s community district 9 — Morningside Heights, Manhattanville, and Hamilton Heights.

I want to compare the tonnage of trash collected in winter to that in summer. Let’s say we compare the data in Manhattan’s community district 9 for February and August to keep it simple.

Are you ready?

The Basic Steps

Step 1: Get the data

You can get the data from here. The file is named DSNY_Monthly_Tonnage_Data.csv. It is a comma-separated values file which we can conveniently read into the R workspace. 

Step 2: Create a new folder on your computer

And call this folder “lesson99.”
Make sure that the data file “DSNY_Monthly_Tonnage_Data.csv” is in this folder.

Step 3: Create a new code in R

Create a new code for this lesson. “File >> New >> R script”. Save the code in the same folder “lesson99” using the “save” button or by using “Ctrl+S.” Use .R as the extension — “lesson99_code.R.”

Step 4: Choose your working directory

“lesson99” is the folder where we stored the code and the input data file. Use setwd("path") to set the path to this folder. Execute the line by clicking the “Run” button on the top right. 

setwd("path to your folder")

Or, if you opened this “lesson99_code.R” file from within your “lesson99” folder, you can also use the following line to set your working directory.

setwd(getwd())

getwd() gets the current working directory. If you opened the file from within your “lesson99” folder, getwd() will return the working directory you want to set.

Step 5: Read the data into the R workspace

Execute the following command to read your input .csv file.

# Read the data file
nyc_trash_data = read.csv("DSNY_Monthly_Tonnage_Data.csv",header=T)

Since the input file has a header for each column, we should have header=T in the command when we read the file so that the first row is treated as column names and the remaining rows are correctly read as data.

Your RStudio interface will look like this when you execute this line and open the file from the Global Environment.

Step 6: Extract a subset of the data

As I said before, for today’s workout, let’s use the data from Manhattan’s community district 9 for February and August. Execute the following lines to extract this subset of the data.

# Extract February refuse data for Manhattan's community district 9 
feb_index = which((nyc_trash_data$BOROUGH=="Manhattan") & (nyc_trash_data$COMMUNITYDISTRICT==9) & (nyc_trash_data$MONTH==2))

feb_refuse_data = nyc_trash_data$REFUSETONSCOLLECTED[feb_index]

# Extract August refuse data for Manhattan's community district 9
aug_index = which((nyc_trash_data$BOROUGH=="Manhattan") & (nyc_trash_data$COMMUNITYDISTRICT==9) & (nyc_trash_data$MONTH==8))

aug_refuse_data = nyc_trash_data$REFUSETONSCOLLECTED[aug_index]

In the first line, we look up the row indices of the data corresponding to Borough = Manhattan, Community District = 9, and Month = 2.

In the second line, we extract the data on refuse tons collected for these rows.

The next two lines repeat this process for August data.

Sample 1, February data for Manhattan’s community district 9 looks like this:
2953.1, 2065.8, 2668.2, 2955.4, 2799.4, 2346.7, 2359.6, 2189.4, 2766.1, 3050.7, 2175.1, 2104.0, 2853.4, 3293.2, 2579.1, 1979.6, 2749.0, 2871.9, 2612.5, 455.9, 1951.8, 2559.7, 2753.8, 2691.0, 2953.1, 3204.7, 2211.6, 2857.9, 2872.2, 1956.4, 1991.3

Sample 2, August data for Manhattan’s community district 9 looks like this:
2419.1, 2343.3, 3032.5, 3056.0, 2800.7, 2699.9, 3322.3, 3674.0, 3112.2, 3345.1, 3194.0, 2709.5, 3662.6, 3282.9, 2577.9, 3179.1, 2460.9, 2377.1, 3077.6, 3332.8, 2233.5, 2722.9, 3087.8, 3353.2, 2992.9, 3212.3, 2590.1, 2978.8, 2486.8, 2348.2

The values are in tons per month, and there are 31 values for sample 1 and 30 values for sample 2. Hence, n_{1}=31 and n_{2}=30.

The Questions for Hypotheses

Let’s ask the following questions.

  1. Is the mean of the February refuse collected from Manhattan’s community district 9 different from August?
  2. Is the variance of the February refuse collected from Manhattan’s community district 9 different from August?
  3. Suppose we set 2500 tons as a benchmark for low trash situations. Is the proportion of February months with refuse below 2500 tons different from the proportion of August months? In other words, is the low trash situation in February different from that in August?

We can visualize their distributions. Execute the following lines to create a neat graphic of the boxplots of sample 1 and sample 2.

# Visualizing the distributions of the two samples
boxplot(cbind(feb_refuse_data,aug_refuse_data),horizontal=T, main="Refuse from Manhattan's Community District 9")
text(1500,1,"February Tonnage",font=2)
text(1500,2,"August Tonnage",font=2)

p_threshold = 2500 # tons of refuse
abline(v=p_threshold,lty=2,col="maroon")

You should now see your plot space changing to this.

Next, you can execute the following lines to compute the preliminaries required for our tests.

# Preliminaries
  # sample sizes
  n1 = length(feb_refuse_data)
  n2 = length(aug_refuse_data)
  
  # sample means
  x1bar = mean(feb_refuse_data)
  x2bar = mean(aug_refuse_data)

  # sample variances
  x1var = var(feb_refuse_data)
  x2var = var(aug_refuse_data)

  # sample proportions
  p1 = length(which(feb_refuse_data < p_threshold))/n1
  p2 = length(which(aug_refuse_data < p_threshold))/n2

Hypothesis Test on the Difference in Means

We learned four methods to verify the hypothesis on the difference in means:

  • The two-sample t-Test
  • Welch’s two-sample t-Test
  • Wilcoxon’s Rank-sum Test
  • The Bootstrap Test 

Let’s start with the two-sample t-Tests. Assuming a two-sided alternative, the null and alternate hypothesis are:

H_{0}: \mu_{1} - \mu_{2}=0

H_{A}: \mu_{1} - \mu_{2} \neq 0

The alternate hypothesis indicates that substantial positive or negative differences will allow the rejection of the null hypothesis.

We know that for the two-sample t-Test, we assume that the population variances are equal, and the test-statistic is t_{0}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}, where s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2} is the pooled variance. t_{0} follows a T-distribution with n_{1}+n_{2}-2 degrees of freedom. We can compute the test-statistic and check how likely it is to see such a value in a T-distribution (null distribution) with so many degrees of freedom.

Execute the following lines in R.

pooled_var = ((n1-1)/(n1+n2-2))*x1var + ((n2-1)/(n1+n2-2))*x2var

t0 = (x1bar-x2bar)/sqrt(pooled_var*((1/n1)+(1/n2)))

df = n1+n2-2

pval = pt(t0,df=df)

print(pooled_var)
print(df)
print(t0)
print(pval)

We first computed the pooled variance s^{2} and, using it, the test-statistic t_{0}. Then, based on the degrees of freedom, we compute the p-value.

The p-value is 0.000736. For a two-sided test, we compare this with \frac{\alpha}{2}. If we opt for a rejection rate \alpha of 5%, then, since our p-value of 0.000736 is less than 0.025, we reject the null hypothesis that the means are equal.

All these steps can be implemented using a one-line command in R.

Try this:

t.test(feb_refuse_data,aug_refuse_data,alternative="two.sided",var.equal = TRUE)

You will get the following prompt in the console.

Two Sample t-test

data: feb_refuse_data and aug_refuse_data

t = -3.3366, df = 59, p-value = 0.001472

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

 -658.2825 -164.7239

sample estimates:

mean of x mean of y 

 2510.697 2922.200 

As inputs of the function, we need to provide the data (sample 1 and sample 2). We should also indicate that we evaluate a two-sided alternate hypothesis and that the population variances are equal. This will prompt the function to execute the standard two-sample t-test.

While the function provides additional information, all we need for our test is the line with the test-statistic: t = -3.3366 and df = 59 match the test-statistic and the degrees of freedom we computed earlier. The p-value looks different from what we computed; it is two times the value we estimated. The function doubles the one-tail probability so that we can compare the p-value directly to \alpha instead of \frac{\alpha}{2}. Either way, we know that we can reject the null hypothesis.
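If you want to verify that doubling yourself, here is a quick check (a small sketch using the t0 and sample sizes already in the workspace); the doubled one-tail probability should reproduce the t.test() p-value.

# Check: the t.test() p-value is twice the one-tail probability we computed
2*pt(-abs(t0),df=n1+n2-2) # should print ~0.001472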


Next, let’s implement Welch’s t-Test. When the population variances are not equal, i.e., when \sigma_{1}^{2} \neq \sigma_{2}^{2}, the test-statistic is t_{0}^{*}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}, and it follows an approximate T-distribution with f degrees of freedom which can be estimated using the Satterthwaite’s equation: f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1)}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)}}

Execute the following lines.

f = (((x1var/n1)+(x2var/n2))^2)/(((x1var/n1)^2/(n1-1))+((x2var/n2)^2/(n2-1)))

t0 = (x1bar-x2bar)/sqrt((x1var/n1)+(x2var/n2))

pval = pt(t0,df=f)

print(f)
print(t0)
print(pval)

The degrees of freedom from Satterthwaite’s equation is about 55.34, the test-statistic is -3.3528, and the p-value is 0.000724. Comparing this with 0.025, we can reject the null hypothesis.

Of course, all this can be done using one line where you indicate that the variances are not equal:

t.test(feb_refuse_data,aug_refuse_data,alternative="two.sided",var.equal = FALSE)

When you execute the above line, you will see the following on your console.

Welch Two Sample t-test
data: feb_refuse_data and aug_refuse_data
t = -3.3528, df = 55.338, p-value = 0.001448
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-657.4360 -165.5705
sample estimates:
mean of x mean of y
2510.697 2922.200

Did you notice that it runs Welch’s two-sample t-Test when we indicate that the variances are not equal? Like before, we can compare the p-value to 0.05 and reject the null hypothesis.

The mean of the February refuse collected from Manhattan’s community district 9 is different from August. Since the test-statistic is negative and the p-value is lower than the chosen level of rejection rate, we can conclude that the mean of February refuse tonnage is significantly lower than the mean of August refuse tonnage beyond a reasonable doubt.


How about Wilcoxon’s Rank-sum Test? The null hypothesis is based on the proposition that if x_{1} and x_{2} are samples from the same distribution, there will be an equal likelihood of one exceeding the other.

H_{0}: P(x_{1} > x_{2}) = 0.5

H_{A}: P(x_{1} > x_{2}) \neq 0.5

We know that this ranking and coming up with the rank-sum table by hand is tedious for larger sample sizes. In that case, the null distribution of the test-statistic W, which is the sum of the ranks associated with the variable of smaller sample size in the pooled and ordered data, tends to a normal distribution. This allows the calculation of the p-value by comparing W to a normal distribution. We can use the following command in R to implement this.

wilcox.test(feb_refuse_data,aug_refuse_data,alternative = "two.sided")

You will get the following message in the console when you execute the above line.

Wilcoxon rank sum test with continuity correction
data: feb_refuse_data and aug_refuse_data
W = 248, p-value = 0.001788
alternative hypothesis: true location shift is not equal to 0

Since the p-value of 0.001788 is lower than 0.05, the chosen level of rejection, we reject the null hypothesis that x_{1} and x_{2} are samples from the same distribution.
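As a side check, here is a small sketch of how the W reported above can be reproduced from the ranks. Note that R reports the Mann-Whitney form of the rank-sum statistic for the first sample supplied (the sum of its ranks in the pooled data minus n1(n1+1)/2), with ties assigned their average rank.

# Reproducing W from the ranks (a sketch)
pooled_ranks = rank(c(feb_refuse_data,aug_refuse_data))
W_check = sum(pooled_ranks[1:n1]) - n1*(n1+1)/2
print(W_check) # should match W = 248 from wilcox.test()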


Finally, let’s take a look at the bootstrap method. The null hypothesis is that there is no difference between the means. 

H_{0}: P(\bar{x}_{Feb}>\bar{x}_{Aug}) = 0.5

H_{A}: P(\bar{x}_{Feb}>\bar{x}_{Aug}) \neq 0.5

We first create a bootstrap replicate of X and Y by randomly drawing with replacement n_{1} values from X and n_{2} values from Y.

For each bootstrap replicate from X and Y, we compute the statistics \bar{x}_{Feb} and \bar{x}_{Aug} and check whether \bar{x}_{Feb}>\bar{x}_{Aug}. If yes, we register S_{i}=1. If not, we register S_{i}=0.

We repeat this process of creating bootstrap replicates of X and Y, computing the statistics \bar{x}_{Feb} and \bar{x}_{Aug}, and verifying whether \bar{x}_{Feb}>\bar{x}_{Aug} and registering S_{i} \in \{0,1\} a large number of times, say N=10,000.

The proportion of times S_{i} = 1 in a set of N bootstrap-replicated statistics is the p-value.

Execute the following lines in R to implement these steps.

#4. Bootstrap
  N = 10000
  null_mean = matrix(0,nrow=N,ncol=1)
  null_mean_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
     null_mean_ratio[i] = mean(xboot)/mean(yboot) 
     if(mean(xboot)>mean(yboot)){null_mean[i]=1} 
   }
 
 pvalue_mean = sum(null_mean)/N
 hist(null_mean_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="Xbar/Ybar",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(0.95,1000,paste("p-value=",pvalue_mean),col="red")

Your RStudio interface will then look like this:

A vertical bar will show up at a ratio of 1 to indicate that the area beyond this value is the proportion of times S_{i} = 1 in 10,000 bootstrap-replicated means.

The p-value is close to 0. Since it is less than 0.025, we reject the null hypothesis. In more than 97.5% of the bootstrap replicates, the mean of the February tonnage is less than the mean of the August tonnage, so there is sufficient evidence that the two means are not equal.

In summary, while the standard t-Test, Welch's t-Test, and Wilcoxon's Rank-sum Test can be implemented in R using single-line commands,

t.test(x1,x2,alternative="two.sided",var.equal = TRUE)
t.test(x1,x2,alternative="two.sided",var.equal = FALSE)
wilcox.test(x1,x2,alternative = "two.sided")

the bootstrap test needs a few lines of code.
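Because the bootstrap recipe is identical no matter which statistic we test, those few lines can also be wrapped into a small helper function. The sketch below is my own illustration (the function name and structure are not from the lesson); it accepts any statistic computable from the data, echoing the flexibility we noted in Lesson 98.

# A reusable sketch of the bootstrap two-sample test
boot_two_sample_test = function(x,y,stat_fun,N=10000)
 {
   count = 0
   for(i in 1:N)
    {
      xboot = sample(x,replace=T)
      yboot = sample(y,replace=T)
      if(stat_fun(xboot)>stat_fun(yboot)){count = count + 1}
    }
   count/N # proportion of replicates where the statistic of x exceeds that of y
 }

# For example, this should be close to the bootstrap p-value we computed above
boot_two_sample_test(feb_refuse_data,aug_refuse_data,mean)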

Hypothesis Test on the Equality of Variances

We learned two methods to verify the hypothesis on the equality of variances; using F-distribution and using the bootstrap method.

For the F-distribution method, the null and the alternate hypothesis are 

H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2}

H_{A}:\sigma_{1}^{2} \neq \sigma_{2}^{2}

The test-statistic is the ratio of the sample variances, f_{0}=\frac{s_{1}^{2}}{s_{2}^{2}}

We evaluate the hypothesis based on where this test-statistic lies in the null distribution or how likely it is to find a value as large as this test-statistic f_{0} in the null distribution.

Execute the following lines.

# 1. F-Test
  f0 = x1var/x2var
  
  df_numerator = n1-1
  
  df_denominator = n2-1
  
  pval = 1-pf(f0,df1=df_numerator,df2=df_denominator)
  
  print(f0)
  print(df_numerator)
  print(df_denominator)
  print(pval)

The test-statistic, i.e., the ratio of the sample variances, is 1.814.

The p-value, which in this case is the probability of finding a value greater than the test-statistic (P(F > f_{0})), is 0.0562. Compared to a one-sided rejection rate (\frac{\alpha}{2}) of 0.025, it is larger, so we cannot reject the null hypothesis.

We do have a one-liner command.

var.test(feb_refuse_data,aug_refuse_data,alternative = "two.sided")

You will find the following message when you execute the var.test() command.

F test to compare two variances
data: feb_refuse_data and aug_refuse_data
F = 1.814, num df = 30, denom df = 29, p-value = 0.1125
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.8669621 3.7778616
sample estimates:
ratio of variances
1.813959

As usual, with the standard R function, we need to compare the p-value (which is two times what we computed based on the right-tail probability) with \alpha=0.05. Since 0.1125 > 0.05, we cannot reject the null hypothesis that the variances are equal.
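Here is a quick check of that doubling (a sketch using the f0 we already computed). Since f0 is greater than 1, twice the right-tail probability should reproduce the var.test() p-value.

# Check: the var.test() p-value is twice the right-tail probability here
2*(1-pf(f0,df1=n1-1,df2=n2-1)) # should print ~0.1125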


Can we come to the same conclusion using the bootstrap test? Let’s try. Execute the following lines in R.

# 2. Bootstrap
 N = 10000
   null_var = matrix(0,nrow=N,ncol=1)
   null_var_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
     null_var_ratio[i] = var(xboot)/var(yboot) 
     if(var(xboot)>var(yboot)){null_var[i]=1} 
   }
 
 pvalue_var = sum(null_var)/N
 hist(null_var_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="XVar/YVar",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(2,500,paste("p-value=",pvalue_var),col="red")

Your RStudio interface should look like this if you correctly coded up and executed the lines. The code provides a way to visualize the null distribution.

The p-value is 0.7945; 7945 out of the 10,000 bootstrap replicates had s^{2}_{Feb}>s^{2}_{Aug}. For a 5% rate of error (\alpha=5\%), we cannot reject the null hypothesis since the p-value is greater than 0.025. The evidence that the February tonnage variance is less than the August tonnage variance (20.55% of the replicates) is not sufficient to reject equality.

In summary, use var.test(x1,x2,alternative = "two.sided") if you want to verify based on F-distribution, or code up a few lines if you want to verify using the bootstrap method.

Hypothesis Test on the Difference in Proportions

For the hypothesis test on the difference in proportions, we can employ Fisher’s Exact Test, use the normal approximation under the large-sample assumption, or the bootstrap method.

For Fisher’s Exact Test, the test-statistic follows a hypergeometric distribution when H_{0} is true. We could assume that the number of successes is fixed at t=x_{1}+x_{2}, and, for a fixed value of t, we reject H_{0}:p_{1}=p_{2} for the alternate hypothesis H_{A}:p_{1}>p_{2} if there are more successes in random variable X_{1} compared to X_{2}.

The p-value can be derived under the assumption that the number of successes X=k in the first sample X_{1} has a hypergeometric distribution when H_{0} is true and conditional on a total number of t successes that can come from any of the two random variables X_{1} and X_{2}.

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}

Execute the following lines to implement this process, plot the null distribution and compute the p-value.

#Fisher's Exact Test
  x1 = length(which(feb_refuse_data < p_threshold))
  p1 = x1/n1
  
  x2 = length(which(aug_refuse_data < p_threshold))
  p2 = x2/n2
  
  N = n1+n2
  t = x1+x2
  
  k = seq(from=0,to=n1,by=1)
  p = k
  for(i in 1:length(k)) 
   {
    p[i] = (choose(t,k[i])*choose((N-t),(n1-k[i])))/choose(N,n1)
   }

  plot(k,p,type="h",xlab="Number of successes in X1",ylab="P(X=k)",font=2,font.lab=2)
  points(k,p,type="o",lty=2,col="grey50")
  points(k[13:length(k)],p[13:length(k)],type="o",col="red",lwd=2)
  points(k[13:length(k)],p[13:length(k)],type="h",col="red",lwd=2)
  pvalue = sum(p[13:length(k)])
  print(pvalue)

Your RStudio should look like this once you run these lines.

The p-value is 0.1539. Since it is greater than the rejection rate, we cannot reject the null hypothesis that the proportions are equal.
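If you prefer not to code the formula by hand, R’s built-in hypergeometric functions should give the same numbers (a sketch using the x1, t, N, and n1 we just defined): dhyper() evaluates P(X=k) and phyper() gives the cumulative probability.

# Check with R's hypergeometric functions (a sketch)
p_check = dhyper(k,t,N-t,n1) # should match the p vector from the loop above
pvalue_check = 1 - phyper(x1-1,t,N-t,n1) # P(X >= x1); should print ~0.1539
print(pvalue_check)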

We can also leverage an in-built R function that does the same. Try this.

fisher_data = cbind(c(x1,x2),c((n1-x1),(n2-x2)))
fisher.test(fisher_data,alternative="greater")

In the first line, we create a 2×2 contingency table to indicate the number of successes and failures in each sample. Sample 1 has 12 successes, i.e., 12 times the tonnage in February was less than 2500 tons. Sample 2 has seven successes. So the contingency table looks like this:

    [,1]  [,2]
[1,] 12    19
[2,]  7    23

Then, the fisher.test() function implements the hypergeometric distribution calculations.

You will find the following prompt in your console.

Fisher’s Exact Test for Count Data
data: fisher_data
p-value = 0.1539
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
0.712854 Inf
sample estimates:
odds ratio
2.050285

We get the same p-value as before — we cannot reject the null hypothesis.


Under the normal approximation for large sample sizes, the test-statistic z = \frac{\hat{p_{1}}-\hat{p_{2}}}{\sqrt{p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})}} \sim N(0,1). p is the pooled proportion, which can be computed as p = \frac{x_{1}+x_{2}}{n_{1}+n_{2}}.

We reject the null hypothesis when the p-value P(Z \ge z) is less than the rate of rejection \alpha.

Execute the following lines to set this up in R.

# Z-approximation
  p = (x1+x2)/(n1+n2)
  z = (p1-p2)/sqrt(p*(1-p)*((1/n1)+(1/n2)))
  pval = 1-pnorm(z)

The p-value is 0.0974. Since it is greater than 0.05, we cannot reject the null hypothesis.
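As a cross-check (a sketch, not part of the original workout), prop.test() with the continuity correction turned off implements the same pooled-proportion test in chi-square form (z squared), so its one-sided p-value should agree with 1-pnorm(z).

# Cross-check with prop.test; correct=FALSE drops the continuity correction
prop.test(c(x1,x2),c(n1,n2),alternative="greater",correct=FALSE)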


Finally, we can use the bootstrap method as follows.

# Bootstrap
 N = 10000
 null_prop = matrix(0,nrow=N,ncol=1)
 null_prop_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
 
     p1boot = length(which(xboot < p_threshold))/n1 
     p2boot = length(which(yboot < p_threshold))/n2 

     null_prop_ratio[i] = p1boot/p2boot 
    if(p1boot>p2boot){null_prop[i]=1} 
  }
 
 pvalue_prop = sum(null_prop)/N
 hist(null_prop_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="P1/P2",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(2,250,paste("p-value=",pvalue_prop),col="red")

We evaluated whether the low trash situation in February is different than August.

H_{0}: P(p_{Feb}>p_{Aug}) = 0.5

H_{A}: P(p_{Feb}>p_{Aug}) > 0.5

The p-value is 0.8941. 8941 out of the 10,000 bootstrap replicates had p_{Feb}>p_{Aug}. For a 5% rate of error (\alpha=5\%), we cannot reject the null hypothesis since the p-value is not greater than 0.95. The evidence (89.41% of the times) that the low trash situation in February is greater than August is insufficient to reject equality.

Your RStudio interface will look like this.

In summary, 
use fisher.test(contingency_table,alternative="greater") 
if you want to verify based on Fisher's Exact Test; 

use p = (x1+x2)/(n1+n2); 
    z = (p1-p2)/sqrt(p*(1-p)*((1/n1)+(1/n2))); 
    pval = 1-pnorm(z) 
if you want to verify based on the normal approximation; 

code up a few lines if you want to verify using the bootstrap method.

  1. Is the mean of the February refuse collected from Manhattan’s community district 9 different from August?
    • Reject the null hypothesis. They are different.
  2. Is the variance of the February refuse collected from Manhattan’s community district 9 different from August?
    • Cannot reject the null hypothesis. They are probably the same.
  3. Suppose we set 2500 tons as a benchmark for low trash situations. Is the proportion of February months with refuse below 2500 tons different from the proportion of August months? In other words, is the low trash situation in February different from that in August?
    • Cannot reject the null hypothesis. They are probably the same.

Please take the next two weeks to digest all the two-sample hypothesis tests, including how to execute them in R.

Here is the full code for today’s lesson.

In a little over four years, we reached 99 lessons 😎

I will have something different for the 100th. Stay tuned …

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 95 – The Two-Sample Hypothesis Test – Part IV

On the Difference in Means

using Welch’s t-Test

H_{0}: \mu_{1} - \mu_{2} = 0

H_{A}: \mu_{1} - \mu_{2} > 0

H_{A}: \mu_{1} - \mu_{2} < 0

H_{A}: \mu_{1} - \mu_{2} \neq 0

On the 24th day of January 2021, we examined Tom’s hypothesis on the Mohawk River’s arsenic levels.

After a lengthy expose on the fundamentals behind hypothesis testing on the difference in means using a two-sample t-Test, we concluded that Tom could not reject the null hypothesis H_{0}: \mu_{1}-\mu_{2}=0.

He cannot count on the theory that the factory is illegally dumping their untreated waste into the west branch Mohawk River until he finds more evidence.

However, Tom now has a new theory that the factory is illegally dumping their untreated waste into both the west and the east branches of the Mohawk River. So, he took Ron with him to collect data from Utica River, a tributary of Mohawk that branches off right before the factory.

If his new theory is correct, he should find that the mean arsenic concentration in either the west or the east branch is significantly greater than the mean arsenic concentration in the Utica River.

There are now three samples whose concentration in parts per billion are:

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7
n_{1}=10 | \bar{x_{1}} = 12.5 | s_{1}^{2} = 58.28

East Branch: 4, 6, 24, 11, 14, 7, 11, 25, 13, 5
n_{2}=10 | \bar{x_{2}} = 12 | s_{2}^{2} = 54.89

Utica River: 4, 4, 6, 4, 5, 7, 8
n_{3}=7 | \bar{x_{3}} = 5.43 | s_{3}^{2} = 2.62

Were you able to help Tom with his new hypothesis?

In his first hypothesis test, since the sample variances were close to each other (58.28 and 54.89), we assumed that the population variances are equal and proceeded with a t-Test.

Under the proposition that the population variances of two random variables X_{1} and X_{2} are equal, i.e., \sigma_{1}^{2}=\sigma_{2}^{2}=\sigma^{2}, the test-statistic is t_{0}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}, where s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2} is the pooled variance. t_{0} follows a T-distribution with n_{1}+n_{2}-2 degrees of freedom.

But, can we make the same assumption in the new case? The Utica River sample has a sample variance s_{3}^{2} of 2.62. It will not be reasonable to assume that the population variances are equal.

How do we proceed when \sigma_{1}^{2} \neq \sigma_{2}^{2}?

Let’s go back a few steps and outline how we arrived at the test-statistic.

The hypothesis test is on the difference in means: \mu_{1} - \mu_{2}

A good estimator of the difference in population means is the difference in sample means: y = \bar{x_{1}} - \bar{x_{2}}

The expected value of y, E[y], is \mu_{1}-\mu_{2}, and the variance of y, V[y], is \frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}.

Since y \sim N(\mu_{1}-\mu_{2},\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}), its standardized version z = \frac{y-(\mu_{1}-\mu_{2})}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}} is the starting point to deduce the test-statistic.

This statistic reduces to z = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}} under the null hypothesis that \mu_{1}-\mu_{2}=0.

Last week, we entertained the idea that we are comparing the difference in the means of two populations whose variance is equal and reasoned that the test-statistic follows a T-distribution with (n_{1}+n_{2}-2) degrees of freedom.

This is because the pooled population variance \sigma^{2} can be replaced by its unbiased estimator s^{2}, which, in turn is related to a Chi-squared distribution with (n_{1}+n_{2}-2) degrees of freedom.

When the population variances are not equal, i.e., when \sigma_{1}^{2} \neq \sigma_{2}^{2}, there is no pooled variance that can be related to the Chi-square distribution.

The best estimate of V[y] is obtained by replacing the individual population variances (\sigma_{1}^{2}, \sigma_{2}^{2}) by the sample variances (s_{1}^{2}, s_{2}^{2}).

Hence,

z = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}

We should now identify the limiting distribution of V[y]

Bernard Lewis Welch, a British statistician, explained in his works of 1936, 1938, and 1947 that, with some adjustments, V[y] can be approximated by a Chi-square distribution, and hence the test-statistic \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}} can be approximated by a T-distribution.

Let’s digest the salient points of his work

Assume \lambda_{1} = \frac{1}{n_{1}} and \lambda_{2} = \frac{1}{n_{2}}

Assume f_{1} = (n_{1}-1) and f_{2} = (n_{2}-1)

An estimate for the variance of y is \lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2}

Now, in Lesson 73, we learned that \frac{(n-1)s^{2}}{\sigma^{2}} is a Chi-square distribution. Based on this logic, we can write

s^{2} = \frac{1}{n-1}\sigma^{2}*\chi^{2}

So, \lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2} is of the form,

\lambda_{1}\frac{1}{n_{1}-1}\sigma_{1}^{2}\chi_{1}^{2}+\lambda_{2}\frac{1}{n_{2}-1}\sigma_{2}^{2}\chi_{2}^{2}

or,

\lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2} = a\chi_{1}^{2}+b\chi_{2}^{2}

a = \frac{\lambda_{1}\sigma_{1}^{2}}{f_{1}} | b = \frac{\lambda_{2}\sigma_{2}^{2}}{f_{2}}

Welch showed that if z = a\chi_{1}^{2}+b\chi_{2}^{2}, then, the distribution of z can be approximated using a Chi-square distribution with a random variable \chi=\frac{z}{g} and f degrees of freedom.

He found the constants f and g by equating the moments of z with the moments of this Chi-square distribution, i.e., the Chi-square distribution where the random variable \chi is \frac{z}{g}.

This is how he finds f and g

The first moment (the mean) and the second central moment (the variance) of a general Chi-square distribution are the degrees of freedom (f) and two times the degrees of freedom (2f).

The random variable we considered is \chi=\frac{z}{g}.

So,

E[\frac{z}{g}] = f | V[\frac{z}{g}] = 2f

Since g is a constant that needs to be estimated, we can reduce these equations as

\frac{1}{g}E[z] = f | \frac{1}{g^{2}}V[z] = 2f

Hence,

E[z] = gf | V[z] = 2g^{2}f

Now, let’s take the equation z = a\chi_{1}^{2}+b\chi_{2}^{2} and find the expected value and the variance of z.

E[z] = aE[\chi_{1}^{2}]+bE[\chi_{2}^{2}]

E[z] = af_{1}+bf_{2}

V[z] = a^{2}V[\chi_{1}^{2}]+b^{2}V[\chi_{2}^{2}]

V[z] = a^{2}2f_{1}+b^{2}2f_{2}=2a^{2}f_{1}+2b^{2}f_{2}

Now, he equates the moments derived using the equation for z to the moments derived from the Chi-square distribution of \frac{z}{g}

Equating the first moments: gf=af_{1}+bf_{2}

Equating the second moments: 2g^{2}f = 2a^{2}f_{1}+2b^{2}f_{2}

The above equation can be written as

g*(gf) = a^{2}f_{1}+b^{2}f_{2}

or,

g*(af_{1}+bf_{2}) = a^{2}f_{1}+b^{2}f_{2}

From here,

g = \frac{a^{2}f_{1}+b^{2}f_{2}}{af_{1}+bf_{2}}

Using this with gf=af_{1}+bf_{2}, we can obtain f.

f = \frac{(af_{1}+bf_{2})^{2}}{a^{2}f_{1}+b^{2}f_{2}}

Since we know the terms for a and b, we can say,

f = \frac{(\lambda_{1}\sigma_{1}^{2}+\lambda_{2}\sigma_{2}^{2})^{2}}{\frac{\lambda_{1}^{2}\sigma_{1}^{4}}{f_{1}}+\frac{\lambda_{2}^{2}\sigma_{2}^{4}}{f_{2}}}

Since the variance of y follows an approximate Chi-square distribution with f degrees of freedom, we can assume that the test-statistic \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}} follows an approximate T-distribution with f degrees of freedom.

Welch also showed that an unbiased estimate for f is

f = \frac{(\lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2})^{2}}{\frac{\lambda_{1}^{2}s_{1}^{4}}{f_{1}+2}+\frac{\lambda_{2}^{2}s_{2}^{4}}{f_{2}+2}} - 2

Essentially, he substitutes s_{1}^{2} for \sigma_{1}^{2} and s_{2}^{2} for \sigma_{2}^{2}, and, to correct for the bias, he adds 2 to the degrees of freedom in the denominator and subtracts an overall 2 from this fraction. He argues that this correction produces the best unbiased estimate for f.

Later authors like Franklin E. Satterthwaite showed that the bias correction might not be necessary, and it would suffice to use s_{1}^{2} for \sigma_{1}^{2}, and s_{2}^{2} for \sigma_{2}^{2} in the original equation, as in,

f = \frac{(\lambda_{1}s_{1}^{2}+\lambda_{2}s_{2}^{2})^{2}}{\frac{\lambda_{1}^{2}s_{1}^{4}}{f_{1}}+\frac{\lambda_{2}^{2}s_{2}^{4}}{f_{2}}}

Since we know that \lambda_{1} = \frac{1}{n_{1}}, \lambda_{2} = \frac{1}{n_{2}}, f_{1} = (n_{1}-1), and f_{2} = (n_{2}-1), we can finally say

When the population variances are not equal, i.e., when \sigma_{1}^{2} \neq \sigma_{2}^{2}, the test-statistic is t_{0}^{*}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}, and it follows an approximate T-distribution with f degrees of freedom.

The degrees of freedom can be estimated as

f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1) + 2}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)+2}} - 2

or,

f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1)}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)}}

This is now popularly known as Welch's t-Test.

Let’s now go back to Tom and help him with his new theory.

We will compare the west branch Mohawk River with the Utica River.

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7
n_{1}=10 | \bar{x_{1}} = 12.5 | s_{1}^{2} = 58.28

Utica River: 4, 4, 6, 4, 5, 7, 8
n_{3}=7 | \bar{x_{3}} = 5.43 | s_{3}^{2} = 2.62

Since we cannot assume that the population variances are equal, we will use Welch’s t-Test.

We compute the test-statistic and check how likely it is to see such a value in a T-distribution (approximate null distribution) with so many degrees of freedom.

t_{0}^{*}=\frac{\bar{x_{1}}-\bar{x_{3}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{3}^{2}}{n_{3}}}}

t_{0}^{*}=\frac{12.5-5.43}{\sqrt{\frac{58.28}{10}+\frac{2.62}{7}}}

t_{0}^{*}=2.84

Let’s compute the bias-corrected degrees of freedom suggested by Welch.

f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1) + 2}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)+2}} - 2

f = \frac{(\frac{58.28}{10}+\frac{2.62}{7})^{2}}{\frac{(58.28/10)^{2}}{(10 - 1) + 2}+\frac{(2.62/7)^{2}}{(7-1)+2}} - 2

f = 10.38

We can round it down to 10 degrees of freedom.

The test-statistic is 2.84. Since the alternate hypothesis is that the difference is greater than zero, Tom has to verify how likely it is to see a value greater than 2.84 in the approximate null distribution. Tom has to reject the null hypothesis if this probability (p-value) is smaller than the selected rate of rejection. 

Look at this visual.

The distribution is an approximate T-distribution with 10 degrees of freedom. Since he opted for a rejection level of 10%, there is a cutoff on the distribution at 1.37.

1.37 is the quantile on the right tail corresponding to a 10% probability (rate of rejection) for a T-distribution with ten degrees of freedom.

If the test statistic (t_{0}^{*}) is greater than t_{critical}, which is 1.37, he will reject the null hypothesis. At that point (i.e., at values greater than 1.37), there would be sufficient confidence to say that the difference is significantly greater than zero.

It is equivalent to rejecting the null hypothesis if P(T > t_{0}^{*}) (the p-value) is less than \alpha

We can read t_{critical} off the standard T-table, or we can compute P(T > t_{0}^{*}) from the distribution.

At ten degrees of freedom (df=10), and \alpha=0.1, t_{critical}=1.372 and P(T>t_{0}^{*})=0.009.
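If you want to verify these numbers in R, here is a hedged sketch using the same two samples. Keep in mind that t.test() uses Satterthwaite’s degrees of freedom (about 10.1 here) rather than Welch’s bias-corrected 10.38, so its p-value will be close to, but not exactly, 0.009.

# A quick check in R (a sketch)
west  = c(3, 7, 25, 10, 15, 6, 12, 25, 15, 7)
utica = c(4, 4, 6, 4, 5, 7, 8)

n1 = length(west); n3 = length(utica)
v1 = var(west); v3 = var(utica)

t0 = (mean(west)-mean(utica))/sqrt(v1/n1 + v3/n3) # ~2.84

# Welch's bias-corrected degrees of freedom from this lesson
f_welch = ((v1/n1 + v3/n3)^2)/((v1/n1)^2/(n1+1) + (v3/n3)^2/(n3+1)) - 2 # ~10.38

qt(0.9,df=10)    # t_critical, ~1.372
1 - pt(t0,df=10) # p-value, ~0.009

t.test(west,utica,alternative="greater") # Welch's t-Test with Satterthwaite's df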

Since the test-statistic t_{0}^{*} is in the rejection region, or since p-value < \alpha, Tom can reject the null hypothesis H_{0} that \mu_{1}-\mu_{2}=0.
He now has evidence beyond statistical doubt to claim that the factory is illegally dumping their untreated waste into the west branch Mohawk River.

Is it time for a lawsuit?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 94 – The Two-Sample Hypothesis Test – Part III

On the Difference in Means

using the t-Test

H_{0}: \mu_{1} - \mu_{2} = 0

H_{A}: \mu_{1} - \mu_{2} > 0

H_{A}: \mu_{1} - \mu_{2} < 0

H_{A}: \mu_{1} - \mu_{2} \neq 0

Tom grew up in the City of Mohawk, the land of natural springs, known for its pristine water. Tom’s house is alongside the west branch of Mohawk, one such pristine river. Every once in a while, Tom and his family go to the nature park on the banks of Mohawk. It is customary for Tom and his children to take a swim.

Lately, he has been reading in the local newspapers that the rivers’ arsenic levels have increased. Tom starts associating the cause of the alleged arsenic increases with this new factory in his neighborhood, just upstream of Mohawk. They could be illegally dumping their untreated waste into the west branch.

He decided to test the waters of the west branch and the east branch of the Mohawk River. 

His buddy Ron, an environmental engineer, would help him with the laboratory testing and the likes.

Over the next ten days, Tom and Ron collected water samples from the west and east branches, and Ron got them tested in his lab for arsenic concentration.

In parts per billion, the sample data looked like this.

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7

East Branch: 4, 6, 24, 11, 14, 7, 11, 25, 13, 5

If Tom’s theory is correct, they should find the average arsenic concentration in the west branch to be significantly greater than the average arsenic concentration in the east branch. 

How can Tom test his theory?

.

.

.

You are right!

He can use the hypothesis testing framework and verify if there is evidence beyond a statistical doubt.

Tom establishes the null and alternate hypotheses. He assumes that the factory does not illegally dump their untreated waste into the west branch, so the average arsenic concentration in the west branch should be equal to the average arsenic concentration in the east branch Mohawk River. Or, the difference in their means is zero.

H_{0}: \mu_{1} - \mu_{2} = 0

Against this null hypothesis, he pits his theory that they are indeed illegally dumping their untreated waste. So, the average arsenic concentration in the west branch should be greater than the average arsenic concentration in the east branch Mohawk River — the difference in their means is greater than zero.

H_{A}: \mu_{1} - \mu_{2} > 0

The alternate hypothesis is one-sided. A significant positive difference needs to be seen to reject the null hypothesis.

Tom is taking a 10% risk of rejecting the null hypothesis; \alpha = 0.1. His Type I error is 10%.

Suppose the factory does not affect the water quality, but the ten samples he collected show a difference in the sample means much greater than zero. He would then reject the null hypothesis, committing an error (a Type I error) in his decision making.

You probably know that there is a certain level of subjectivity in the choice of \alpha.

Tom may want to prove that this factory is the leading cause of the increased arsenic levels in the west branch. So he may be willing to take a greater risk of wrongly rejecting the null hypothesis, i.e., he would be inclined to select a larger value of \alpha.

Someone who represents the factory management would be inclined to select a smaller value for \alpha, as that makes it less likely to reject the null hypothesis.

So, assuming that the null hypothesis is true, the decision to reject or not to reject is based on the value one chooses for \alpha.

Anyhow, now that the basic testing framework is set up, let’s look at what Tom needs.

He needs a test-statistic to represent the difference in the means of the two samples.
He needs the null distribution that this test-statistic converges to. In other words, he needs a probability distribution of the test-statistic to verify his null hypothesis: how likely is it to see a value as large as (or larger than) the test-statistic in the null distribution?

Let’s take Tom on a mathematical excursion

There are two samples represented by random variables X_{1} and X_{2}.

The mean and variance of X_{1} are \mu_{1} and \sigma_{1}^{2}. We have one sample of size n_{1} from this population. Suppose the sample mean and the sample variance are \bar{x_{1}} and s_{1}^{2}.

The mean and variance of X_{2} are \mu_{2} and \sigma_{2}^{2}. We have one sample of size n_{2} from this population. Suppose the sample mean and the sample variance are \bar{x_{2}} and s_{2}^{2}.

The hypothesis test is on the difference in means; \mu_{1} - \mu_{2}

Naturally, a good estimator of the difference in population means (\mu_{1}-\mu_{2}) is the difference in sample means (\bar{x_{1}}-\bar{x_{2}}).

y = \bar{x_{1}}-\bar{x_{2}}

If we know the probability distributions of \bar{x_{1}} and \bar{x_{2}}, we could perhaps infer the probability distribution of y.

The sample mean is an unbiased estimate of the true mean, so the expected value of the sample mean is equal to the truth. E[\bar{x}] = \mu. We learned this in Lesson 67.

The variance of the sample mean (V[\bar{x}]) is \frac{\sigma^{2}}{n}. It indicates the spread around the center of the distribution. We learned this in Lesson 68.

Putting these two together, and with the central limit theorem, we can say \bar{x} \sim N(\mu,\frac{\sigma^{2}}{n})

So, for the two samples,

\bar{x_{1}} \sim N(\mu_{1},\frac{\sigma_{1}^{2}}{n_{1}}) | \bar{x_{2}} \sim N(\mu_{2},\frac{\sigma_{2}^{2}}{n_{2}})

If \bar{x_{1}} and \bar{x_{2}} are normal distributions, it is reasonable to assume that y=\bar{x_{1}}-\bar{x_{2}} will be a normal distribution.

y \sim N(E[y], V[y])

We should see what E[y] and V[y] are.

Expected Value of y

y = \bar{x_{1}} - \bar{x_{2}}

E[y] = E[\bar{x_{1}} - \bar{x_{2}}]

E[y] = E[\bar{x_{1}}] - E[\bar{x_{2}}]

Since the expected value of the sample mean is the true population mean,

E[y] = \mu_{1} - \mu_{2}

Variance of y

y = \bar{x_{1}} - \bar{x_{2}}

V[y] = V[\bar{x_{1}} - \bar{x_{2}}]

V[y] = V[\bar{x_{1}}] + V[\bar{x_{2}}] (can you tell why?)

Since, the variance of the sample mean (V[\bar{x}]) is \frac{\sigma^{2}}{n},

V[y] = \frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}

Using the expected value and the variance of y, we can now say that

y \sim N((\mu_{1} - \mu_{2}), (\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}))

Or, if we standardize it,

z = \frac{y-(\mu_{1} - \mu_{2})}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}}} \sim N(0, 1)

At this point, it might be tempting to say that z is the test-statistic and the null distribution is the standard normal distribution.

Hold on!

We can certainly say that, under the null hypothesis, \mu_{1} = \mu_{2}. So, z further reduces to

z = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}}} \sim N(0, 1)

However, there are two unknowns here: the population variances \sigma_{1}^{2} and \sigma_{2}^{2}.

Think about making some assumptions about these unknowns

What could be a reasonable estimate for these unknowns?

.

.

.

You got it. The sample variance s_{1}^{2} and s_{2}^{2}.

Now, let’s take the samples that Tom and Ron collected and compute the sample mean and sample variance for each. The equations for these are anyhow at your fingertips!

West Branch: 3, 7, 25, 10, 15, 6, 12, 25, 15, 7
\bar{x_{1}}=12.5 | s_{1}^{2}=58.28

East Branch: 4, 6, 24, 11, 14, 7, 11, 25, 13, 5
\bar{x_{2}}=12 | s_{2}^{2}=54.89

Look at the value of the sample variance. 58.28 and 54.89. They seem close enough. Maybe, just maybe, could the variance of the samples be equal?

I am asking you to entertain the proposition that the population variances of the two random variables X_{1} and X_{2} are equal, i.e., that we are comparing the difference in the means of two populations with a common variance.

Say, \sigma_{1}^{2}=\sigma_{2}^{2}=\sigma^{2}

Our test-statistic will now reduce to

\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{\sigma^{2}}{n_{1}} + \frac{\sigma^{2}}{n_{2}}}}

or

\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\sigma^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

There is a common variance \sigma^{2}, and it should suffice to come up with a reasonable estimate for this combined variance.

Let’s say that s^{2} is a good estimate for \sigma^{2}.

We call this the estimate for the pooled variance

If we have a formula for s^{2} and if it is an unbiased estimate for \sigma^{2}, then, we can replace s^{2} for \sigma^{2}.

What is the right equation for s^{2}?

Since s^{2} is the estimate for the pooled variance, it will be reasonable to assume that it is some weighted average of the individual sample variances.

s^{2} = w_{1}s_{1}^{2} + w_{2}s_{2}^{2}

Let’s compute the expected value of s^{2}.

E[s^{2}] = E[w_{1}s_{1}^{2} + w_{2}s_{2}^{2}]

E[s^{2}] = w_{1}E[s_{1}^{2}] + w_{2}E[s_{2}^{2}]

In Lesson 70, we learned that sample variance s^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2} is an unbiased estimate for population variance \sigma^{2}. The factor (n-1) in the denominator is to correct for a bias of -\frac{\sigma^{2}}{n}.

An unbiased estimate means E[s^{2}]=\sigma^{2}. If we apply this understanding to the pooled variance equation, we can see that

E[s^{2}] = w_{1}\sigma_{1}^{2} + w_{2}\sigma_{2}^{2}

Since \sigma_{1}^{2}=\sigma_{2}^{2} = \sigma^{2}, we get

E[s^{2}] = w_{1}\sigma^{2} + w_{2}\sigma^{2}

E[s^{2}] = (w_{1} + w_{2})\sigma^{2}

To get an unbiased estimate for \sigma^{2}, we need the weights to add up to 1. What could those weights be? Could it relate to the sample sizes?

With that idea in mind, let’s take a little detour.

Let me ask you a question.

What is the probability distribution that relates to sample variance?

.

.

.

You might have to go down memory lane to Lesson 73.

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom. We learned that the term \frac{(n-1)s^{2}}{\sigma^{2}} is a sum of (n-1) squared standard normal distributions.

So,

\frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} \sim \chi^{2}_{n_{1}-1} | \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} \sim \chi^{2}_{n_{2}-1}

Since the two samples are independent, the sum of these two terms, \frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} and \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} will follow a Chi-square distribution with (n_{1}+n_{2}-2) degrees of freedom.

Add them and see. \frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} is a sum of (n_{1}-1) squared standard normal distributions, and \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} is a sum of (n_{2}-1) squared standard normal distributions. Together, they are a sum of (n_{1}+n_{2}-2) squared standard normal distributions.

Since \frac{(n_{1}-1)s_{1}^{2}}{\sigma_{1}^{2}} + \frac{(n_{2}-1)s_{2}^{2}}{\sigma_{2}^{2}} \sim \chi^{2}_{n_{1}+n_{2}-2}

we can say,

\frac{(n_{1}+n_{2}-2)s^{2}}{\sigma^{2}} \sim \chi^{2}_{n_{1}+n_{2}-2}

So we can think of developing weights in terms of the degrees of freedom of the Chi-square distribution. The first sample contributes n_{1}-1 degrees of freedom, and the second sample contributes n_{2}-1 degrees of freedom towards a total of n_{1}+n_{2}-2.

So the weight of the first sample can be w_{1}=\frac{n_{1}-1}{n_{1}+n_{2}-2} and the weight of the second sample can be w_{2}=\frac{n_{2}-1}{n_{1}+n_{2}-2}, and they add up to 1.

This then means that the equation for the estimate of the pooled variance is

s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2}

Since it is an unbiased estimate of \sigma^{2}, we can use s^{2} in place of \sigma^{2} in our test-statistic, which then looks like this.

t_{0} = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

where, s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2} is the estimate for the pooled variance.

Did you notice that I use t_{0} to represent the test-statistic?

Yes, I am getting to T with it.

It is a logical extension of the idea that, for one sample, t_{0}=\frac{\bar{x}-\mu}{\sqrt{\frac{s^{2}}{n}}} follows a T-distribution with (n-1) degrees of freedom.

There, the idea was derived from the fact that when you replace the population variance with sample variance, and the sample variance is related to a Chi-square distribution with (n-1) degrees of freedom, the test-statistic t_{0}=\frac{\bar{x}-\mu}{\sqrt{\frac{s^{2}}{n}}} follows a T-distribution with (n-1) degrees of freedom. Check out Lesson 73 (Learning from “Student”) to refresh your memory.

Here, in the case of the difference in means between two samples (\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\sigma^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}), the pooled population variance \sigma^{2} is replaced by its unbiased estimator (\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2}, which, in turn, is related to a Chi-square distribution with (n_{1}+n_{2}-2) degrees of freedom.

Hence, under the proposition that the population variances of the two random variables X_{1} and X_{2} are equal, the test-statistic is t_{0} = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}, and it follows a T-distribution with (n_{1}+n_{2}-2) degrees of freedom.


Now let’s evaluate the hypothesis that Tom set up — Finally!!

We can compute the test-statistic and check how likely it is to see such a value in a T-distribution (null distribution) with so many degrees of freedom.

t_{0} = \frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}

where s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2}

In Tom’s case, both n_{1}=n_{2}=10. So the weights will be equal to 0.5.

s^{2}=0.5s_{1}^{2}+0.5s_{2}^{2}

s^{2}=0.5*58.28+0.5*54.89

s^{2}=56.58

t_{0} = \frac{12.5-12}{\sqrt{56.58(\frac{1}{10}+\frac{1}{10})}}

t_{0} = \frac{0.5}{\sqrt{11.316}}

t_{0} = 0.1486

The test-statistic is 0.1486. Since the alternate hypothesis is that the difference is greater than zero (H_{A}:\mu_{1}-\mu_{2}>0), Tom has to verify how likely it is to see a value greater than 0.1486 in the null distribution. Tom has to reject the null hypothesis if this probability (p-value) is smaller than the selected rate of rejection. A probability smaller than \alpha indicates that the difference is sufficiently large that, in a T-distribution with so many degrees of freedom, the likelihood of seeing a value greater than the test-statistic is small. In other words, the difference in the means is already sufficiently greater than zero and falls in the region of rejection.

Look at this visual.

The distribution is a T-distribution with 18 degrees of freedom (10 + 10 – 2). Tom had collected ten samples each for this test. Since he opted for a rejection level of 10%, there is a cutoff on the distribution at 1.33.

1.33 is the quantile on the right tail corresponding to a 10% probability (rate of rejection) for a T-distribution with eighteen degrees of freedom.

If the test statistic (t_{0}) is greater than t_{critical}, which is 1.33, he will reject the null hypothesis. At that point (i.e., at values greater than 1.33), there would be sufficient confidence to say that the difference is significantly greater than zero.

This decision is equivalent to rejecting the null hypothesis if P(T>t_{0}) (the p-value) is less than \alpha.

We can read t_{critical} off the standard T-table, or P(T>t_{0}) can be computed from the distribution.

At df = 18, and \alpha=0.1, t_{critical}=1.33 and P(T>t_{0})=0.44.
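Here is a hedged sketch in R that reproduces these numbers with Tom’s two samples; t.test() with var.equal = TRUE runs exactly this pooled two-sample t-Test.

# A quick check in R (a sketch)
west = c(3, 7, 25, 10, 15, 6, 12, 25, 15, 7)
east = c(4, 6, 24, 11, 14, 7, 11, 25, 13, 5)

t.test(west,east,alternative="greater",var.equal=TRUE) # t ~ 0.149, df = 18, p-value ~ 0.44
qt(0.9,df=18) # t_critical, ~1.33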

Since the test-statistic t_{0} is not in the rejection region, or since p-value > \alpha, Tom cannot reject the null hypothesis H_{0} that \mu_{1} - \mu_{2}=0
He cannot count on the theory that the factory is illegally dumping their untreated waste into the west branch Mohawk River until he finds more evidence.

Tom is not convinced. The endless newspaper stories on arsenic levels are bothering him. He begins to wonder whether the factory is illegally dumping their untreated waste into both the west and east branches of the Mohawk River. That could be one reason why he saw no significant difference in the concentrations.

Over the next week, he takes Ron with him to the Utica River, a tributary of the Mohawk that branches off right before the Mohawk meets the factory. If his new theory is correct, he should find that the mean arsenic concentration in either the west or the east branch is significantly greater than the mean arsenic concentration in the Utica River.

Ron again helped him with the laboratory testing. In parts per billion, the third sample data looks like this.

Utica River: 4, 4, 6, 4, 5, 7, 8

There are 7 data points. n_{3}=7

The sample mean \bar{x_{3}}=5.43

The sample variance s_{3}^{2} = 2.62

Can you help Tom with his new hypothesis tests? Does the proposition still hold?

To be continued…

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 73 – Learning from “Student”

On the fifteenth day of July 2018, Jenny and Joe discussed the confidence interval for the mean of the population.

The 100(1-\alpha)% confidence interval for the true mean \mu is [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}].

\bar{x} is the mean of a random sample of size n. The assumption is that the sample is drawn from a population with a true mean \mu and true standard deviation \sigma.

The end-points [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}] are called the lower and the upper confidence limits.
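As a quick refresher of the arithmetic, here is a minimal sketch of that interval in R; the sample mean, sigma, and n below are made-up numbers, used only to show the computation.

# A minimal sketch with hypothetical numbers
xbar  = 50    # hypothetical sample mean
sigma = 5     # hypothetical (known) population standard deviation
n     = 30    # hypothetical sample size
alpha = 0.05

z = qnorm(1-alpha/2) # ~1.96
c(xbar - z*sigma/sqrt(n), xbar + z*sigma/sqrt(n)) # lower and upper confidence limits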

While developing the confidence interval of the mean water quality at the beach with a sample of 48, Jenny pointed out that the value for the true standard deviation, \sigma is not known.

Joe collected a much larger sample (all data points available for the Rockaway beach) and computed an estimate for the true standard deviation. Based on the principle of consistency, he suggested that this estimate is close to the truth, so it can be used as \sigma.

Jenny rightly pointed out that we will not always have such a large sample. Most often, the data is limited.

Joe’s suggestion was to use the sample standard deviation, s, i.e., the standard deviation estimated from the limited sample, in place of \sigma. However, Jenny was concerned that this would introduce more error into the estimation of the intervals.

March 1908

This was a concern for W. S. Gosset (aka “Student”) in 1908. He points out that one way of dealing with this difficulty is to run the experiment many times, i.e., collect a large enough sample so that the standard deviation can be computed once for all and used for subsequent similar experiments.

He further points out that there are numerous experiments that cannot easily be repeated often. In the situation where a large sample cannot be obtained, one has to rely on the results from the small sample.

The standard deviation of the population (\sigma) is not known a priori.

The confidence interval Joe and Jenny derived is based on the conjecture that the distribution of the sample mean (\bar{x}) tends to a normal distribution. \bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}}), and if the sample size is large enough, it would be reasonable to substitute sample standard deviation s in place of the population standard deviation \sigma. But it is not clear how “large” the sample size should be.

The sample standard deviation can be computed as s=\sqrt{\frac{1}{n-1}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}}. While s is a perfectly good estimate of \sigma, it is not equal to \sigma.
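In R, the built-in sd() uses exactly this (n-1) divisor. A minimal check, with an arbitrary sample:

x <- c(4, 4, 6, 4, 5, 7, 8)   # any sample will do
n <- length(x)

sqrt(sum((x - mean(x))^2) / (n - 1))   # s computed from the formula
sd(x)                                  # the built-in function gives the same value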

“Student” pointed out that when we substitute s for \sigma, we cannot just assume that \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} will tend to a normal distribution just like Z = \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0,1).

He derived the frequency distribution of \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} in his landmark paper “The Probable Error of a Mean.”

This distribution came to be known as the Student’s t-distribution.

In his paper, he did not use the notation t though. He referred to it as quantity z, obtained by dividing the distance between the mean of a sample and the mean of the population by the standard deviation of the sample.

He derived the distribution of the sample variance and the sample standard deviation, showed that there is no dependence between s and \bar{x}, and used this property to derive the distribution of \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}.

Today, we will go through the process of deriving the probability distributions for the sample variance s^{2}, sample standard deviation s and the quantity t.

t = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}

It is important that we know these distributions. They will recur from now on, and we will use them in many applications.

I am presenting these derivations using standard techniques. There may be simpler ways to derive them, but I found this step-by-step thought and derivation process enriching.

During this phase, I will refer back to “Student’s” paper several times. I will also use the explanations given by R. A. Fisher in his papers on “Student.”

You can follow along with these steps using pen and paper, or you can just focus on the thought process and skip the derivations. Either way, it is fun to learn from “Student.” Trust me.


t = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}

To know the distribution of t, we should know the distributions of \bar{x} and s.

We already know that \bar{x} tends to a normal distribution; \bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}}), but what is the limiting distribution of s, i.e., what is the probability distribution function f(s) of the sample standard deviation?

To know this, we should first know the limiting distribution of the sample variance s^{2}, from which we can derive the distribution of s.

What is the frequency distribution of the sample variance?

Expressing variance as the sum of squares of normal random variables.

Let’s take the equation of the sample variance s^{2} and see if there is a pattern in it.

s^{2} = \frac{1}{n-1} \sum(x_{i}-\bar{x})^{2}

Move the n-1 over to the left-hand side and do some algebra.

(n-1)s^{2} = \sum(x_{i}-\bar{x})^{2}

(n-1)s^{2} = \sum(x_{i} - \mu -\bar{x} + \mu)^{2}

(n-1)s^{2} = \sum((x_{i} - \mu) -(\bar{x} - \mu))^{2}

(n-1)s^{2} = \sum[(x_{i} - \mu)^{2} + (\bar{x} - \mu)^{2} -2(x_{i} - \mu)(\bar{x} - \mu)]

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + \sum (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)\sum(x_{i} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(\sum x_{i} - \sum \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(n\bar{x} - n \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)(\bar{x} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)^{2}

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2}

Let’s divide both sides of the equation by \sigma^{2}.

\frac{(n-1)s^{2}}{\sigma^{2}} = \frac{1}{\sigma^{2}}(\sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2})

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - \frac{n}{\sigma^{2}} (\bar{x} - \mu)^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - (\frac{\bar{x} - \mu}{\sigma/\sqrt{n}})^{2}

The right-hand side now looks like a sum of squared standard normal random variables.

\frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2}

Sum of squares of (n – 1) standard normal random variables.

Does that ring a bell? Sum of squares is the language of the Chi-square distribution. We learned this in Lesson 53.

If there are n standard normal random variables, Z_{1}, Z_{2}, ..., Z_{n}, their sum of squares is a Chi-square distribution with n degrees of freedom. Its probability density function is f(\chi)=\frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n}{2}-1)!} for \chi > 0 and 0 otherwise.

Since we have \frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2},

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

\frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{n-1} with a probability distribution function

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n-1}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n-1}{2}-1)!}
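You can convince yourself of this result with a quick simulation in R; the sample size, mean, and standard deviation below are arbitrary choices:

set.seed(1)

n     <- 10
mu    <- 5
sigma <- 2

# Simulate many sample variances and scale them by (n - 1) / sigma^2
scaled_s2 <- replicate(10000, (n - 1) * var(rnorm(n, mu, sigma)) / sigma^2)

# The histogram should match the Chi-square density with (n - 1) degrees of freedom
hist(scaled_s2, breaks = 50, freq = FALSE)
curve(dchisq(x, df = n - 1), add = TRUE, lwd = 2)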

The frequency distribution of the sample variance.

It turns out that, with some modification, this equation is the frequency distribution f(s^{2}) of the sample variance.

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n-1}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n-1}{2}-1)!}

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}(\frac{(n-1)s^{2}}{2\sigma^{2}})^{\frac{n-3}{2}}}{(\frac{n-3}{2})!} e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(\frac{n-1}{\sigma^{2}})^{\frac{n-1}{2}}\frac{1}{\frac{n-1}{\sigma^{2}}}\frac{1}{2^{\frac{n-3}{2}}}(s^{2})^{\frac{n-3}{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{1}{\frac{n-1}{\sigma^{2}}}\frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{1}{\frac{n-1}{\sigma^{2}}}[\frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}]

The above equation can be viewed as f(aX) = \frac{1}{a}f(X) where X = s^{2} and a = \frac{n-1}{\sigma^{2}}.

These few steps will clarify this further.

Let Y = aX

P(Y \le y) = P(aX \le y)

P(Y \le y) = P(X \le \frac{1}{a}y)

F_{Y}(y) = F_{X}(\frac{1}{a}y)

\frac{d}{dy}F_{Y}(y) = \frac{d}{dy}F_{X}(\frac{1}{a}y)

Applying the fundamental theorem of calculus and chain rule together, we get,

f_{Y}(y) = \frac{1}{a}f_{X}(\frac{1}{a}y)

f(aX) = \frac{1}{a}f(X)

Hence, the frequency distribution of s^{2} is

f(s^{2}) = \frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}
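As a sanity check, this derived density should agree with R's built-in Chi-square density after the change of scale we just discussed. A minimal sketch, where n, sigma, and the point s^{2} = 3 are arbitrary choices and f_s2 is my own function name:

n     <- 10
sigma <- 2

# The derived frequency distribution of the sample variance
f_s2 <- function(s2, n, sigma) {
  0.5 * (1 / factorial((n - 3) / 2)) * (n - 1)^((n - 1) / 2) *
    (1 / 2^((n - 3) / 2)) * (sqrt(s2)^(n - 3) / sigma^(n - 1)) *
    exp(-0.5 * (n - 1) * s2 / sigma^2)
}

s2 <- 3
f_s2(s2, n, sigma)                                           # derived density
((n - 1) / sigma^2) * dchisq((n - 1) * s2 / sigma^2, n - 1)  # scaled Chi-square density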

From f(s^{2}), we can derive the probability distribution of s, i.e., f(s).

WHAT IS THE FREQUENCY DISTRIBUTION OF THE SAMPLE STANDARD DEVIATION?

Here, I refer you to this elegant approach by “Student.”

“The distribution of s may be found from this (s^{2}), since the frequency of s is equal to that of s^{2} and all that we must do is to compress the base line suitably.”

Let f_{1} = f(s^{2}) be the distribution of s^{2}.

Let f_{2} = f(s) be the distribution of s.

Since the frequency of s^{2} is equal to that of s, we can assume,

f_{1}ds^{2} = f_{2}ds

In other words, the probability of finding s^{2} in an infinitesimal interval ds^{2} is equal to the probability of finding s in an infinitesimal interval ds.

Since ds^{2} = 2s ds, we can reduce this using substitution as

2sf_{1}ds = f_{2}ds

or

f_{2} = 2sf_{1}

The frequency distribution of s is equal to 2s multiplied by the frequency distribution of s^{2}.

Hence, the frequency distribution of s is

f(s) = 2s\frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(s) = \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-2}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}
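Here is a minimal sketch in R that checks this density against simulated sample standard deviations; n and sigma are arbitrary choices and f_s is my own function name:

set.seed(2)

n     <- 10
sigma <- 2

# The derived frequency distribution of the sample standard deviation
f_s <- function(s, n, sigma) {
  (1 / factorial((n - 3) / 2)) * (n - 1)^((n - 1) / 2) *
    (1 / 2^((n - 3) / 2)) * (s^(n - 2) / sigma^(n - 1)) *
    exp(-0.5 * (n - 1) * s^2 / sigma^2)
}

# Simulate many sample standard deviations and overlay the derived density
s_sim <- replicate(10000, sd(rnorm(n, mean = 0, sd = sigma)))
hist(s_sim, breaks = 50, freq = FALSE)
curve(f_s(x, n, sigma), add = TRUE, lwd = 2)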

WHAT IS THE FREQUENCY DISTRIBUTION OF t?

We are now ready to derive the frequency distribution of t.

Some of the following explanation can be found in R. A. Fisher’s 1939 paper. I broke it down step by step for our classroom.

t = \frac{\bar{x}-\mu}{s/\sqrt{n}}

\bar{x} - \mu = \frac{st}{\sqrt{n}}

For a given value of s, d\bar{x} = \frac{s}{\sqrt{n}}dt

The frequency distribution of \bar{x} is

\frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{n}{2\sigma^{2}}(\bar{x}-\mu)^{2}}

Remember, \bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}}).

Substituting \bar{x} - \mu = \frac{st}{\sqrt{n}}

The distribution becomes

\frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{n}{2\sigma^{2}}(st/\sqrt{n})^{2}}

or

\frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}

For a given value of s, the probability that \bar{x} will be in d\bar{x} is

df = \frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}d\bar{x}

Substituting d\bar{x} = \frac{s}{\sqrt{n}}dt, we can get

df = \frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}} \frac{s}{\sqrt{n}}dt

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}} s dt

Fisher points out that for all values of s, we can substitute the frequency distribution of s and integrate it over the interval 0 to \infty. This can be done because \bar{x} and s are independent.

So the joint distribution becomes

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma} dt \int_{0}^{\infty} s e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-2}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

In the next few steps, I will rearrange some terms to convert the integral into a recognizable form.

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma} dt \int_{0}^{\infty}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}s \frac{s^{n-2}}{\sigma^{n-1}}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}} \frac{1}{\sigma^{n-1}}dt \int_{0}^{\infty}s^{n-1} e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}\frac{(n-1)^{\frac{n}{2}}}{(n-1)^{1/2}}\frac{1}{2^{\frac{n-3}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{2\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{1}{2^{\frac{n-3}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{1}{2^{\frac{n-2}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-\frac{n -1 + t^{2}}{2\sigma^{2}}s^{2}}ds

Hang in there. We will need some more concentration!

Let k = \frac{n - 1 + t^{2}}{2\sigma^{2}}

The equation becomes

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-ks^{2}}ds

Some more substitutions.

Let p = ks^{2}

Then dp = 2ksds

The equation can be reduced as

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-p}\frac{1}{2ks}dp

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \int_{0}^{\infty}\frac{1}{2k}s^{n-2} e^{-p}dp

Since s = \frac{p^{1/2}}{k^{1/2}}

The equation becomes,

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \frac{1}{2k^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

Replacing k = \frac{n - 1 + t^{2}}{2\sigma^{2}}

The equation becomes

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \frac{1}{2(\frac{n - 1 + t^{2}}{2\sigma^{2}})^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

Some more reduction …

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \frac{(2\sigma^{2})^{n/2}}{2(n - 1 + t^{2})^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \frac{(2^{n/2}\sigma^{n})}{2(n - 1 + t^{2})^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

df = \frac{1}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}} dt \frac{1}{(n - 1 + t^{2})^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

df = \frac{1}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} dt \frac{1}{(\frac{n-1+t^{2}}{n-1})^{\frac{n}{2}}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

df = \frac{1}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} dt \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}} \int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

The integral you see on the right is the Gamma function \Gamma(\frac{n}{2}), which we can write as (\frac{n-2}{2})! since x! = \Gamma(x+1).

There we go, the t-distribution emerges

df = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}} dt

The probability distribution of t is

f(t) = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}}


This is the t-distribution with (n-1) degrees of freedom. As you can see, the function contains only n as a parameter.

The probability of t within any limits is fully known if we know n, the sample size of the experiment.
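You can also verify this expression against R's built-in t density, dt(), with (n-1) degrees of freedom. A minimal sketch, using an arbitrary n, where f_t is my own function name:

n <- 10

# The derived density of t
f_t <- function(t, n) {
  (factorial((n - 2) / 2) / (sqrt(pi * (n - 1)) * factorial((n - 3) / 2))) *
    (1 + t^2 / (n - 1))^(-n / 2)
}

# Compare with R's dt() at a few points
t_vals <- c(-2, -1, 0, 1, 2)
cbind(derived = f_t(t_vals, n), built_in = dt(t_vals, df = n - 1))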

“Student” also derived the moments of this new distribution as

E[T] = 0

V[T] = \frac{n-1}{n-3}
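A quick simulation check on the variance (it is finite only for n > 3), with an arbitrary n:

n <- 10
var(rt(100000, df = n - 1))   # should be close to (n - 1) / (n - 3)
(n - 1) / (n - 3)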

The function is symmetric and resembles the standard normal distribution Z.

The t-distribution has heavier tails, i.e., it has more probability in the tails than the normal distribution. As the sample size increases (i.e., as n \to \infty), the t-distribution approaches Z.

You can look at the equation and check out how it converges to Z in the limit when n \to \infty.

These points are illustrated in this animation.

I am showing the standard Z in red. The grey curves are the t-distributions for different values of n. For small sample sizes, the t-distribution has fatter tails; as the sample size increases, there is little difference between T and Z.
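If you want to recreate a static version of this picture in R, here is a minimal sketch; the sample sizes shown are arbitrary choices:

x <- seq(-4, 4, by = 0.01)

# Standard normal Z in red; t-distributions for increasing n in grey
plot(x, dnorm(x), type = "l", col = "red", lwd = 2, xlab = "t", ylab = "density")
for (n in c(3, 5, 10, 30)) {
  lines(x, dt(x, df = n - 1), col = "grey")
}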

😫😫😫

I am pretty sure you do not have the energy to go any further this week. Neither do I.

So here is where we stand.

f(s^{2}) = \frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(s) = \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-2}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

and finally

The probability distribution of t is

f(t) = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}}

See you next week.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
