Lesson 81 – Riding with confidence, in R: Week 2

Monday, February 18, 2019

In which the students come back with samples. Jenny shows them sampling distributions, the real meaning of confidence interval, and a few other exciting things using their data. She sends them off on a second task.

Where is the Mean

Samantha, John, and Christina explore

Samantha: These are the London planetrees I have data for.

The 100(1-\alpha)\% confidence interval of the population mean (\mu) is the interval [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}].

I have a sample of 30 trees. For these 30 trees, the sample mean is 20.6 inches and the sample standard deviation is 9.06 inches. Based on this, the confidence interval of \mu is [20.6 - 2.05\frac{9.06}{\sqrt{30}}, 20.6 + 2.05\frac{9.06}{\sqrt{30}}].

[17.22 inches, 23.98 inches]
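As a quick check, Samantha's computation can be reproduced in R from her sample statistics (the variable names here are just illustrative):

```r
# Recompute Samantha's 95% confidence interval from her sample statistics
xbar <- 20.6   # sample mean (inches)
s <- 9.06      # sample standard deviation (inches)
n <- 30        # sample size

t_crit <- qt(0.975, df = n - 1)   # t-critical value, about 2.05

ci <- c(xbar - t_crit * s / sqrt(n), xbar + t_crit * s / sqrt(n))
round(ci, 2)   # 17.22 23.98
```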

John: I collected data for 30 trees too. Here is my data.

And here is the confidence interval I came up with: [17.68 inches, 24.05 inches]. For my data, the sample mean is 20.87 inches and the sample standard deviation is 8.52 inches.

Christina: I have a sample of 30 too. Here is where I took them from.

And, here is the 95% confidence interval of the mean: [19.9 inches, 24.9 inches].

Their sample statistics are different. Their confidence intervals are different. They begin to wonder: does their interval contain the truth, \mu? Whose interval does?

Jenny puts all the intervals in context

Jenny shares an empty data file with the students, with the following headers.

The students fill out the file with the data they collected last week. Her class has 40 students, so when the file comes back to Jenny, it has a total of 40*30 = 1200 rows.

Jenny uploaded this file here, for the analysis she is going to show the class.

Jenny: I am sure you all had much fun last week visiting different places in the city and collecting data for the analysis. I am hoping that all of you randomly selected the places to visit. Based on what Sam and Christina showed, it looks like the places are fairly spread out — so we would have gotten a random sample from the population of the trees.

Sam, John, and Christina each computed the 95% confidence interval of the true mean \mu based on their own sample statistics. They found their intervals to be different.

Let’s look at all the 40 confidence intervals. Now that I have the data from you all, I will show you how to do it in R. Essentially, I will take each of your samples, compute the sample statistics, and use them to compute the respective confidence intervals. We have 40 different samples, so we will get 40 different confidence intervals — each interval is a statement about the truth. If you remember what we have been discussing about the confidence intervals, for a 95% confidence level,

There is a 95% probability of selecting a sample whose confidence interval will contain the true value of \mu.

In other words, approximately 95% of the 40 intervals (about 38 of them) should contain the truth; 5% (about 2 of them) may not.

Let’s see who is covering the truth and who is not!

I want you all to work out the code with me. Here we go.

First, we will read the data file into R workspace.

# Reading the tree data file from the students # 
students_data = read.csv("students_data.csv",header=T)

Next, use the following lines to compute the confidence interval for each student. I am essentially repeating the computation of the sample mean, the sample standard deviation, and the confidence interval in a loop, 40 times, once for each student.

#40 students went to collect data from random locations in NYC # 
#Each student collects 30 samples #
#Each student develops the confidence intervals #

nstudents = 40
nsamples = 30

students = 1:nstudents

alpha = 0.95
t = alpha + (1-alpha)/2

sample_mean = matrix(NA,nrow=nstudents,ncol=1)
sample_sd = matrix(NA,nrow=nstudents,ncol=1)

ci_t = matrix(NA,nrow=nstudents,ncol=2)

for(i in 1:nstudents)
{
ind = which(students_data$student_index == i)

sample_data = students_data$tree_dbh[ind]

sample_mean[i] = mean(sample_data)
sample_sd[i] = sd(sample_data)

cit_lb = sample_mean[i] - qt(t,df=(nsamples-1))*(sample_sd[i]/sqrt(nsamples))
cit_ub = sample_mean[i] + qt(t,df=(nsamples-1))*(sample_sd[i]/sqrt(nsamples))

ci_t[i,] = c(cit_lb,cit_ub)
}

Now, let’s plot the 40 intervals to see them better. A picture is worth a thousand numbers. Use these lines. They explain themselves.

#Plot all the CIs #
stripchart(ci_t[1,],type="l",col="green",main="",xlab="Diameter (inches)",xlim=c(15,30),ylim=c(1,nstudents))
stripchart(sample_mean[1],add=T,col="green")

for (i in 2:nstudents)
{
stripchart(ci_t[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_mean[i],col="green", add=T, at = i)
}

The students are overjoyed looking at the pretty image.

John: This is neat. All our intervals are different, owing to each one of us bringing a different random sample. But it is still not clear which of our intervals contains the truth and which do not.

Jenny: We can get a close approximation based on the principle of consistency.

\displaystyle{\lim_{n\to\infty} P(|T_{n}(\theta)-\theta|>\epsilon)} = 0

As n approaches infinity, the sample estimate approaches the true parameter. We can take the full data record that you all brought, 1200 data points and compute the overall mean.

true_mu = mean(students_data$tree_dbh)

While it is not exactly the true value, based on the idea of consistency, we can assume that it is converging to the true one.

\mu = 21.3575 inches.

Look at this.

# Truth Convergence Plot #
true_mean = sample_mean
for (i in 1:nstudents)
{
ind = which(students_data$student_index <= i)
true_mean[i] = mean(students_data$tree_dbh[ind])
}

plot(true_mean,type="o",xlab="Increasing Sample",ylab="Convergence of the Sample Mean",font=2,font.lab=2,ylim=c(20,22))

My sixth sense and divine visions also tell me that this is converging to the truth!

Now, we can check which of these intervals contains the true mean \mu, 21.3575 inches.

You can use this simple code to check.

false_samples = students
for (i in 1:nstudents)
{
if( (ci_t[i,1] > true_mu) || (ci_t[i,2] < true_mu) )
{false_samples[i]=1} else
  {false_samples[i]=0}
}
false_ind = which(false_samples == 1)

It looks like number 4 and number 26 are not covering the truth. Harry and Miguel.

Let us point them out. Here, look at another pretty plot. The thick black vertical line is the truth, 21.3575 inches. The brown line is how the sample mean (\bar{x}) is converging to the truth (\mu) as we collect more samples. Harry and Miguel stand out. That is 2 out of 40; 5%.

# Plot all the CIs; now show the false samples #
stripchart(ci_t[1,],type="l",col="green",main="",xlab="Diameter (inches)",xlim=c(15,30),ylim=c(1,nstudents))
stripchart(sample_mean[1],add=T,col="green")

for (i in 2:nstudents)
{
stripchart(ci_t[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_mean[i],add=T,col="green",at = i)
}

abline(v=true_mu,lwd=3)
lines(true_mean,students,type="o",col="brown",cex=0.5)

for (i in 1:length(false_ind))
{
j = false_ind[i]

stripchart(ci_t[j,],type="l",col="red",lwd=3, main="", add = T, at = j)
stripchart(sample_mean[j],col="red", lwd=3, add=T, at = j)
}

text(16,4.6,"Harry")
text(26,26.6,"Miguel")

There is a 95% probability of selecting a sample whose confidence interval will contain the true value of \mu.

Miguel: What happens to the confidence interval when we compute it with larger sample sizes? I mean, what is the relationship between the confidence interval and sample size?

Jenny: Look at the equation for the confidence interval of the population mean; [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}].

Theoretically, as n tends to infinity, the interval shrinks to 0 since the sample mean converges to the true mean. So, as n gets larger, the interval gets smaller. Let me show you how it works. Take our data, and, just as we computed the sample mean for increasing sample size to see the convergence, compute the confidence interval also at each step.

# How does CI change with more sample size
sample_mean_update = matrix(NA,nrow=nstudents,ncol=1)
sample_sd_update = matrix(NA,nrow=nstudents,ncol=1)

ci_t_update = matrix(NA,nrow=nstudents,ncol=2)

for(i in 1:nstudents)
{
ind = which(students_data$student_index <= i)
ns = length(ind)

sample_data = students_data$tree_dbh[ind]

sample_mean_update[i] = mean(sample_data)
sample_sd_update[i] = sd(sample_data)

cit_lb = sample_mean_update[i] - qt(t,df=(ns-1))*(sample_sd_update[i]/sqrt(ns))
cit_ub = sample_mean_update[i] + qt(t,df=(ns-1))*(sample_sd_update[i]/sqrt(ns))

ci_t_update[i,] = c(cit_lb,cit_ub)
}

Now, we can plot these confidence intervals against increasing sample size to see what happens. Just remember that each interval is based on a larger sample than the previous one. The first interval uses 30 data points, the second 60, the third 90, the fourth 120, and so on.

# Plot all the CIs #
stripchart(ci_t_update[1,],vertical=T,type="l",col="green",main="",ylab="Diameter (inches)",ylim=c(16,25),xlim=c(1,nstudents),xlab="Increasing Sample Size")
stripchart(sample_mean_update[1],vertical=T,add=T,col="green")

for (i in 2:nstudents)
{
stripchart(ci_t_update[i,],type="l",col="green",main="",add=T,at = i,vertical=T)
stripchart(sample_mean_update[i],col="green", add=T, at = i,vertical=T)
}

lines(students,true_mean,type="o",col="brown",cex=0.5)

See how the interval gets smaller and smaller as the sample size increases. If we keep collecting data, the interval eventually vanishes, as we approach the entire population.

T or Z

Christina: If we have a large sample, why can’t we use [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}] as the 100(1-\alpha)\% confidence interval for the true mean \mu? For a large sample size, we can substitute s, the sample standard deviation of the large sample, for \sigma.

Jenny: Yes, we can do that. As n increases, the difference between T and Z is minimal. However, it is better to use t-distribution for confidence intervals of \mu, so we don’t fall into traps. I will show those traps next week.
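To see how small that difference is, compare the two critical values directly in R (a quick sketch; the degrees of freedom shown are illustrative):

```r
# t-critical vs z-critical values at the 95% confidence level
qnorm(0.975)         # 1.959964
qt(0.975, df = 29)   # 2.04523  -- a noticeably wider interval for n = 30
qt(0.975, df = 999)  # about 1.96 -- nearly identical for n = 1000
```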

Where is the Standard Deviation and Proportion

Jenny: Can you follow the same logic and develop the confidence intervals for \sigma? How many of you will cover the truth, how many of you will not?

Again, approximately 95% of the intervals will cover the true \sigma, i.e., roughly 38 of your intervals will contain the truth and 2 will not. Check this out.

alpha = 0.95
u = alpha + (1-alpha)/2
l = 1 - u

sample_var = matrix(NA,nrow=nstudents,ncol=1)
ci_sd = matrix(NA,nrow=nstudents,ncol=2)

for(i in 1:nstudents)
{
ind = which(students_data$student_index == i)
ns = length(ind)

sample_data = students_data$tree_dbh[ind]

sample_var[i] = var(sample_data)

chi_low = ((ns-1)*sample_var[i])/qchisq(u,df=(ns-1))
chi_up = ((ns-1)*sample_var[i])/qchisq(l,df=(ns-1))

ci_sd[i,] = c(sqrt(chi_low),sqrt(chi_up))
}

true_sd = sqrt((sum((students_data$tree_dbh-true_mu)^2))/length(students_data$tree_dbh))

# False samples #
false_samples_sd = students
for (i in 1:nstudents)
{
if( (ci_sd[i,1] > true_sd) || (ci_sd[i,2] < true_sd) )
{false_samples_sd[i]=1} else
{false_samples_sd[i]=0}
}

false_sd_ind = which(false_samples_sd == 1)

# Plot all the CIs; now show the false samples #
stripchart(ci_sd[1,],type="l",col="green",main="CI on Standard Deviation",xlab="Diameter (inches)",xlim=c(5,15),ylim=c(1,nstudents))
stripchart(sample_sd[1],add=T,col="green")

for (i in 2:nstudents)
{
stripchart(ci_sd[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_sd[i],add=T,col="green",at = i)
}

abline(v=true_sd,lwd=3)

Jenny: Here, all the intervals happen to cover the truth; in the long run, 95% of such intervals will.

John: How about proportion?

Jenny: Yes, can you all develop the confidence intervals for the proportion of damaged trees? I want you to use bootstrap confidence intervals for p instead of the one based on the assumption of normal distribution.

So the students typed away a few lines of code to finally create this.

# Confidence Interval of p #
alpha = 0.95
u = alpha + (1-alpha)/2
l = 1 - u

nboot = 1000

sample_p = matrix(NA,nrow=nstudents,ncol=1)
ci_p_boot = matrix(NA,nrow=nstudents,ncol=2)

for(i in 1:nstudents)
{
ind = which(students_data$student_index == i)
ns = length(ind)

sample_data = students_data$brnch_ligh[ind]

sample_p[i] = length(which(sample_data=="Yes"))/ns

bootstrap_replicate_proportion = matrix(NA,nrow=nboot,ncol=1)
for (j in 1:nboot)
{
ind2 = 1:ns
bootstrap_ind = sample(ind2,ns,replace=T)
bootstrap_sample_damage = sample_data[bootstrap_ind]
bootstrap_replicate_proportion[j,1] = length(which(bootstrap_sample_damage=="Yes"))/ns
}

ci_p_boot[i,] = quantile(bootstrap_replicate_proportion,c(l,u))
}

true_p = length(which(students_data$brnch_ligh=="Yes"))/length(students_data$brnch_ligh)

# False samples #
false_samples_p = students
for (i in 1:nstudents)
{
if( (ci_p_boot[i,1] > true_p) || (ci_p_boot[i,2] < true_p) )
{false_samples_p[i]=1} else
{false_samples_p[i]=0}
}

false_p_ind = which(false_samples_p == 1)

# Plot all the CIs; now show the false samples #
stripchart(ci_p_boot[1,],type="l",col="green",main="CI on Proportion",xlab="Proportion",xlim=c(0,0.5),ylim=c(1,nstudents))
stripchart(sample_p[1],add=T,col="green")

for (i in 2:nstudents)
{
stripchart(ci_p_boot[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_p[i],add=T,col="green",at = i)
}

abline(v=true_p,lwd=3)

for (i in 1:length(false_p_ind))
{
j = false_p_ind[i]
stripchart(ci_p_boot[j,],type="l",col="red",lwd=3, main="", add = T, at = j)
stripchart(sample_p[j],col="red", lwd=3, add=T, at = j)
}

You can also check out their full code here.

As the class comes to an end, Samantha looks fulfilled. “Wow, that is quite a day,” she said. “We covered many things today, but more importantly the true meaning of the confidence intervals. There is a 95% probability of selecting a sample whose confidence interval will contain the true value.”

“Assuming our data is collected without any bias,” interjected Jenny. The students looked puzzled.

“This week, I want you to go out and collect the data again, but this time, divide yourselves into boroughs. Eight of you will belong to team Manhattan and will collect tree data from Manhattan only. You will still collect data randomly but will do it only from Manhattan. Eight of you are team Queens and you collect data randomly from Queens. The other three teams, team Bronx, team Brooklyn, and team Staten Island will do the same in their boroughs. We will again have 30 trees each, and a total of 1200 trees.”

So, the students go off on another mission. Some of them are still puzzled or rather have a “what difference does it make” face on them. Some of them are as usual excited to hug the trees again. We will all wait and see what they come up with. More fun with data.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 80 – Riding with confidence, in R: Week 1

Monday, February 11, 2019

In which Jenny shows her students how to compute and visualize confidence intervals in R, and sends them off on a task.

“Did you know that the symbol of the Department of Parks and Recreation is a cross between the leaf of the London plane and a maple leaf?” asked Jenny as she opened her high school data class. She recently volunteered to help 40 high school kids.

The obvious response from the kids is no response.

“It is,” she said. “More than 10% of the trees are London Planetrees. They grow up to 20 – 30 meters, with a trunk circumference of 3 meters.”

She projected a map on the big screen.

“These are the places I randomly selected to visit and record the diameter of the trunk of the London planetrees. You have this data in the handout. We will learn how to develop confidence intervals in R.”

“For your analysis in R, the data file can be downloaded from here.”

Let’s Start with the Mean

“Can someone tell me how we can get the confidence interval of the mean diameter?” she asked.

John raised his hand to answer. “Yes, John, go ahead,” said Jenny.

“The 100(1-\alpha)\% confidence interval for the true mean \mu is [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}]. If the sample size n is very large, we can substitute the sample standard deviation s in place of the unknown \sigma. However, for small sample sizes, the sample standard deviation s is itself subject to error. It may be far from the true value of \sigma. So it is preferable to use t-distribution for the confidence interval of the mean since \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} tends to a t-distribution with (n-1) degrees of freedom.

So I would say that the 100(1-\alpha)\% confidence interval of the population mean is the interval [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}].”

“Excellent. Let’s all compute the 95% confidence interval of the mean diameter of the trunk of the London planetrees from the sample I gave you. The 95% confidence interval of the true mean diameter (\mu) is \bar{x} - t_{0.025,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.025,(n-1)} \frac{s}{\sqrt{n}}.”

They all started calculating. Some of them pulled out t-tables to get the t-critical (t_{0.025,29}), i.e., the quantile from the t-distribution corresponding to the upper tail probability of 0.025, and 29 degrees of freedom.

Jenny stopped them. “Don’t compute the interval by hand. Let’s do it in R.

“Read the data file first. I am assuming that you all have the file in the same folder that you set the working directory to.”

# Read the data file -- Jenny's sample data #
jenny_sample_data = read.csv("jenny_data.csv",header=T)
nsamples = nrow(jenny_sample_data)

“Did you notice that the file has seven columns? The diameter of the tree trunk is recorded in the first column, tree_dbh. I am sure you figured out what the other columns are, except maybe the column that reads brnch_ligh that has Yes or No as the inputs. We will get to that in a bit.”

“Let’s compute the sample mean (\bar{x}), sample variance (s^{2}), and the sample standard deviation (s).”

#statistics of the sample #
jenny_samplemu = mean(jenny_sample_data$tree_dbh)
jenny_samplevar = var(jenny_sample_data$tree_dbh)
jenny_samplesd = sd(jenny_sample_data$tree_dbh)

“What you were about to look up from the t-table can be obtained using this command in R.”

qt(0.975,df=29)
[1] 2.04523

“The t-critical value is 2.04523.”

“We can write a few lines to compute the confidence interval for any \alpha level, and show them graphically.”

# Confidence interval of the mean using t-distribution #
alpha = 0.95 # 95% confidence interval
t = alpha + (1-alpha)/2
n = nsamples

cit_lb = jenny_samplemu - qt(t,df=(n-1))*((jenny_samplesd/sqrt(n)))
cit_ub = jenny_samplemu + qt(t,df=(n-1))*((jenny_samplesd/sqrt(n)))

ci_t = c(cit_lb,cit_ub)

stripchart(ci_t,type="l",lwd=3,col="green",main="CI on Mean",xlab="Diameter (inches)")
stripchart(jenny_samplemu,add=T,lwd=3,col="green")

“The sample mean (\bar{x}) is 22.433 inches and the interval extends out to 19.67 inches on the left and 25.20 inches on the right.”

“There is a 95% probability of selecting this sample for which the confidence interval will contain the true value of \mu.”

The kids looked happy seeing the image on their screen.

Next, let’s explore the Variance and the Standard Deviation

“Do you remember the equation for the confidence interval of the variance and the standard deviation?” Jenny asked.

Samantha wanted to answer.

“The 100(1-\alpha)\% confidence interval of the population variance \sigma^{2} is the interval [\frac{(n-1)s^{2}}{\chi_{u,n-1}}, \frac{(n-1)s^{2}}{\chi_{l,n-1}}]. We can get the square roots of the confidence limits to get the confidence interval on the true standard deviation,” she said.

“The interval [\sqrt{\frac{(n-1)s^{2}}{\chi_{u,n-1}}}, \sqrt{\frac{(n-1)s^{2}}{\chi_{l,n-1}}}] is called the 100(1-\alpha)\% confidence interval of the population standard deviation \sigma.”

Jenny was pleased. She has an outstanding cohort. They always come prepared.

“Yes. In our case, \frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with 29 degrees of freedom. The lower and upper critical values at the 95% confidence interval \chi_{l,29} and \chi_{u,29} can be obtained from the Chi-square table, or, as you guessed, can be computed in R using a simple command. Try these,” she said.

qchisq(0.975,df=29)
[1] 45.72229
qchisq(0.025,df=29)
[1] 16.04707

“Of course, we can automate this and make nice graphics for the intervals. Use this code.”

# Confidence interval of the variance and standard deviation using Chisq-distribution #
u = alpha + (1-alpha)/2
l = 1 - u

chi_low = ((n-1)*jenny_samplevar)/qchisq(u,df=(n-1))
chi_up = ((n-1)*jenny_samplevar)/qchisq(l,df=(n-1))

ci_var = c(chi_low,chi_up)
ci_sd = c(sqrt(chi_low),sqrt(chi_up))

stripchart(ci_var,type="l",lwd=3,col="green",main="CI on Variance",xlab="Diameter^2 (inches^2)")
stripchart(jenny_samplevar,add=T,lwd=3,col="green")

stripchart(ci_sd,type="l",lwd=3,col="green",main="CI on Standard Deviation",xlab="Diameter (inches)")
stripchart(jenny_samplesd,add=T,lwd=3,col="green")

“The sample variance is 54.74 inches^{2}, with a lower bound of 34.72 inches^{2} and an upper bound of 98.92 inches^{2}. Notice how the interval is not symmetric. Do you remember why?”

“The sample standard deviation is 7.40 inches. The 95% lower and upper bounds are 5.90, and 9.95 inches respectively.”

The kids spent a few minutes typing up the lines of code to get the intervals and make the graphics.

Jenny continued the lesson.

“Now, look at the third column in the data file, brnch_ligh. When I measured the trunk circumference, I also noticed that, for some trees, there were branch problems caused by lights or wires. I recorded this as a “yes” or a “no” depending on whether or not there was a problem. Who would think that having nice colorful lights on the trees would be damaging 🙁

Can one of you count through the sample and tell me the estimate of the proportion of trees (\hat{p}) that are damaged?”

“0.1,” someone answered from the back.

“Great, now, that invisible person, can you tell me how we can compute the confidence interval of the true proportion?”

Christina it was. She answered, “the 100(1-\alpha)% confidence interval for the true proportion p is [\hat{p} - Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}], assuming that the estimate \hat{p} can be approximated by a normal distribution for a reasonably large sample size.”

“And, here is how you do that in R,” added Jenny, as she typed up these lines.

# Confidence interval of the proportion using z-distribution # 
jenny_sampleproportion = length(which(jenny_sample_data$brnch_ligh=="Yes"))/nsamples

z = alpha + (1-alpha)/2

p_low = jenny_sampleproportion - qnorm(z)*sqrt((jenny_sampleproportion*(1-jenny_sampleproportion))/(n))
p_up = jenny_sampleproportion + qnorm(z)*sqrt((jenny_sampleproportion*(1-jenny_sampleproportion))/(n))

ci_p = c(p_low,p_up)

stripchart(ci_p,type="l",lwd=3,col="green",main="CI on Proportion",xlab="Damage Proportion")
stripchart(jenny_sampleproportion,add=T,lwd=3,col="green")

“The sample estimated damage proportion (\hat{p}) is 0.1 since there are 3 damaged trees in a sample of 30 trees. The lower bound and the upper bound of the 95% confidence interval are -0.007 and 0.207, symmetric around 0.1,” Jenny interpreted as she looked at the plot and then at the kids.

Some students were shaking their head in disagreement. They don’t like the fact that the confidence interval produced a negative value. What could have gone wrong?

Christina jumped up at once. “Didn’t we learn last week that when p is near the boundaries, the sampling distribution exhibits skew and is not symmetric? Then it is not appropriate to approximate it with a normal distribution, even for large sample sizes. Should we perhaps develop the sampling distribution and the confidence intervals of the damage proportion using the idea of the Bootstrap?”

“That is an astute observation Christina,” said Jenny. “Yes, since we have an estimate close to 0, it is better to develop the bootstrap confidence intervals. That will eliminate the errors induced due to inappropriate assumptions,” she added.

Let’s do a Bootstrap in R

“The basis for the bootstrap is that the sample data of 30 trees can be used to approximate the probability distribution function of the population. By putting a probability of 1/n on each data point, we use the discrete empirical distribution \hat{f} as an approximation of the population distribution f. It is easy enough to think of drawing numbers with replacement from these 30 numbers. Since each value is equally likely, the bootstrap sample will consist of numbers from the original data, some may appear more than one time, and some may not appear at all in a random sample.” Jenny explained the core concept of the bootstrap once again before showing them the code to do it.

# Bootstrap confidence intervals ## 
nboot = 1000
bootstrap_replicate_proportion = matrix(NA,nrow=nboot,ncol=1)

bootstrap_replicate_mean = matrix(NA,nrow=nboot,ncol=1)
bootstrap_replicate_var = matrix(NA,nrow=nboot,ncol=1)
bootstrap_replicate_sd = matrix(NA,nrow=nboot,ncol=1)

for (i in 1:nboot)
{
ind = 1:nsamples
bootstrap_ind = sample(ind,nsamples,replace=T)

bootstrap_sample_damage = jenny_sample_data$brnch_ligh[bootstrap_ind]

bootstrap_replicate_proportion[i,1] = length(which(bootstrap_sample_damage=="Yes"))/nsamples

bootstrap_sample_diam = jenny_sample_data$tree_dbh[bootstrap_ind]

bootstrap_replicate_mean[i,1] = mean(bootstrap_sample_diam)
bootstrap_replicate_var[i,1] = var(bootstrap_sample_diam)
bootstrap_replicate_sd[i,1] = sd(bootstrap_sample_diam)
}

“Here, I am drawing the bootstrap sample 1000 times, and, for each bootstrap sample, I am computing the proportion, the mean, the variance, and the standard deviation. So, in the end, I will have 1000 replicates of the damage proportion, the sample mean, sample variance and standard deviation — our sampling distributions.

Drawing a bootstrap in R is very simple. Just use the “sample” command.

ind = 1:nsamples
sample(ind,nsamples,replace=T)

I first create an indicator vector that has numbers 1 to 30. From this vector, I draw samples with replacement to get numbers from 1 to 30; some may appear more than once, some may not appear, depending on the sample. These are the trees that we selected as part of the bootstrap sample. From these trees, we take the diameter and whether or not it is damaged due to lights. The rest of the lines in the loop are just to compute the statistics from the bootstrap sample.” Jenny clearly explained her lines.

“Now, let’s plot the distribution of the damage proportion derived from the bootstrap samples. Type these lines.”

# Distribution of the sample proportion #
hist(bootstrap_replicate_proportion,main="Bootstrap Replicates of the Damage Proportion",font=2,font.lab=2,xlab="Sample Proportion")

ci_p_boot = quantile(bootstrap_replicate_proportion,c(0.025,0.975))
abline(v=ci_p_boot,col="red")

“The 95% lower and the upper confidence limits are the 2.5th and the 97.5th percentiles of the sampling distribution. We can show them as red lines on the distribution.”

“Compare this to the previous confidence interval computed by assuming a normal distribution. We can see that the sampling distribution of the proportion is skewed and the interval is asymmetric. It is also not producing any negative values,” said Jenny as she showed a few more lines of tricks.

# compare with previous CI # 
stripchart(ci_p_boot,type="l",lwd=3,col="black",main="CI on Proportion",xlab="Damage Proportion")
stripchart(jenny_sampleproportion,add=T,lwd=3,col="black")
text(0.17,1.1, "Bootstrap Confidence Interval")

stripchart(ci_p,type="l",add=T,at=0.7,lwd=3,col="green",main="CI on Proportion",xlab="Damage Proportion")
stripchart(jenny_sampleproportion,at=0.7,add=T,lwd=3,col="green")

“I have a few more things in here to compute the bootstrap confidence interval of the mean, variance, and the standard deviation. You can check them out for yourself.

Here is the full code.

Jenny was preparing to end the class.

“Can we know the true mean, variance, and the proportion?” someone asked.

Jenny was happy to hear this. She had planned an assignment for them exactly for this.

“From this sample alone, we cannot say. We can, however, say that there is a 95% probability of selecting this sample for which the confidence interval will contain the true value of \mu, \sigma^{2} and p. But, I want you to understand it by doing it yourself.

So, here’s the deal. This week, I want you to go out to the city and collect a sample of 30 trees. For each tree, I want you to measure the diameter and whether or not it is damaged. Each one of you will randomly select 30 locations in the city and bring back a sample. Make sure you more or less cover all the boroughs. In the end, we will have 40 different samples, each with 30 data points. We will have much fun analyzing that data, answering your questions, and getting a deeper understanding of the confidence intervals, in R.” With that, Jenny ended her session.

So the kids left with excitement, to spend their spare time during the week collecting data for the trees in the City. You may encounter them this week. They might be hugging a tree or two. Don’t mistake them for some tree-hugging hippies. They are teaching you how to be confident with data.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 79 – Pull yourself up by your bootstraps

Over the past several lessons we have been learning about estimates, standard error of the estimates and confidence interval of the estimates.

We have been using the ‘sample’ to estimate the true value of the parameter. What we estimate from the sample gives us, in some sense, the closest answer to the true unknown population parameter.

For example, the mean \bar{x}, variance s^{2}, or proportion \hat{p} computed from the sample data are good guesses (estimates or estimators) of the mean \mu, variance \sigma^{2} and proportion p of the population.

We also know that when we think of an estimate, we think of an interval or a probability distribution, instead of a point value. The truth may be in this interval if we have good representative samples, i.e., if the sample distribution is similar to the population distribution.

Assumptions or Approximations

In this inferential journey, to compute the standard error or to derive the confidence interval of the estimates, we have been making some assumptions and approximations that are deemed reasonable.

For example, it is reasonable to assume a normal distribution for the sample mean \bar{x}.

\bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}})

The sample mean is an unbiased estimate of the true mean, so the expected value of the sample mean is equal to the truth. E[\bar{x}]=\mu.

The standard deviation of the sample mean, or the standard error of the estimate is \frac{\sigma}{\sqrt{n}}.
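A small simulation can verify this relationship (a sketch with made-up values for \mu and \sigma, not the tree data):

```r
# The spread of many sample means matches sigma/sqrt(n)
set.seed(1)
mu <- 21; sigma <- 9; n <- 30

# Draw 10000 samples of size n and record each sample mean
sample_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

sd(sample_means)   # empirical standard error of the sample mean
sigma / sqrt(n)    # theoretical standard error, about 1.64
```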

This visual should be handy.

To derive the confidence interval of the variance and standard deviation, we assumed that \frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

f(\chi) = \frac{\frac{1}{2}(\frac{\chi}{2})^{\frac{n-1}{2}-1}e^{-\frac{\chi}{2}}}{(\frac{n-1}{2}-1)!}, where \chi = \frac{(n-1)s^{2}}{\sigma^{2}}

Depending on the degrees of freedom, the distribution of \frac{(n-1)s^{2}}{\sigma^{2}} looks like this.

Most recently, we assumed that the estimate for proportion \hat{p} can be approximated by a normal distribution.

\hat{p} \sim N(p, \frac{p(1-p)}{n})

We derived the confidence interval of the population proportion as [\hat{p} - Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}], based on this assumption.

Let’s examine this assumption once again.

In a sample of size n, proportion can be estimated as \hat{p} = \frac{S_{n}}{n}, where S_{n} is the number of favorable instances for the thing we are measuring. \hat{p} can be approximated to a normal distribution since S_{n} can be approximated to a normal distribution.

If we take Bernoulli random variables (0,1) for X_{1}, X_{2}, X_{3}, …, X_{n}, S_{n} = X_{1} + X_{2} + X_{3} + … + X_{n}, the number of successes, follows a Binomial distribution f(x) = \frac{n!}{(n-x)!x!}p^{x}(1-p)^{n-x}.

For a large enough sample size n, the distribution function of S_{n} can be well-approximated by the normal distribution.

Let’s do some experiments and see if this is reasonable.

Look at this animation. I am showing the Binomial probability distribution function for p = 0.5 while n increases from 10 to 100.

It looks like an approximation to a normal distribution is very reasonable.

Now, look at these two animations that show the Binomial probability function for p = 0.1 and p = 0.95, i.e., when p is near the boundaries.

Clearly, the distributions exhibit skew and are not symmetric. An approximation to the normal distribution is not appropriate even for large values of n, i.e., a big sample.
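The animations are not reproduced here, but we can check the boundary effect numerically. A minimal Python sketch (the function names are my own) computes the Binomial probabilities from the formula above and the skewness of the Binomial distribution, (1-2p)/\sqrt{np(1-p)}, which is zero at p = 0.5 but stays well away from zero near the boundaries even for n = 100.

```python
import math

def binom_pmf(n, p, x):
    # f(x) = n! / ((n-x)! x!) * p^x * (1-p)^(n-x)
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_skewness(n, p):
    # skewness of a Binomial(n, p): (1 - 2p) / sqrt(n p (1 - p))
    return (1 - 2 * p) / math.sqrt(n * p * (1 - p))

print(binom_skewness(100, 0.5))    # symmetric: normal approximation is reasonable
print(binom_skewness(100, 0.95))   # still skewed, even for a big sample
```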

How then can we be confident about the standard error and the confidence intervals?

For that matter, how can we derive the standard error or the intervals of a parameter whose limiting form is not known, or mathematically very complicated?

Enter the Bootstrap

Bradley Efron invented a computer-based method, the bootstrap, for estimating the standard error and the confidence intervals of parameters. There is no need for any theoretical calculations or underlying assumptions regarding the mathematical structure or the type of the distribution of the parameter. Instead, bootstrap samples from the data are used.

What is a bootstrap sample?

Suppose we have a data sample x_{1}, x_{2}, … , x_{n}. A bootstrap sample is a random sample of size n drawn with replacement from these n data points.

Imagine we have the following data: 28.4, 28.6, 27.5, 28.7, 26.7, 26.3 and 27.7 as the concentration of Ozone measured in seven locations in New York City.

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these seven data points is 1/7, we can randomly draw seven numbers from these seven values.

Imagine you are playing a game of Bingo and these seven numbers are chips in your rolling machine. The only difference is that each time you draw a number, you record it and put it back in the roller, until you have drawn seven numbers. Sample with replacement.

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (28.4, 28.6, 27.5, 28.7, 26.7, 26.3 and 27.7); some may appear more than once, and some may not appear at all in a given random sample.

I played this using the roller. Here is a bootstrap sample from the original numbers.

As you can see, 28.4 appeared once; 28.6, 27.5 and 28.7 did not appear; 26.3 appeared twice; and 27.7 appeared three times.

Here are two more bootstrap samples like that.
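The Bingo roller can be mimicked in a few lines of code. The lesson will do this with R functions later; here is a rough Python sketch (the variable names and seed are my own):

```python
import random

# Ozone concentrations measured at seven locations in New York City
ozone = [28.4, 28.6, 27.5, 28.7, 26.7, 26.3, 27.7]

random.seed(1)  # fixed seed so the draw is reproducible
# sample with replacement: n draws, each of the n values equally likely (1/7)
bootstrap_sample = random.choices(ozone, k=len(ozone))
print(bootstrap_sample)
```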

Basis

The basis for bootstrap samples is that the sample data can be used to approximate the probability distribution function of the population. As you saw before, by putting a probability of 1/n on each data point, we use the discrete empirical distribution \hat{f} as an approximation of the population distribution f.

Take a very simple example of rolling two dice in the game of Monopoly. The true probability distribution f of the count (dice 1 + dice 2) is based on the fact that there are 11 possible outcomes (2 through 12), and the likelihood of each outcome is the ratio of the number of ways we can get that count to 36. An outcome of 2 can only be achieved with (1,1). Hence the probability of getting 2 is 1/36.

Suppose we roll the dice a hundred times and record the total count, we can use the observed frequencies of the outcomes from this sample data to approximate the actual probability distribution.

Look at these 100 counts as outcomes of rolling two dice 100 times.

The frequency plot shown in black lines closely approximates the true frequency shown in red.
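The same comparison can be sketched in Python: simulate 100 rolls and put the observed frequencies next to the true probabilities. (The counts below come from my own simulated rolls, not the 100 counts shown in the lesson.)

```python
import random
from collections import Counter

random.seed(7)
rolls = [random.randint(1, 6) + random.randint(1, 6) for _ in range(100)]
observed = Counter(rolls)

# true probabilities: the number of ways to get each count, out of 36
ways = {total: 6 - abs(total - 7) for total in range(2, 13)}

for total in range(2, 13):
    print(total, observed[total] / 100, round(ways[total] / 36, 3))
```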

The empirical distribution \hat{f} is the proportion of times each value in the data sample x_{1}, x_{2}, …, x_{n} occurs. The observed frequency \hat{f} is a sufficient statistic for the true distribution f with an assumption that the data have been generated by random sampling from the true distribution f. All the information of the true distribution f is contained in the empirical distribution \hat{f}.

An unknown population distribution f has produced the observed data x_{1}, x_{2}, …, x_{n}. We can use the observed data to approximate f by its empirical distribution \hat{f} and then use the empirical distribution to generate bootstrap replicates of the data. Since f generated x, \hat{f} can be used to generate the bootstrap samples.

f has given x_{1}, x_{2}, …, x_{n}, which can be used to estimate \hat{f}, which in turn will be used to generate a bootstrap sample.

This is the basis.

Bootstrap Replicates

Once we generate enough bootstrap samples, we can use the estimators (formula to estimate the parameter) on these samples. For example, if we want to represent the true population mean \mu, we can apply the equation for the sample mean \bar{x} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} on each of these bootstrap samples to generate bootstrap replicates of the mean.

If we want to represent the true population variance using an interval, we can apply s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2} on these bootstrap samples to generate replicates of the variance.

Likewise, if we want an interval for the true proportion, we apply \hat{p} = \frac{S_{n}}{n} on the bootstrap samples to get replicates of the proportion.

Each bootstrap sample will produce a replicate of the parameter. Efron prescribes anywhere from 25 to 200 bootstrap replications for a good approximation of the limiting distribution of the estimate. As the number of bootstrap replicates approaches infinity, the standard error as measured by the standard deviation of these replicates will approach the true standard error.

Let’s look at the bootstrap replicates of the sample mean and the sample standard deviation for the Ozone data for which we used the Bingo machine to generate the bootstrap samples. In a later coding lesson, we will learn how to do it using simple functions in RStudio.

For bootstrap sample 1, the sample mean is 27.26 and the sample standard deviation is 0.82.

For bootstrap sample 2, the sample mean is 27.61 and the sample standard deviation is 0.708.

I do this 200 times. Here is what the distribution of the sample mean (\bar{x}) obtained from 200 bootstrap replicates looks like.

Here is the distribution of the sample standard deviation.

Like this, we can develop the intervals of any type of parameters by applying the relevant estimator on the bootstrap sample.

Bootstrap confidence interval

Finally, we can use the percentiles of the bootstrap replicates as the confidence limits of the parameter.

Take a 90% confidence interval for instance. From the bootstrap replicates, we can say that there is a 90% probability that the true mean \mu will be between \bar{x}_{[5]} and \bar{x}_{[95]}, the 5th and the 95th percentiles of the bootstrap replicates.

P(\bar{x}_{[5]} \le \mu \le \bar{x}_{[95]}) = 0.90

For the Ozone example, the 90% confidence interval of the true mean is [27.114, 28.286] and the 90% confidence interval of the true standard deviation is [0.531, 1.087].

Look at these plots.

We can define a 100(1-\alpha)% bootstrap confidence interval for the true mean \mu as [l, u] = [\bar{x}_{[\frac{\alpha}{2}]}, \bar{x}_{[1-\frac{\alpha}{2}]}].
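Putting the pieces together, here is a short Python sketch of the whole procedure: generate 200 bootstrap samples, compute the replicate means, and take the percentiles as the confidence limits. (The seed and the simple percentile indexing are my own choices, so the numbers will differ slightly from the Bingo-machine results above.)

```python
import random
import statistics

ozone = [28.4, 28.6, 27.5, 28.7, 26.7, 26.3, 27.7]
random.seed(42)

B = 200  # number of bootstrap replications
means = sorted(
    statistics.mean(random.choices(ozone, k=len(ozone))) for _ in range(B)
)

# percentile method: the 5th and 95th percentiles give a 90% interval
lower = means[int(0.05 * B)]
upper = means[int(0.95 * B) - 1]
print(lower, upper)
```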

There are many more uses of the bootstrap, and we will have much fun with it in due course.

But for today, let’s end with Efron’s statement on what other names were suggested to him for his method.

“I also wish to thank the many friends who suggested names more colorful than Bootstrap, including Swiss Army Knife, Meat Axe, Swan-Dive, Jack-Rabbit, and my personal favorite, the Shotgun, which, to paraphrase Tukey, “can blow the head off any problem if the statistician can stand the resulting mess.””


Lesson 78 – To Err is Human: Beware of simplicity

Mumble used t-distribution to estimate the 99% confidence interval of the true vehicle speed on I95.

\bar{x} - t_{0.005,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.005,(n-1)} \frac{s}{\sqrt{n}}

As he mentioned to his boss, the sample mean (\bar{x}) is 65 mph, the sample standard deviation (s) is 6 mph. For a sample of 50 vehicles (n = 50) the t-critical value denoted using t_{\frac{\alpha}{2},(n-1)} for the 99% confidence interval is 2.679.

Using these values in the formula, we get

65 - 2.679\frac{6}{\sqrt{50}} \le \mu \le 65 + 2.679\frac{6}{\sqrt{50}}

62.73 mph \le \mu \le 67.27 mph

His boss immediately identified that the length of the confidence interval is 67.27 – 62.73 = 4.54 mph and that the margin of error is greater than 2 mph.

You know from last week’s lesson 77 that the margin of error e is half of the length of the confidence interval. Since \bar{x} is the estimate of the true mean \mu, the error in estimating \mu using \bar{x} can be defined as e=|\bar{x}-\mu|.

I am using z_{\frac{\alpha}{2}} and \sigma instead of t_{\frac{\alpha}{2},(n-1)} and s as Mumble will probably make the same simplified assumption that many estimators make; i.e., approximate the interval using normal distribution instead of t-distribution to estimate the sample size for a selected precision.

Half-length of the confidence interval is z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}. The margin of error e should be within this for a 100(1-\alpha)\% confidence interval.

With e = z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} as the bound, Mumble will solve this equation for n.

n = (z_{\frac{\alpha}{2}}\frac{\sigma}{e})^{2}

n is a function of the selected margin of error e, the standard deviation \sigma and the confidence interval 100(1-\alpha)\%.

For greater margins of error, the required sample size n is smaller for a selected confidence interval and standard deviation \sigma.

For larger variability in the data (i.e., larger values of \sigma), the required sample size n increases for a fixed margin of error and specified confidence interval.

For greater confidence levels (100(1-\alpha)\%), the required sample size increases for a fixed margin of error and standard deviation.

For a chosen level of reliability of the estimate, i.e., the chosen confidence level, if we know how precise we need the interval to be, i.e., if we know the margin of error, we can come up with the required sample size.

Now, since his boss gave him an acceptable margin of error of 1 mph, Mumble will use e = 1 in this equation.

He will make two main assumptions. As mentioned above, he will use z_{\frac{\alpha}{2}} instead of t_{\frac{\alpha}{2},(n-1)}. In other words, he will use a z-critical of 2.58 instead of t-critical of 2.68. He will use the same value of 6 mph that he found originally as the sample standard deviation for \sigma.

With these assumptions, his new sample size is 240 vehicles. He has to go and collect data from 190 more vehicles in order to keep the margin of error within 1 mph.

The main reason he made these assumptions is that the problem needs a trial and error solution if he goes with t-critical and s since s and t-critical are dependent on n.

Look at how his sample size will vary with different levels of error for three different values of standard deviation, 5 mph, 6 mph, and 7 mph, still using z-critical instead of t-critical.
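That figure can be reproduced numerically. Here is a small Python sketch of the sample-size formula n = (z_{\frac{\alpha}{2}}\frac{\sigma}{e})^{2} (the function name is mine; z = 2.58 for the 99% level, and n is rounded up to a whole vehicle):

```python
import math

def required_sample_size(sigma, e, z=2.58):
    # n = (z * sigma / e)^2, rounded up to the next whole vehicle
    return math.ceil((z * sigma / e) ** 2)

# required n at the 99% level for three standard deviations and margins of error
for sigma in (5, 6, 7):
    for e in (0.5, 1, 2):
        print(sigma, e, required_sample_size(sigma, e))
```

For Mumble's case, required_sample_size(6, 1) gives the 240 vehicles mentioned above.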

An increase in the sample size decreases the standard error. So by fixing the standard deviation at 6 mph, the value he obtained with 50 vehicles (a smaller sample size), he is overestimating the new required sample size. Of course, some error still remains here since the t-critical value is larger than the z-critical value. Moreover, if the data are skewed, with outliers or heavy tails, the estimate of the sample standard deviation will be inflated and the resulting n will be large. Now, if the cost of collecting more data is not an issue, Mumble will probably not be too worried about this level of error.

Nevertheless, Beware of the assumptions. A lot of simplification goes into it.

I hope there is a way to compute the confidence intervals of different parameters without making a lot of assumptions (z, t, \chi^{2}).


Lesson 76 – What is your confidence in polls?

Looking at these two recent polls, one wonders which one is truer and what confidence we can have in polls.

Of course, that is a rhetorical question. The real question is: if these polls are measuring proportions (or probabilities) of people agreeing with something, what is the confidence interval of the true proportion or probability?

For example, the first poll tells us that 16% of Americans overall said that they would like to permanently move to another country if they could. The second poll tells us that 59% of adults in the US believe that 2019 will be a year of full or increasing employment. These two proportions are estimated from a selected sample of approximately 1000 U.S. adults. If so, what would be the confidence interval for the true proportion of all Americans who would answer these questions?

In lessons 71 to 75, we learned how to derive the confidence interval of the true mean, true variance, and true standard deviation, i.e., the population mean and population variance and standard deviation.

There are many applications, as these example polls, that require estimation of proportions or the probability of occurrence of events. We already know that the maximum likelihood estimator of p, the probability of an event is \hat{p}=\frac{r}{n} where r is the number of successes (times the event happened) in a sample of size n. In other words, the probability can be estimated as the proportion of occurrence in a Bernoulli sequence.

If \hat{p} is the estimate of the proportion, with a few assumptions, we can derive the confidence interval of the true proportion p.

Let’s learn how to do this using a simple polling exercise.

Assume that we want to estimate through a poll, the proportion of people who want to move out of the U.S. It will not be possible to ask everyone whether or not they will move out. However, we can take a sample, i.e., select a subset of the population and ask them for their preference.

In a sample of size n, the preference of the people can be represented by Bernoulli random variables X_{1}, X_{2}, X_{3}, …, X_{n} where X_{i} = 1 if a person wants to move out and 0 otherwise. If S_{n} = X_{1} + X_{2} + X_{3} + … + X_{n}, the proportion of people who wish to move out can be estimated as \hat{p} = \frac{S_{n}}{n}.

By now, you must be familiar that \hat{p} is a random variable since the estimate can change with a change in the sample. What assumption can we make for the distribution function of this random variable?

Since S_{n} is the sum of n independent random variables, for a large enough sample size n, the distribution function of S_{n} can be well-approximated by the normal distribution. Further, since \hat{p} is a linear function of S_{n}, the random variable \hat{p} can also be assumed to be normally distributed.

\hat{p} \sim N(E[\hat{p}], V[\hat{p}])

We can standardize \hat{p} and relate it to the standard normal distribution Z.

Z = \frac{\hat{p}-E[\hat{p}]}{\sqrt{V[\hat{p}]}} \sim N(0,1)

Before we proceed to derive the confidence interval, we should first derive the expected value and the variance of \hat{p}.

Expected Value E[\hat{p}]

\hat{p} = \frac{1}{n}\sum_{i=1}^{n}X_{i}

E[\hat{p}] = E[\frac{1}{n}\sum_{i=1}^{n}X_{i}] = \frac{1}{n}\sum_{i=1}^{n}E[X_{i}]

Since X_{i} is a Bernoulli distribution, E[X_{i}]=1(p) + 0(1-p) = p

E[\hat{p}]  = \frac{1}{n}\sum_{i=1}^{n}p

E[\hat{p}]  = \frac{1}{n}np = p

Variance V[\hat{p}]

V[\hat{p}] = V[\frac{1}{n}\sum_{i=1}^{n}X_{i}]

V[\hat{p}] = \frac{1}{n^{2}}\sum_{i=1}^{n}V[X_{i}]

V[X_{i}] = E[X_{i}^{2}] - (E[X_{i}])^{2}

E[X_{i}^{2}] = 1^{2}(p) + 0^{2}(1-p) = p

V[X_{i}] = p - p^{2} = p(1-p)

So,

V[\hat{p}] = \frac{1}{n^{2}}\sum_{i=1}^{n}p(1-p)

V[\hat{p}] = \frac{1}{n^{2}}np(1-p)

V[\hat{p}] = \frac{p(1-p)}{n}

Confidence Interval of \hat{p}

Now, if we are interested in the 95% confidence interval of the true proportion p, we can use the standardized version of \hat{p} to say that there is a 95% probability that the standard normal variable \frac{\hat{p}-E[\hat{p}]}{\sqrt{V[\hat{p}]}} is between -1.96 and 1.96.

P(-1.96 \le \frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}} \le 1.96) = 0.95

We can rearrange this to obtain,

P(\hat{p} -1.96\sqrt{\frac{p(1-p)}{n}} \le p \le \hat{p} + 1.96\sqrt{\frac{p(1-p)}{n}}) = 0.95

We can use \hat{p} in place of p for the variance term.

This interval [\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}] is called the 95% confidence interval of the population proportion p. The interval itself is random since it is derived from \hat{p}. A different sample will have a different \hat{p} and hence a different interval or range.

There is a 95% probability that this random interval [\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}] contains the true value of p.

Put another way, if we use this method to estimate the confidence interval of p for a large number of samples we can expect that in about 95% of the samples the true value of p will be within the confidence interval obtained from the sample.

Let’s now compute the 95% confidence interval for the proportions we saw in the two polls. 16% of Americans overall said that they would like to permanently move to another country if they could. 59% of adults in the US believe that 2019 will be a year of full or increasing employment. These two proportions are estimated from a selected sample of approximately 1000 U.S. adults.

95% confidence interval for the proportion of people who want to move out of the U.S.

[\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}]

[0.16 - 1.96\sqrt{\frac{(0.16)(0.84)}{1000}}, 0.16 + 1.96\sqrt{\frac{(0.16)(0.84)}{1000}}]

[0.137 \le p \le 0.183]

is the 95% confidence interval of the true proportion.

95% confidence interval for the proportion of people that believe that 2019 will be a year of full or increasing employment.

[0.59 - 1.96\sqrt{\frac{(0.59)(0.41)}{1000}}, 0.59 + 1.96\sqrt{\frac{(0.59)(0.41)}{1000}}]

[0.56 \le p \le 0.62]

is the 95% confidence interval of the true proportion.

We can generalize this to any confidence level by defining a 100(1-\alpha)% confidence interval for the true proportion p as [\hat{p} - Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}]
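The interval computation itself is one line of arithmetic. Here is a minimal Python sketch (the function name is mine) that reproduces the two poll intervals above:

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    # [p_hat -/+ z * sqrt(p_hat * (1 - p_hat) / n)]
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - half, p_hat + half)

move_abroad = proportion_ci(0.16, 1000)   # about [0.137, 0.183]
employment = proportion_ci(0.59, 1000)    # about [0.56, 0.62]
print(move_abroad, employment)
```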

Keep in mind that the main assumption behind this is that the estimate \hat{p} can be approximated by a normal distribution for a reasonably large sample size.

How do we know what sample size is sufficient? In the first graphic that showed the polls, I highlighted the margin of error. Can you guess what that is?


Lesson 75 – Fiducia on the variance

This morning officer Jones is deployed to NY state highway 17A to monitor speeding vehicles. It's been a rather damp morning; his computer has recorded the speeds of only 20 vehicles so far.

He is nicely tucked away in a corner where jammers cannot find him. With no vehicles in sight, he started playing with the data.

He computes the average vehicle speed.

\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}

Mean vehicle speed \bar{x} = 50.10 miles per hour.

But officer Jones is a variance freak. He computes the sample variance and the sample standard deviation.

s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}

The sample variance is 5.9 mph^{2}, and the sample standard deviation is 2.43 mph.

“I wonder how large this deviation can get,” thought officer Jones.

Can we answer his question?

The answer lies in knowing the confidence interval on the variance.

We now know how to construct the confidence interval on the true mean based on the sample.

The interval [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}] is the 100(1-\alpha)\% confidence interval of the population mean \mu.

The basis for this is the idea that we can find an interval that has a certain probability of containing the truth.

In other words, a 100(1-\alpha)\% confidence interval for a parameter \theta is an interval [l, u] that has the property P(l \le \theta \le u) = 1-\alpha.

We can apply the same logic to the variance. Officer Jones is wondering about the true standard deviation \sigma or the true variance \sigma^{2}. He is wondering if he can derive the interval for the true variance given his limited data.

How can we derive the confidence interval on the variance?

Remember what we did for deriving the confidence interval on the mean. We investigated the limiting distribution of \bar{x}, the sample mean and used the probability statement P(l \le \theta \le u) = 1-\alpha on this distribution to derive the end-points, the upper and lower confidence limits.

Let’s do the same thing here.

What is the limiting distribution for sample variance?

.

.

.

Those who followed lesson 73 carefully will jump up to say that it is related to the Chi-square distribution.

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

Okay, no problem. We will do it again.

Let’s take the equation of the sample variance s^{2} and find a pattern in it.

s^{2} = \frac{1}{n-1} \sum(x_{i}-\bar{x})^{2}

Move the n-1 over to the left-hand side and do some algebra.

(n-1)s^{2} = \sum(x_{i}-\bar{x})^{2}

(n-1)s^{2} = \sum(x_{i} - \mu -\bar{x} + \mu)^{2}

(n-1)s^{2} = \sum((x_{i} - \mu) -(\bar{x} - \mu))^{2}

(n-1)s^{2} = \sum[(x_{i} - \mu)^{2} + (\bar{x} - \mu)^{2} -2(x_{i} - \mu)(\bar{x} - \mu)]

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + \sum (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)\sum(x_{i} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(\sum x_{i} - \sum \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(n\bar{x} - n \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)(\bar{x} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)^{2}

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2}

Let’s divide both sides of the equation by \sigma^{2}.

\frac{(n-1)s^{2}}{\sigma^{2}} = \frac{1}{\sigma^{2}}(\sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2})

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - \frac{n}{\sigma^{2}} (\bar{x} - \mu)^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - (\frac{\bar{x} - \mu}{\sigma/\sqrt{n}})^{2}

The right-hand side now is the sum of squared standard normal distributions.

\frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2} ; sum of squares of (n - 1) standard normal random variables.

As we learned in lesson 53, if there are n standard normal random variables, Z_{1}, Z_{2}, ..., Z_{n}, their sum of squares is a Chi-square distribution with n degrees of freedom.

Its probability density function is f(\chi)=\frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n}{2}-1)!} for \chi > 0 and 0 otherwise.

Since we have \frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n-1}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n-1}{2}-1)!}

Depending on the degrees of freedom, the distribution of \frac{(n-1)s^{2}}{\sigma^{2}} looks like this.

Smaller sample sizes imply lower degrees of freedom and the distribution will be highly skewed; asymmetric.

Larger sample sizes or higher degrees of freedom will tend the distribution to symmetry.
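A quick simulation supports this. The Python sketch below (the seed and trial count are my own choices) draws many samples of size 20 from a standard normal population and computes (n-1)s^{2}/\sigma^{2} for each; the average of the statistic should land close to the Chi-square mean of n - 1 = 19.

```python
import random

random.seed(3)
n, trials = 20, 2000
stats = []
for _ in range(trials):
    x = [random.gauss(0, 1) for _ in range(n)]          # population with sigma = 1
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)    # sample variance
    stats.append((n - 1) * s2)                          # (n-1)s^2 / sigma^2

print(sum(stats) / trials)   # should be near the Chi-square mean, n - 1 = 19
```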


We can now apply the probability equation and define a 100(1-\alpha)\% confidence interval for the true variance \sigma^{2}.

P(\chi_{l,n-1} \le \frac{(n-1)s^{2}}{\sigma^{2}} \le \chi_{u,n-1}) = 1-\alpha

\alpha is between 0 and 1. You know that a 95% confidence interval means \alpha = 0.05, and a 99% confidence interval means \alpha = 0.01.

\chi_{l,n-1} and \chi_{u,n-1} are the lower and upper critical values from the Chi-square distribution with n-1 degrees of freedom.

The main difference here is that the Chi-square distribution is not symmetric. We should calculate both the lower limit and the upper limit that correspond to a certain level of probability.

Take for example, a 95% confidence interval. We need \chi_{l,n-1} and \chi_{u,n-1}, the quantiles that yield a 2.5% probability in the right tail and 2.5% probability in the left tail.

P(\chi \le \chi_{l,n-1}) = 0.025

and

P(\chi > \chi_{u,n-1}) = 0.025

Going back to the probability equation
P(\chi_{l,n-1} \le \frac{(n-1)s^{2}}{\sigma^{2}} \le \chi_{u,n-1}) = 1-\alpha

With some rearrangement within the inequality, we get

P(\frac{(n-1)s^{2}}{\chi_{u,n-1}} \le \sigma^{2} \le \frac{(n-1)s^{2}}{\chi_{l,n-1}}) = 1-\alpha

The interval [\frac{(n-1)s^{2}}{\chi_{u,n-1}}, \frac{(n-1)s^{2}}{\chi_{l,n-1}}] is called the 100(1-\alpha)\% confidence interval of the population variance \sigma^{2}.

We can get the square roots of the confidence limits to get the confidence interval on the true standard deviation.

The interval [\sqrt{\frac{(n-1)s^{2}}{\chi_{u,n-1}}}, \sqrt{\frac{(n-1)s^{2}}{\chi_{l,n-1}}}] is called the 100(1-\alpha)\% confidence interval of the population standard deviation \sigma.

Let’s now go back to the data and construct the confidence interval on the variance and the standard deviation.

Officer Jones has a sample of 20 vehicles. n=20. The sample variance s^{2} is 5.9 and the sample standard deviation s is 2.43.

Since n = 20, \frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with 19 degrees of freedom. The lower and upper critical values at the 95% confidence interval \chi_{l,19} and \chi_{u,19} are 8.90 and 32.85.

You must be wondering about the tediousness in computing these quantiles each time there is a sample.

That problem is solved for us. Like the t-table, the upper tail probabilities of the Chi-square distribution for various degrees of freedom are tabulated in the Chi-square table.

You can get them from any statistics textbooks, or a simple internet search on “Chi-square table” will do. Here is an example.

The first column is the degrees of freedom. The subsequent columns are the computed values for P(\chi > \chi_{u}), the right tail probability.

For our example, say we are interested in the 95% confidence interval, we look under df = 19 and identify \chi^{2}_{0.975} = 8.90 as the lower limit that yields a probability of 2.5% on the left tail, and \chi^{2}_{0.025} = 32.85 as the upper limit that yields a probability of 2.5% on the right tail.

Substituting these back into the confidence interval equation, we get

P(\frac{19*5.90}{32.85} \le \sigma^{2} \le \frac{19*5.90}{8.90}) = 0.95

The 95% confidence interval for the true variance is [3.41, 12.59].

If we take the square root of these limits, we get the 95% confidence interval for the true standard deviation.

P(1.85 \le \sigma \le 3.54) = 0.95

The 95% confidence interval for the true standard deviation is [1.85, 3.54].
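The arithmetic officer Jones did can be sketched in a few lines of Python (the critical values are the table entries quoted above):

```python
n = 20
s2 = 5.9
# Chi-square critical values for 19 degrees of freedom, from the table
chi_lower, chi_upper = 8.90, 32.85   # chi^2_{0.975,19} and chi^2_{0.025,19}

var_lower = (n - 1) * s2 / chi_upper   # about 3.41
var_upper = (n - 1) * s2 / chi_lower   # about 12.6
sd_lower, sd_upper = var_lower ** 0.5, var_upper ** 0.5
print(var_lower, var_upper, sd_lower, sd_upper)
```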

So, to finally answer officer Jones's question: at the 95% confidence level, the deviation in vehicle speed can be as large as 3.54 mph and as low as 1.85 mph. The interval itself is random. As you know by now, the real interpretation is:

If we have a lot of different samples, and if we compute the 95% confidence intervals for these samples using the sample variance s^{2} and the critical limits from the Chi-square distribution, in the long run, 95% of these intervals will contain the true value of \sigma^{2}.

A few vehicles must have sneaked by in the short run.


Lesson 74 – Deriving confidence from t


What we know so far

The 100(1-\alpha)\% confidence interval for the true mean \mu is [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}]

If the sample size n is very large, we can substitute the sample standard deviation s in place of the unknown \sigma.

However, for small sample sizes, the sample standard deviation s is itself subject to error. In other words, it may be far from the true value of \sigma. Hence, we cannot assume that \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} will tend to a normal distribution, which was the basis for deriving the confidence intervals in the first place.

Last week’s mathematical excursion took us back in time, introduced us to “Student” and the roots of his famous contribution, the Student’s t-distribution.

“Student” derived the frequency distribution of \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} to be a t-distribution with (n-1) degrees of freedom. We paddled earnestly through a stream of functions to arrive at this.

f(t) = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}}

The probability of t within any limits is fully known if we know n, the sample size of the experiment. The function is symmetric with an expected value and variance of: E[T] = 0 and V[T] = \frac{n-1}{n-3}.

While the t-distribution resembles the standard normal distribution Z, it has heavier tails, i.e., it has more probability in the tails than the normal distribution. As the sample size increases (n \to \infty) the t-distribution approaches Z.


Can we derive the confidence interval from the t-distribution?

Let’s follow the same logic used to derive the confidence interval from the normal distribution. The only difference now will be that

\frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \sim t

Suppose we are interested in deriving the 95% confidence interval for the true mean \mu, we can use the probability rule

P(-t_{0.025,(n-1)} \le \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \le t_{0.025,(n-1)}) = 0.95

It is equivalent to saying that there is a 95% probability that the variable \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} is between -t_{0.025,(n-1)} and t_{0.025,(n-1)}, the critical value of t that leaves 2.5% probability in each tail.

t_{0.025,(n-1)} is like Z_{0.025}. While Z_{0.025} = 1.96, t_{0.025,(n-1)} will depend on the sample size n.

Notice I am using (n-1), the degrees of freedom in the subscript for t to denote the fact that the value will be different for a different sample size.

To generalize, we can define a 100(1-\alpha)\% confidence interval for the true mean \mu using the probability equation

P(-t_{\frac{\alpha}{2},(n-1)} \le \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \le t_{\frac{\alpha}{2},(n-1)}) = 1-\alpha

\alpha is between 0 and 1. For a 95% confidence interval, \alpha = 0.05, and for a 99% confidence interval, \alpha = 0.01. Like how the Z-critical value is denoted using Z_{\frac{\alpha}{2}}, the t-critical value can be denoted using t_{\frac{\alpha}{2},(n-1)}.

We can modify the inequality in the probability equation to arrive at the confidence interval.

P(-t_{\frac{\alpha}{2},(n-1)} \le \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \le t_{\frac{\alpha}{2},(n-1)}) = 1-\alpha

Multiplying throughout by \frac{s}{\sqrt{n}}, we get

P(-t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}} \le \bar{x}-\mu \le t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}) = 1-\alpha

Subtracting \bar{x} and multiplying by -1 throughout, we get

P(\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}) = 1-\alpha

This interval [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}] is called the 100(1-\alpha)\% confidence interval of the population mean.

As we discussed before in lesson 72, the interval itself is random since it is derived from \bar{x} and s. A different sample will have a different \bar{x} and s, and hence a different interval or range.
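The interval above can be sketched as a small R function. This is a minimal illustration (the sample below is simulated, not real beach or tree data); it mirrors what R's built-in t.test reports as conf.int.

```r
# 100(1 - alpha)% t-based confidence interval for the mean of a sample x
t_ci <- function(x, alpha = 0.05) {
  n <- length(x)
  se <- sd(x) / sqrt(n)                  # standard error of the mean
  tcrit <- qt(1 - alpha / 2, df = n - 1) # t-critical value
  c(lower = mean(x) - tcrit * se, upper = mean(x) + tcrit * se)
}

set.seed(2)
x <- rnorm(30, mean = 20, sd = 9)        # a hypothetical sample of 30 values
t_ci(x)
```

A different sample `x` gives a different interval, which is exactly the point: the interval itself is random.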


Solving Jenny’s problem

Let us develop the 95% confidence intervals of the mean water quality at the Rockaway beach. Jenny was using the data from the ROCKAWAY BEACH 95TH – 116TH. She identified 48 data points (n = 48) from 2005 to 2018 that exceeded the detection limit.

The sample mean \bar{x} is 7.9125 counts per 100 ml. The sample standard deviation s is 6.96 counts per 100 ml.

The 95% confidence interval of the true mean water quality (\mu) is

\bar{x} - t_{0.025,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.025,(n-1)} \frac{s}{\sqrt{n}}

How can we get the value of t_{0.025,(n-1)}, the critical value with an upper tail probability of 0.025, from the t-distribution with (n-1) degrees of freedom?

.

.

.

You must have started integrating the function f(t) = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}} into its cumulative distribution function F(t).

Save the effort. These values are already calculated and tabulated; the result is popularly known as the t-table. You can find it in any statistics textbook, or simply type “t-table” into any search engine. There may be slight differences in how the table is presented. Here is an example.

This table shows the right-sided t-distribution critical value t_{\frac{\alpha}{2},(n-1)}. Since the t-distribution is symmetric, the left-tail critical values are -t_{\frac{\alpha}{2},(n-1)}. You may have noticed that the last row indicates the confidence level 100(1-\alpha)\%.

Take, for instance, a 95% confidence interval: \frac{\alpha}{2}=0.025, so the upper tail probability p is 0.025. In the table, look in the sixth column under 0.025 and slide down to df = n-1. For instance, if we had a sample size of 10, the degrees of freedom are df = 9 and the t-critical value (t_{0.025,9}) is 2.262; like this:

Since our sample size is 48, the degrees of freedom df = 47.

In the sixth column under upper tail probability 0.025, we should slide down to df = 47. Since there is no value for df = 47, we should interpolate from the values 2.021 (df = 40) and 2.009 (df = 50).

The t-critical value for 95% confidence interval and df = 47 is 2.011. I got it from R. We will see how in an R lesson later.
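As a quick preview of that R computation: base R's qt function returns the critical value directly, and the linear interpolation from the table can be reproduced with approx.

```r
# t-critical value for a 95% confidence interval with df = 47
qt(0.975, df = 47)                                       # about 2.0117

# linear interpolation between the table entries for df = 40 and df = 50
approx(x = c(40, 50), y = c(2.021, 2.009), xout = 47)$y  # about 2.0126
```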

This table also provides Z_{\frac{\alpha}{2}} at the end. See z^{*}=1.96 for a 95% confidence interval.

Let’s compute the 95% confidence interval for the mean water quality.

\bar{x} - 2.011\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + 2.011\frac{s}{\sqrt{n}}

7.9125 - 2.011\frac{6.96}{\sqrt{48}} \le \mu \le 7.9125 + 2.011\frac{6.96}{\sqrt{48}}

5.892 \le \mu \le 9.933
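The same numbers can be reproduced in a few lines of R, using the sample statistics above.

```r
# 95% t-based confidence interval for the mean Enterococci count
xbar <- 7.9125   # sample mean (counts per 100 ml)
s    <- 6.96     # sample standard deviation
n    <- 48       # sample size

tcrit <- qt(0.975, df = n - 1)
round(xbar + c(-1, 1) * tcrit * s / sqrt(n), 3)   # 5.892 9.933
```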

As with the confidence interval derived from the normal distribution, if we take many different samples and compute a 95% confidence interval for each using its sample mean (\bar{x}), sample standard deviation (s), and the t-critical value from the t-distribution, then, in the long run, 95% of these intervals will contain the true value of \mu.

Here is how eight different 95% confidence intervals look relative to the truth. These eight intervals are constructed from samples taken at eight different locations. In the long run, 5% of such 95% confidence intervals will not contain the true mean.

I am also showing the confidence intervals derived from the normal distribution under the known-variance assumption (\sigma=9.95). They are shown in green.

Can you spot anything?

How do they compare?

What can we learn about the width of the intervals derived from the normal distribution (Z) and the t-distribution?

Is there anything that is related to the sample size?

Think about it until we come back with a lesson in R for confidence intervals.

There are other applications of the t-distribution that we will learn in due course of time.

Remember that you will cross a “t” whenever there is error.

I will end with these notes from Fisher in his paper “Student” written in 1939 in remembrance of W.S. Gosset.

“Five years, however, passed, without the writers in Biometrika, the journal in which he had published, showing any sign of appreciating the significance of his work. This weighty apathy must greatly have chilled his enthusiasm.”

“The fruition of his work was, therefore, greatly postponed by the lack of appreciation of others. It would not be too much to ascribe this to the increasing dissociation of theoretical statistics from the practical problems of scientific research.”

It is now 110 years since he published his famous work under the pseudonym “Student.” Suffice it to say that “Student” and his work will still be appreciated 110 years from now, and people will derive confidence from the “t.”

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 73 – Learning from “Student”

On the fifteenth day of July 2018, Jenny and Joe discussed the confidence interval for the mean of the population.

The 100(1-\alpha)% confidence interval for the true mean \mu is [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}].

\bar{x} is the mean of a random sample of size n. The assumption is that the sample is drawn from a population with a true mean \mu and true standard deviation \sigma.

The end-points [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}] are called the lower and the upper confidence limits.

While developing the confidence interval of the mean water quality at the beach with a sample of 48, Jenny pointed out that the value for the true standard deviation, \sigma is not known.

Joe collected a much larger sample (all data points available for the Rockaway beach) and computed an estimate for the true standard deviation. Based on the principle of consistency, he suggested that this estimate is close to the truth, so it can be used as \sigma.

Jenny rightly pointed out that we will not always have such a large sample. Most often, the data is limited.

Joe’s suggestion was to use sample standard deviation, s, i.e., the estimated standard deviation from the limited sample in place of \sigma. However, Jenny was concerned that this will introduce more error into the estimation of the intervals.

March 1908

This was a concern for W. S. Gosset (aka “Student”) in 1908. He points out that one way of dealing with this difficulty is to run the experiment many times, i.e., collect a large enough sample so that the standard deviation can be computed once for all and used for subsequent similar experiments.

He further points out that there are numerous experiments that cannot easily be repeated often. In the situation where a large sample cannot be obtained, one has to rely on the results from the small sample.

The standard deviation of the population (\sigma) is not known a priori.

The confidence interval Joe and Jenny derived is based on the conjecture that the distribution of the sample mean (\bar{x}) tends to a normal distribution. \bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}}), and if the sample size is large enough, it would be reasonable to substitute sample standard deviation s in place of the population standard deviation \sigma. But it is not clear how “large” the sample size should be.

The sample standard deviation can be computed as s=\sqrt{\frac{1}{n-1}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}}. While s is a perfectly good estimate of \sigma, it is not equal to \sigma.

“Student” pointed out that when we substitute s for \sigma, we cannot just assume that \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} will tend to a normal distribution just like Z = \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0,1).

He derived the frequency distribution of \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} in his landmark paper “The Probable Error of a Mean.”

This distribution came to be known as the Student’s t-distribution.

In his paper, he did not use the notation t though. He referred to it as quantity z, obtained by dividing the distance between the mean of a sample and the mean of the population by the standard deviation of the sample.

He derived the distribution of the sample variance and the sample standard deviation, showed that there is no dependence between s and \bar{x} and used this property to derive the joint distribution of \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}.

Today, we will go through the process of deriving the probability distributions for the sample variance s^{2}, sample standard deviation s and the quantity t.

t = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}

It is important that we know these distributions. They will be a recurring phenomenon from now on and we will be using them in many applications.

I am presenting these derivations using standard techniques. There may be simpler ways to derive them, but I found this step by step thought and derivation process enriching.

During this phase, I will refer back to “Student’s” paper several times. I will also use the explanations given by R. A. Fisher in his papers on “Student.”

You can follow along with these steps using a pen and paper, or you can just focus on the thought process and skip the derivations; either way, it is fun to learn from “Student.” Trust me.


t = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}

To know the distribution of t, we should know the distributions of \bar{x} and s.

We already know that \bar{x} tends to a normal distribution; \bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}}), but what is the limiting distribution of s, i.e., what is the probability distribution function f(s) of the sample standard deviation?

To know this, we should first know the limiting distribution of the sample variance s^{2}, from which we can derive the distribution of s.

What is the frequency distribution of the sample variance?

Expressing variance as the sum of squares of normal random variables.

Let’s take the equation of the sample variance s^{2} and see if there is a pattern in it.

s^{2} = \frac{1}{n-1} \sum(x_{i}-\bar{x})^{2}

Move the n-1 over to the left-hand side and do some algebra.

(n-1)s^{2} = \sum(x_{i}-\bar{x})^{2}

(n-1)s^{2} = \sum(x_{i} - \mu -\bar{x} + \mu)^{2}

(n-1)s^{2} = \sum((x_{i} - \mu) -(\bar{x} - \mu))^{2}

(n-1)s^{2} = \sum[(x_{i} - \mu)^{2} + (\bar{x} - \mu)^{2} -2(x_{i} - \mu)(\bar{x} - \mu)]

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + \sum (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)\sum(x_{i} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(\sum x_{i} - \sum \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(n\bar{x} - n \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)(\bar{x} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)^{2}

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2}

Let’s divide both sides of the equation by \sigma^{2}.

\frac{(n-1)s^{2}}{\sigma^{2}} = \frac{1}{\sigma^{2}}(\sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2})

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - \frac{n}{\sigma^{2}} (\bar{x} - \mu)^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - (\frac{\bar{x} - \mu}{\sigma/\sqrt{n}})^{2}

The right-hand side now looks like the sum of squared standard normal distributions.

\frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2}

In effect, this is the sum of squares of (n - 1) independent standard normal random variables; one degree of freedom is lost because the sample mean \bar{x} stands in for \mu.

Does that ring a bell? Sum of squares is the language of the Chi-square distribution. We learned this in lesson 53.

If there are n standard normal random variables, Z_{1}, Z_{2}, ..., Z_{n}, their sum of squares is a Chi-square distribution with n degrees of freedom. Its probability density function is f(\chi)=\frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n}{2}-1)!} for \chi > 0 and 0 otherwise.

Since we have \frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2},

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

\frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{n-1} with a probability distribution function

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n-1}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n-1}{2}-1)!}
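A simulation can make this concrete. In this sketch (the normal population parameters are arbitrary choices), the statistic \frac{(n-1)s^{2}}{\sigma^{2}} computed from repeated samples should have mean n-1 and variance 2(n-1), the moments of a Chi-square distribution with (n-1) degrees of freedom.

```r
# Repeatedly draw samples of size n from a normal population and
# compute (n-1)s^2/sigma^2 for each sample
set.seed(3)
n <- 10; sigma <- 2
stat <- replicate(50000, (n - 1) * var(rnorm(n, mean = 5, sd = sigma)) / sigma^2)

c(mean(stat), var(stat))   # close to n - 1 = 9 and 2(n - 1) = 18
```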

The frequency distribution of the sample variance.

It turns out that, with some modification, this equation is the frequency distribution f(s^{2}) of the sample variance.

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n-1}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n-1}{2}-1)!}

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}(\frac{(n-1)s^{2}}{2\sigma^{2}})^{\frac{n-3}{2}}}{(\frac{n-3}{2})!} e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(\frac{n-1}{\sigma^{2}})^{\frac{n-1}{2}}\frac{1}{\frac{n-1}{\sigma^{2}}}\frac{1}{2^{\frac{n-3}{2}}}(s^{2})^{\frac{n-3}{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{1}{\frac{n-1}{\sigma^{2}}}\frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{1}{\frac{n-1}{\sigma^{2}}}[\frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}]

The above equation can be viewed as f(aX) = \frac{1}{a}f(X) where X = s^{2} and a = \frac{n-1}{\sigma^{2}}.

These few steps will clarify this further.

Let Y = aX

P(Y \le y) = P(aX \le y)

P(Y \le y) = P(X \le \frac{1}{a}y)

F_{Y}(y) = F_{X}(\frac{1}{a}y)

\frac{d}{dy}F_{Y}(y) = \frac{d}{dy}F_{X}(\frac{1}{a}y)

Applying the fundamental theorem of calculus and chain rule together, we get,

f(y) = \frac{1}{a}f(\frac{1}{a}y)

f(aX) = \frac{1}{a}f(X)

Hence, the frequency distribution of s^{2} is

f(s^{2}) = \frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

From f(s^{2}), we can derive the probability distribution of s, i.e., f(s).

WHAT IS THE FREQUENCY DISTRIBUTION OF THE SAMPLE STANDARD DEVIATION?

Here, I refer you to this elegant approach by “Student.”

“The distribution of s may be found from this (s^{2}), since the frequency of s is equal to that of s^{2} and all that we must do is to compress the base line suitably.”

Let f_{1} = f(s^{2}) be the distribution of s^{2}.

Let f_{2} = f(s) be the distribution of s.

Since the frequency of s^{2} is equal to that of s, we can assume,

f_{1}ds^{2} = f_{2}ds

In other words, the probability of finding s^{2} in an infinitesimal interval ds^{2} is equal to the probability of finding s in an infinitesimal interval ds.

Since ds^{2} = 2s \, ds, we can reduce this by substitution to

2sf_{1}ds = f_{2}ds

or

f_{2} = 2sf_{1}

The frequency distribution of s is equal to 2s multiplied by the frequency distribution of s^{2}.

Hence, the frequency distribution of s is

f(s) = 2s\frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(s) = \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-2}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}
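As a sanity check, we can verify in R that this f(s) integrates to 1. Here n = 5 and \sigma = 1 are arbitrary choices, and \left(\frac{n-3}{2}\right)! is computed as \Gamma\left(\frac{n-1}{2}\right).

```r
# The derived density of the sample standard deviation s
n <- 5; sigma <- 1
f_s <- function(s) {
  (n - 1)^((n - 1) / 2) / (gamma((n - 1) / 2) * 2^((n - 3) / 2)) *
    s^(n - 2) / sigma^(n - 1) * exp(-(n - 1) * s^2 / (2 * sigma^2))
}

integrate(f_s, lower = 0, upper = Inf)$value   # 1
```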

WHAT IS THE FREQUENCY DISTRIBUTION OF t?

We are now ready to derive the frequency distribution of t.

Some of the following explanation can be found in R. A. Fisher’s 1939 paper. I broke it down step by step for our classroom.

t = \frac{\bar{x}-\mu}{s/\sqrt{n}}

\bar{x} - \mu = \frac{st}{\sqrt{n}}

For a given value of s, d\bar{x} = \frac{s}{\sqrt{n}}dt

The frequency distribution of \bar{x} is

\frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{n}{2\sigma^{2}}(\bar{x}-\mu)^{2}}

Remember, \bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}}).

Substituting \bar{x} - \mu = \frac{st}{\sqrt{n}}

The distribution becomes

\frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{n}{2\sigma^{2}}(st/\sqrt{n})^{2}}

or

\frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}

For a given value of s, the probability that \bar{x} will be in d\bar{x} is

df = \frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}d\bar{x}

Substituting d\bar{x} = \frac{s}{\sqrt{n}}dt, we can get

df = \frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}} \frac{s}{\sqrt{n}}dt

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}} s dt

Fisher points out that for all values of s, we can substitute the frequency distribution of s and integrate it over the interval 0 and \infty. This can be done because \bar{x} and s are independent.

So the joint distribution becomes

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma} s dt e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}} \int_{0}^{\infty}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-2}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

In the next few steps, I will rearrange some terms to convert the integral into a recognizable form.

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma} dt \int_{0}^{\infty}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}s \frac{s^{n-2}}{\sigma^{n-1}}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}} \frac{1}{\sigma^{n-1}}dt \int_{0}^{\infty}s^{n-1} e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}\frac{(n-1)^{\frac{n}{2}}}{(n-1)^{1/2}}\frac{1}{2^{\frac{n-3}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{2\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{1}{2^{\frac{n-3}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{1}{2^{\frac{n-2}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-\frac{n -1 + t^{2}}{2\sigma^{2}}s^{2}}ds

Hang in there. We will need some more concentration!

Let k = \frac{n - 1 + t^{2}}{2\sigma^{2}}

The equation becomes

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-ks^{2}}ds

Some more substitutions.

Let p = ks^{2}

Then dp = 2ksds

The equation can be reduced as

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-p}\frac{1}{2ks}dp

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \int_{0}^{\infty}\frac{1}{2k}s^{n-2} e^{-p}dp

Since s = \frac{p^{1/2}}{k^{1/2}}

The equation becomes,

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \frac{1}{2k^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

Replacing k = \frac{n - 1 + t^{2}}{2\sigma^{2}}

The equation becomes

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \frac{1}{2(\frac{n - 1 + t^{2}}{2\sigma^{2}})^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

Some more reduction …

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \frac{(2\sigma^{2})^{n/2}}{2(n - 1 + t^{2})^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \frac{(2^{n/2}\sigma^{n})}{2(n - 1 + t^{2})^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

df = \frac{1}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}} dt \frac{1}{(n - 1 + t^{2})^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

df = \frac{1}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} dt \frac{1}{(\frac{n-1+t^{2}}{n-1})^{\frac{n}{2}}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

df = \frac{1}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} dt \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}} \int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

The integral you see on the right is the Gamma function \Gamma(\frac{n}{2}), which equals (\frac{n-2}{2})! when \frac{n}{2} is a positive integer.

There we go, the t-distribution emerges

df = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}} dt

The probability distribution of t is

f(t) = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}}


This is the t-distribution with (n-1) degrees of freedom. As you can see, the function contains only n as a parameter.
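We can verify in R that the derived density coincides with the built-in dt with df = n - 1 (writing the factorials as Gamma functions: (\frac{n-2}{2})! = \Gamma(\frac{n}{2}) and (\frac{n-3}{2})! = \Gamma(\frac{n-1}{2})).

```r
# The derived t-density for a sample of size n
n <- 10
f_t <- function(t) {
  gamma(n / 2) / (sqrt(pi * (n - 1)) * gamma((n - 1) / 2)) *
    (1 + t^2 / (n - 1))^(-n / 2)
}

f_t(1.5) - dt(1.5, df = n - 1)   # essentially zero
```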

The probability of t within any limits is fully known if we know n, the sample size of the experiment.

“Student” also derived the moments of this new distribution as

E[T] = 0

V[T] = \frac{n-1}{n-3}

The function is symmetric and resembles the standard normal distribution Z.

The t-distribution has heavier tails, i.e. it has more probability in the tails than the normal distribution. As the sample size increases (i.e., as n \to \infty) the t-distribution approaches Z.

You can look at the equation and check out how it converges to Z in the limit when n \to \infty.

These points are illustrated in this animation.

I am showing the standard Z in red color. The grey color functions are the t-distributions for different values of n. For small sample sizes, the t-distribution has fatter tails — as the sample size increases, there is little difference between T and Z.
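A static version of this animation can be sketched in a couple of lines of R, using the built-in normal and t densities.

```r
# Standard normal in red; t-densities for several sample sizes in grey
curve(dnorm(x), from = -4, to = 4, col = "red", lwd = 2, ylab = "density")
for (df in c(2, 5, 10, 30)) curve(dt(x, df = df), add = TRUE, col = "grey")
```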

😫😫😫

I am pretty sure you do not have the energy to go any further this week. Neither do I.

So here is where we stand.

f(s^{2}) = \frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(s) = \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-2}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

and finally

The probability distribution of t is

f(t) = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}}

See you next week.


Lesson 72 – Jenny’s confidence, on the average

Meet Jenny. Jenny is bright and intelligent and is known as “the problem solver” among her friends. She usually goes unnoticed in the crowd due to her calm and composed nature, but by god, she is assertive when it is most needed. Her analytical neurons are razor sharp and you cannot mumbo-jumbo with her. She loves science fiction, history, and occasional Woody Allen. Oh, and she likes swimming, surfing, summer, and beaches.

Her second summer Rockaway beach trip is coming up and she is excited. With all the surfing kit packed, she pulled up the NYC Beach Water Quality website to monitor the status. “The Enterococci Bacteria Count is within limit!” she thought.

She got busy with other things, but the thought of beach samples bothered her. “Where do they take these samples from? They show Rockaway beach, but does the sample represent the whole beach? How many samples do they take? If they use one sample, how do we know what the truth is? I have been to this place several times, I wonder what the true water quality of the beach is?” These questions kept her awake. I told you she is a problem solver.

The next morning she met Joe to discuss the problem. Friends of this classroom know who Joe is. He is now the resident expert on statistical topics.


So, you want to know the true water quality from the sample data. Why don’t you develop the confidence interval of the mean water quality? This will at least give you a range, an interval where the true water quality will be.

 

How can we do that using the sample? Or, maybe I should ask, can you explain what is an interval and how to derive it.

 

Okay. Let’s get some simple/sample data first. NYC Open Data should definitely have data on beaches. Here, they have a link for DOHMH Beach Water Quality Data. The description says that this is the data of water quality sample results collected by the Department of Health and Mental Hygiene from all New York City Beaches. Amazing! Which Rockaway beach are you going to?

 

I usually go to the one on the 95th Street and stay north.

 

We can take the data from ROCKAWAY BEACH 95TH – 116TH then. Let me show you how to write a code in R to extract this subset of the data.

 

 

Hey, I can do that with ease.

 

I forgot to tell you that Jenny is a member of the local girls who code club. She pulls up her Macbook and types a few lines. Meanwhile, Joe does some coding too on the same data.

There are samples that had a result below the detection limit. If I remove those from the ROCKAWAY BEACH 95TH – 116TH data, we are left with 48 data points collected at various times from 2005 to 2018. Look.

 

Wonderful. I am guessing that the blue color triangle in the histogram is the sample mean (\bar{x}).

 

Yes. This is an estimate (good guess) for the average Enterococci Bacteria count for this part of the beach based on 48 samples over several years.

We can describe the uncertainty in this estimate using confidence intervals. The key is to realize that this estimate, i.e., the sample mean (\bar{x}), is a random variable. For instance, if we were to take another data sample from different times of the year or over different parts of the beach, we would get a slightly different sample mean. The value of the estimate changes with the sample, so we can think of the estimate as a range of values or a probability distribution.

 

Yes, that makes perfect sense, but what probability distribution does it follow?

 

The sample mean is given by the equation \bar{x} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}.
x_{i}s are independent and identically distributed random samples. Look at the equation carefully, it is the summation of random variables. Convolution. We learned in Lesson 48 that for a large enough sample size, this summation will converge to the normal distribution — Central Limit Theorem. As the samples grow (n becomes large), convolution or function multiplications yield a smooth center heavy and thin-tailed bell function — the normal density.
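A small simulation illustrates Joe's point. Here the parent population is deliberately skewed (an exponential with mean 8, a hypothetical choice), yet the sample means pile up into a bell shape.

```r
# 10,000 sample means, each from a sample of size n = 48
set.seed(4)
xbar_sim <- replicate(10000, mean(rexp(48, rate = 1 / 8)))

hist(xbar_sim, breaks = 50, main = "Sampling distribution of the mean")
```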

 

Ah, I see. So it is reasonable to assume a normal distribution for the sample mean \bar{x}.

 

 

Yes, \bar{x} follows a normal distribution with an expected value E[\bar{x}] and variance V[\bar{x}].

 

E[\bar{x}]=\mu. The sample mean is an unbiased estimate of the true mean, so the expected value of the sample mean is equal to the truth. This is in Lesson 67.

The variance of the sample mean V[\bar{x}] =\frac{\sigma^{2}}{n}. We derived this in Lesson 68. Variance tells us how widely the estimate is distributed around the center of the distribution.

The standard deviation of the sample mean, or the standard error of the estimate is \frac{\sigma}{\sqrt{n}}.

 

So, \bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}})

 

Jenny looks at the equation for a bit. She takes a piece of paper, draws on it and instantly types a few lines of code.

 

Here, I am showing this visually.

 

\bar{x} is a normal distribution. It is centered on \mu with a standard deviation of \frac{\sigma}{\sqrt{n}}. One standard deviation range is \mu \pm \frac{\sigma}{\sqrt{n}} , two standard deviations range is \mu \pm 2\frac{\sigma}{\sqrt{n}} and three standard deviations range is \mu \pm 3\frac{\sigma}{\sqrt{n}} .

 

Yup. The standard normal way of saying this is Z = \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0,1).

 

 

The Z-score!

 

 

Yes, let me ask you something. What is the area under the normal density curve between -1.96 and 1.96?

Looking up from the standard normal tables, we get P(Z \le -1.96) = 0.025, which means the area on the right side of the tail P(Z \ge 1.96) = 0.025. The area between -1.96 and 1.96 is 0.95. 95%.
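These areas can be verified with R's standard normal functions.

```r
pnorm(1.96) - pnorm(-1.96)   # about 0.95
qnorm(0.975)                 # about 1.96
```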

 

There is a 95% probability that the standard normal variable \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} is between -1.96 and 1.96.

P(-1.96 \le \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} \le 1.96) = 0.95

I will modify the inequality in this equation.

Multiplying throughout by \frac{\sigma}{\sqrt{n}}, we get P(-1.96 \frac{\sigma}{\sqrt{n}} \le \bar{x}-\mu \le 1.96 \frac{\sigma}{\sqrt{n}}) = 0.95

Subtracting \bar{x} and multiplying by -1 throughout, we get

P(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}) = 0.95

We derived that the probability of the true population mean \mu lying between two end-points \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} and \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}} is 0.95.

This interval [\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}] is called the 95% confidence interval of the population mean. The interval itself is random since it is derived from \bar{x}. As we discussed before, a different sample will have a different \bar{x} and hence a different interval or range.

There is a 95% probability that this random interval [\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}] contains the true value of \mu.
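This long-run interpretation can be simulated. In this sketch, the true mean \mu = 8 is a hypothetical value chosen for illustration; \sigma and n match the beach example.

```r
# Fraction of 95% z-intervals (known sigma) that cover the true mean
set.seed(5)
mu <- 8; sigma <- 9.95; n <- 48
covered <- replicate(20000, {
  xbar <- mean(rnorm(n, mu, sigma))
  half <- 1.96 * sigma / sqrt(n)
  xbar - half <= mu && mu <= xbar + half
})

mean(covered)   # close to 0.95
```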

 

Neat. So to generalize this to any confidence interval, we can replace 1.96 with a Z-critical value.

 

Yes. For instance, if we want a 99% confidence interval, we should find the Z-critical value that gives 99% area under the normal density curve.

 

 

That would be 2.58. So the 99% confidence interval for true mean \mu is \bar{x} - 2.58\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 2.58 \frac{\sigma}{\sqrt{n}}.

Usually, this 100(1-\alpha)\% confidence interval is described using a confidence coefficient 1-\alpha, where \alpha is between 0 and 1. For a 95% confidence interval, \alpha = 0.05, and for a 99% confidence interval, \alpha = 0.01. The Z-critical value is denoted using Z_{\frac{\alpha}{2}}.

 

Yes yes. For a 95% confidence interval, Z_{\frac{\alpha}{2}} = Z_{0.025} = 1.96.
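Both critical values quoted above can be recovered from the inverse normal CDF. A quick Python check (using `statistics.NormalDist`, available in Python 3.8+, as a stand-in for the R `qnorm` call the lesson would use):

```python
from statistics import NormalDist

# Inverse normal CDF recovers the Z-critical values
z95 = NormalDist().inv_cdf(1 - 0.05 / 2)  # Z_{0.025} for a 95% interval
z99 = NormalDist().inv_cdf(1 - 0.01 / 2)  # Z_{0.005} for a 99% interval
print(round(z95, 2), round(z99, 2))
```

This prints 1.96 and 2.58, the two values used in the derivation above.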

In summary, we can define a 100(1-\alpha)% confidence interval for the true mean \mu as

[\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}]

 

Correct. \bar{x} is the mean of a random sample of size n. The assumption is that the sample is drawn from a population with a true mean \mu and true standard deviation \sigma. The end-points [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}] are called the lower and upper confidence limits.
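The general definition can be wrapped in a small helper. A hedged Python sketch (the function name is mine; as an illustration it uses Samantha's numbers from the start of the lesson, pretending her sample standard deviation 9.06 were the known \sigma):

```python
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, alpha=0.05):
    """100(1 - alpha)% confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{alpha/2}
    half_width = z * sigma / n ** 0.5
    return xbar - half_width, xbar + half_width

# Samantha's sample: xbar = 20.6 inches, n = 30, sigma assumed 9.06 inches
lo, hi = z_confidence_interval(20.6, 9.06, 30)
print(round(lo, 2), round(hi, 2))
```

This gives [17.36, 23.84], slightly narrower than her t-based interval [17.22, 23.98] because the t-critical value 2.05 exceeds the Z-critical value 1.96 when \sigma must be estimated.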


 

Let us develop the confidence intervals of the mean water quality. The sample I have has n = 48 data points. The sample mean is \bar{x}=7.9125 counts per 100 ml.

Wait, we don’t know the value of \sigma, the true standard deviation. There is still an unknown in the equation.

 

Assume it is 9.95 counts per 100 ml.

 

 

Where did that come from?

 

 

Take my word for now and get the confidence interval.

 

Jenny reluctantly does some calculations on a piece of paper.

The 95% confidence interval for the mean water quality is

\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}.

7.9125 - 1.96\frac{9.95}{\sqrt{48}} \le \mu \le 7.9125 + 1.96\frac{9.95}{\sqrt{48}}.

5.098 \le \mu \le 10.727

 

There is a 95% probability that the true mean water quality will be between 5.098 and 10.727. Now tell me where the 9.95 came from.

 

Well, strictly speaking, your statement should have been, “there is a 95% probability of selecting a sample for which the confidence interval will contain the true value of \mu.”

 

 

 

Let me explain. 5.098 \le \mu \le 10.727 can be true or can be false depending on the sample we obtained. \bar{x} is a random variable with a sampling distribution whose mean is \mu. Our sample mean represents one random draw from this distribution. So, if we got a sample whose mean is close to the truth, the interval [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}] will contain the truth. If we got a sample whose mean is somewhat far away from the truth, its confidence interval may not contain the truth. This graphic should make it clearer. The green interval does not contain the true value; the purple interval does. Both these sample means (\bar{x}) are equally likely; it depends on what sample data we get.

Hmm. So, we should think of this phenomenon as a long-run relative frequency outcome. If we have a lot of different samples, and compute the 95% confidence intervals for these samples, in the long-run, 95% of these intervals will contain the true value of \mu.

 

Exactly. The 95% confidence level is what would happen if a large number of random intervals were constructed; not for any particular interval.

So while you were writing code to get the subset of data for ROCKAWAY BEACH 95TH – 116TH, I downloaded all the data that had ROCKAWAY BEACH in the file. There are eight locations along the beach where these samples are taken. Treating these as eight different samples, I developed the 95% confidence intervals for each of them. Here is how they look relative to the truth. One of them does not contain the true \mu. Just like this, in the long run, 5% of the samples will produce 95% confidence intervals that do not contain the true mean.
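This long-run reading can also be checked by simulation: draw many samples from a population with a known mean, build each sample's 95% interval, and count how often the truth is covered. A Python sketch under an assumed normal population with the pooled beach values (the real coliform counts are not normal, so this is only illustrative):

```python
import random

random.seed(1)
mu, sigma, n, z = 8.97, 9.95, 48, 1.96  # pooled "true" values from the beach data

trials, covered = 10000, 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    half = z * sigma / n ** 0.5
    covered += (xbar - half <= mu <= xbar + half)  # does the interval contain mu?
print(covered / trials)  # close to 0.95
```

Roughly 95% of the simulated intervals contain \mu, and roughly 5% miss it, exactly the long-run relative frequency interpretation.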

 

How did you get the true value for \mu?

 

 

Based on the principle of consistency.

\displaystyle{\lim_{n\to\infty} P(|T_{n}(\theta)-\theta|>\epsilon) = 0}.

As n approaches infinity, the sample estimate approaches the true parameter. I took data from the eight beach locations and computed the overall mean and overall standard deviation. While they are not exactly the true values, based on the idea of consistency, we can assume that they are.

\mu = 8.97 and \sigma = 9.95 counts per 100 ml.
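Consistency can be seen numerically as well: under an assumed normal population with these values, the sample mean settles toward \mu as the sample size grows. A small Python sketch:

```python
import random

random.seed(42)
mu, sigma = 8.97, 9.95

# Consistency in action: the sample mean drifts toward mu as n grows
for n in (10, 1000, 100000):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    print(n, round(xbar, 2))
```

For small n the sample mean can wander well away from 8.97; by n = 100000 it sits very close to the truth, which is why the pooled estimate from the eight locations is a reasonable stand-in for \mu.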

 

There you go. I see why you insisted on using \sigma = 9.95.

 

 

You are welcome!

 

 

But we will not always have such a large sample. Most often, the data is limited. How can we know what the true standard deviation is?

 

In that case, you can use the sample standard deviation. In your sample data case it was 6.96 counts per 100 ml, I think.

 

 

Hmm, but that is very different from the true value. Aren’t we introducing more error into the estimation of the intervals?

 

 

Perhaps. Hey, do you want to take a break? Can I buy you a drink? Is Guinness okay?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.