Lesson 80 – Riding with confidence, in R: Week 1

Monday, February 11, 2019

In which Jenny shows her students how to compute and visualize confidence intervals in R, and sends them off on a task.

“Did you know that the symbol of the Department of Parks and Recreation is a cross between the leaf of the London plane and a maple leaf?” asked Jenny as she opened her high school data class. She had recently volunteered to help a class of 40 high school kids.

The obvious response from the kids was no response.

“It is,” she said. “More than 10% of the trees are London planetrees. They grow 20 to 30 meters tall, with a trunk circumference of 3 meters.”

She projected a map on the big screen.

“These are the places I randomly selected to visit and record the diameter of the trunk of the London planetrees. You have this data in the handout. We will learn how to develop confidence intervals in R.”

For your analysis in R, the data file can be downloaded from here.

Let’s Start with the Mean

“Can someone tell me how we can get the confidence interval of the mean diameter?” she asked.

John raised his hand to answer. “Yes, John, go ahead,” said Jenny.

“The 100(1-\alpha)\% confidence interval for the true mean \mu is \bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}. If the sample size n is very large, we can substitute the sample standard deviation s in place of the unknown \sigma. However, for small sample sizes, the sample standard deviation s is itself subject to error; it may be far from the true value of \sigma. So it is preferable to use the t-distribution for the confidence interval of the mean, since \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} tends to a t-distribution with (n-1) degrees of freedom.

So I would say that the 100(1-\alpha)\% confidence interval of the population mean is the interval [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}].”

“Excellent. Let’s all compute the 95% confidence interval of the mean diameter of the trunk of the London planetrees from the sample I gave you. The 95% confidence interval of the true mean diameter (\mu) is \bar{x} - t_{0.025,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.025,(n-1)} \frac{s}{\sqrt{n}}.”

They all started calculating. Some of them pulled out t-tables to get the t-critical (t_{0.025,29}), i.e., the quantile from the t-distribution corresponding to the upper tail probability of 0.025, and 29 degrees of freedom.

Jenny stopped them. “Don’t compute the interval by hand. Let’s do it in R.

“Read the data file first. I am assuming that you all have the file in the same folder that you set the working directory to.”

# Read the data file -- Jenny's sample data #
jenny_sample_data = read.csv("jenny_data.csv",header=T)
nsamples = nrow(jenny_sample_data)

“Did you notice that the file has seven columns? The diameter of the tree trunk is recorded in the first column, tree_dbh. I am sure you figured out what the other columns are, except maybe the column named brnch_ligh, which has Yes or No entries. We will get to that in a bit.”

“Let’s compute the sample mean (\bar{x}), sample variance (s^{2}), and the sample standard deviation (s).”

#statistics of the sample #
jenny_samplemu = mean(jenny_sample_data$tree_dbh)
jenny_samplevar = var(jenny_sample_data$tree_dbh)
jenny_samplesd = sd(jenny_sample_data$tree_dbh)

“What you were about to look up from the t-table can be obtained using this command in R.”

qt(0.975,df=29)
[1] 2.04523

“The t-critical value is 2.04523.”

“We can write a few lines to compute the confidence interval for any \alpha level, and show them graphically.”

# Confidence interval of the mean using t-distribution #
alpha = 0.95 # confidence level for a 95% interval
t = alpha + (1-alpha)/2 # upper-tail probability, 0.975 here
n = nsamples

cit_lb = jenny_samplemu - qt(t,df=(n-1))*((jenny_samplesd/sqrt(n)))
cit_ub = jenny_samplemu + qt(t,df=(n-1))*((jenny_samplesd/sqrt(n)))

ci_t = c(cit_lb,cit_ub)

stripchart(ci_t,type="l",lwd=3,col="green",main="CI on Mean",xlab="Diameter (inches)")
stripchart(jenny_samplemu,add=T,lwd=3,col="green")

“The sample mean (\bar{x}) is 22.433 inches and the intervals extend out to 19.67 inches on the left and 25.20 inches on the right.”

“There is a 95% probability of selecting this sample for which the confidence interval will contain the true value of \mu.”

The kids looked happy seeing the image on their screen.
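As an aside (not part of Jenny's handout), the manual calculation can be cross-checked against base R's built-in t.test(), which reports the same t-based interval. Here is a minimal sketch using a short hypothetical vector standing in for the tree_dbh column.

```r
# Cross-check of the t-based interval with base R's t.test().
# The vector x is hypothetical -- a stand-in for tree_dbh.
x = c(21, 14, 30, 26, 18, 25, 33, 17, 22, 28)
n = length(x)

# Manual interval, exactly as in the formula above
manual = mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)

# Built-in interval reported by t.test()
builtin = as.numeric(t.test(x, conf.level = 0.95)$conf.int)

all.equal(manual, builtin) # TRUE
```

The two intervals agree to numerical precision, which confirms that t.test() uses the same t-quantile construction.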

Next, let’s explore the Variance and the Standard Deviation

“Do you remember the equation for the confidence interval of the variance and the standard deviation?” Jenny asked.

Samantha wanted to answer.

“The 100(1-\alpha)\% confidence interval of the population variance \sigma^{2} is the interval [\frac{(n-1)s^{2}}{\chi^{2}_{u,n-1}}, \frac{(n-1)s^{2}}{\chi^{2}_{l,n-1}}]. We can take the square roots of the confidence limits to get the confidence interval on the true standard deviation,” she said.

“The interval [\sqrt{\frac{(n-1)s^{2}}{\chi^{2}_{u,n-1}}}, \sqrt{\frac{(n-1)s^{2}}{\chi^{2}_{l,n-1}}}] is called the 100(1-\alpha)\% confidence interval of the population standard deviation \sigma.”

Jenny was pleased. She had an outstanding cohort; they always came prepared.

“Yes. In our case, \frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with 29 degrees of freedom. The lower and upper critical values for the 95% confidence interval, \chi^{2}_{l,29} and \chi^{2}_{u,29}, can be obtained from the Chi-square table, or, as you guessed, can be computed in R using a simple command. Try these,” she said.

qchisq(0.975,df=29)
[1] 45.72229
qchisq(0.025,df=29)
[1] 16.04707

“Of course, we can automate this and make nice graphics for the intervals. Use this code.”

# Confidence interval of the variance and standard deviation using Chisq-distribution #
u = alpha + (1-alpha)/2
l = 1 - u

chi_low = ((n-1)*jenny_samplevar)/qchisq(u,df=(n-1))
chi_up = ((n-1)*jenny_samplevar)/qchisq(l,df=(n-1))

ci_var = c(chi_low,chi_up)
ci_sd = c(sqrt(chi_low),sqrt(chi_up))

stripchart(ci_var,type="l",lwd=3,col="green",main="CI on Variance",xlab="Diameter^2 (inches^2)")
stripchart(jenny_samplevar,add=T,lwd=3,col="green")

stripchart(ci_sd,type="l",lwd=3,col="green",main="CI on Standard Deviation",xlab="Diameter (inches)")
stripchart(jenny_samplesd,add=T,lwd=3,col="green")

“The sample variance is 54.74 inches^{2}, with a lower bound of 34.72 inches^{2} and an upper bound of 98.92 inches^{2}. Notice how the intervals are not symmetric. Do you remember why?”

“The sample standard deviation is 7.40 inches. The 95% lower and upper bounds are 5.90, and 9.95 inches respectively.”

The kids spent a few minutes typing up the lines of code to get the intervals and make the graphics.

Jenny continued the lesson.

“Now, look at the third column in the data file, brnch_ligh. When I measured the trunk circumference, I also noticed that, for some trees, there were branch problems caused by lights or wires. I recorded this as a ‘yes’ or a ‘no’ depending on whether or not there was a problem. Who would have thought that nice colorful lights on the trees could be damaging 🙁

Can one of you count through the sample and tell me the estimate of the proportion of trees (\hat{p}) that are damaged?”

“0.1,” someone answered from the back.

“Great, now, that invisible person, can you tell me how we can compute the confidence interval of the true proportion?”

Christina it was. She answered, “The 100(1-\alpha)\% confidence interval for the true proportion p is [\hat{p} - Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}], assuming that the estimate \hat{p} can be approximated by a normal distribution for a reasonably large sample size.”

“And, here is how you do that in R,” added Jenny, as she typed up these lines.

# Confidence interval of the proportion using z-distribution # 
jenny_sampleproportion = length(which(jenny_sample_data$brnch_ligh=="Yes"))/nsamples

z = alpha + (1-alpha)/2

p_low = jenny_sampleproportion - qnorm(z)*sqrt((jenny_sampleproportion*(1-jenny_sampleproportion))/n)
p_up = jenny_sampleproportion + qnorm(z)*sqrt((jenny_sampleproportion*(1-jenny_sampleproportion))/n)

ci_p = c(p_low,p_up)

stripchart(ci_p,type="l",lwd=3,col="green",main="CI on Proportion",xlab="Damage Proportion")
stripchart(jenny_sampleproportion,add=T,lwd=3,col="green")

“The sample estimated damage proportion (\hat{p}) is 0.1 since there are 3 damaged trees in a sample of 30 trees. The lower bound and the upper bound of the 95% confidence interval are -0.007 and 0.207, symmetric around 0.1,” Jenny interpreted as she looked at the plot and then at the kids.

Some students were shaking their heads in disagreement. They did not like the fact that the confidence interval produced a negative value. What could have gone wrong?

Christina jumped up at once. “Didn't we learn last week that when p is near the boundaries, the sampling distribution is skewed and not symmetric? Then it is not appropriate to approximate it with a normal distribution, even for large sample sizes. Should we perhaps develop the sampling distribution and the confidence intervals of the damage proportion using the idea of the bootstrap?”

“That is an astute observation Christina,” said Jenny. “Yes, since we have an estimate close to 0, it is better to develop the bootstrap confidence intervals. That will eliminate the errors induced due to inappropriate assumptions,” she added.
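As a brief aside before the bootstrap (not part of Jenny's lesson), base R also offers an exact interval for a proportion: binom.test() inverts the binomial distribution directly (the Clopper-Pearson interval), so its bounds can never leave [0, 1]. A minimal sketch with the counts from this sample, 3 damaged trees out of 30:

```r
# Exact (Clopper-Pearson) 95% interval for 3 damaged trees out of 30.
# No normal approximation is involved, so the bounds stay inside [0, 1]
# and the interval is asymmetric around the estimate 0.1.
exact_ci = binom.test(3, 30, conf.level = 0.95)$conf.int
exact_ci
```

Like the bootstrap interval developed next, this exact interval is asymmetric and strictly positive, unlike the normal-approximation interval above.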

Let’s do a Bootstrap in R

“The basis for the bootstrap is that the sample data of 30 trees can be used to approximate the probability distribution function of the population. By putting a probability of 1/n on each data point, we use the discrete empirical distribution \hat{f} as an approximation of the population distribution f. It is easy enough to think of drawing numbers with replacement from these 30 numbers. Since each value is equally likely, the bootstrap sample will consist of numbers from the original data; some may appear more than once, and some may not appear at all in a random sample.” Jenny explained the core concept of the bootstrap once again before showing them the code to do it.

# Bootstrap confidence intervals ## 
nboot = 1000
bootstrap_replicate_proportion = matrix(NA,nrow=nboot,ncol=1)

bootstrap_replicate_mean = matrix(NA,nrow=nboot,ncol=1)
bootstrap_replicate_var = matrix(NA,nrow=nboot,ncol=1)
bootstrap_replicate_sd = matrix(NA,nrow=nboot,ncol=1)

for (i in 1:nboot)
{
ind = 1:nsamples
bootstrap_ind = sample(ind,nsamples,replace=T)

bootstrap_sample_damage = jenny_sample_data$brnch_ligh[bootstrap_ind]

bootstrap_replicate_proportion[i,1] = length(which(bootstrap_sample_damage=="Yes"))/nsamples

bootstrap_sample_diam = jenny_sample_data$tree_dbh[bootstrap_ind]

bootstrap_replicate_mean[i,1] = mean(bootstrap_sample_diam)
bootstrap_replicate_var[i,1] = var(bootstrap_sample_diam)
bootstrap_replicate_sd[i,1] = sd(bootstrap_sample_diam)
}

“Here, I am drawing the bootstrap sample 1000 times, and, for each bootstrap sample, I am computing the proportion, the mean, the variance, and the standard deviation. So, in the end, I will have 1000 replicates of the damage proportion, the sample mean, the sample variance, and the standard deviation: our sampling distributions.

Drawing a bootstrap in R is very simple. Just use the “sample” command.

ind = 1:nsamples
sample(ind,nsamples,replace=T)

I first create an indicator vector that has the numbers 1 to 30. From this vector, I draw samples with replacement to get numbers from 1 to 30; some may appear more than once, and some may not appear, depending on the sample. These are the trees that we selected as part of the bootstrap sample. From these trees, we take the diameter and whether or not each is damaged due to lights. The rest of the lines in the loop just compute the statistics from the bootstrap sample.” Jenny clearly explained her lines of code.

“Now, let’s plot the distribution of the damage proportion derived from the bootstrap samples. Type these lines.”

# Distribution of the sample proportion #
hist(bootstrap_replicate_proportion,main="Bootstrap Replicates of the Damage Proportion",font=2,font.lab=2,xlab="Sample Proportion")

ci_p_boot = quantile(bootstrap_replicate_proportion,c(0.025,0.975))
abline(v=ci_p_boot,col="red")

“The 95% lower and the upper confidence limits are the 2.5th and the 97.5th percentiles of the sampling distribution. We can show them as red lines on the distribution.”

“Compare this to the previous confidence interval computed by assuming a normal distribution. We can see that the sampling distribution of the proportion is skewed and the interval is asymmetric. It also does not produce any negative values,” said Jenny as she showed a few more lines of tricks.

# compare with previous CI # 
stripchart(ci_p_boot,type="l",lwd=3,col="black",main="CI on Proportion",xlab="Damage Proportion")
stripchart(jenny_sampleproportion,add=T,lwd=3,col="black")
text(0.17,1.1, "Bootstrap Confidence Interval")

stripchart(ci_p,type="l",add=T,at=0.7,lwd=3,col="green")
stripchart(jenny_sampleproportion,at=0.7,add=T,lwd=3,col="green")

“I have a few more things in here to compute the bootstrap confidence intervals of the mean, the variance, and the standard deviation. You can check them out for yourself.”

Here is the full code.
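For instance, a percentile bootstrap interval of the mean can be sketched in a few lines. The sketch below is self-contained, with a hypothetical sample standing in for the tree_dbh column; on Jenny's replicates, the same quantile() call applies to bootstrap_replicate_mean.

```r
# Sketch: percentile bootstrap CI of the mean.
# x is a hypothetical sample standing in for tree_dbh.
set.seed(1)
x = c(21, 14, 30, 26, 18, 25, 33, 17, 22, 28,
      19, 24, 27, 16, 31, 23, 20, 29, 15, 22,
      26, 18, 24, 21, 28, 19, 25, 30, 17, 23)

nboot = 1000
boot_means = replicate(nboot, mean(sample(x, length(x), replace = TRUE)))

# 95% interval: the 2.5th and 97.5th percentiles of the replicates
quantile(boot_means, c(0.025, 0.975))
```

The same recipe applies to the variance and the standard deviation: replace mean() inside replicate() with var() or sd().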

Jenny was preparing to end the class.

“Can we know the true mean, variance, and the proportion?” someone asked.

Jenny was happy to hear this. She had planned an assignment for them exactly for this.

“From this sample alone, we cannot say. We can, however, say that there is a 95% probability of selecting this sample for which the confidence interval will contain the true value of \mu, \sigma^{2} and p. But, I want you to understand it by doing it yourself.

So, here’s the deal. This week, I want you to go out into the city and collect a sample of 30 trees. For each tree, I want you to measure the diameter and whether or not it is damaged. Each one of you will randomly select 30 locations in the city and bring back a sample. Make sure you more or less cover all the boroughs. In the end, we will have 40 different samples, each with 30 data points. We will have much fun analyzing that data, answering your questions, and getting a deeper understanding of the confidence intervals, in R,” said Jenny, ending her session.

The kids left with excitement, ready to spend their spare time during the week collecting data on the trees in the City. You may encounter them this week; they might be hugging a tree or two. Don't mistake them for tree-hugging hippies. They are teaching you how to be confident with data.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

