February 2019 – dataanalysisclassroom

Lesson 82 – Riding with confidence, in R: Week 3

On the eleventh day of February 2019, Jenny showed her students how to compute and visualize confidence intervals in R. Her demo included the confidence interval on the mean, variance/standard deviation, and proportions. She also presented the code to develop bootstrap confidence intervals for any parameter. All this was based on a random sample that she collected.

But she wanted the students to have hands-on experience of data gathering and know the real meaning of confidence intervals, in that, for a 95% level, there is a 95% probability of selecting a sample for which the confidence interval will contain the true parameter value, $\mu, \sigma^{2}$ or p. So she sent them out to collect data through random sampling. The 40 students each brought back samples. 40 different samples of 30 trees each.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On the eighteenth day of February 2019, the students came back with the samples. They all developed confidence intervals from their samples. Jenny showed them how the sample mean was converging to the true mean, and that over 40 confidence intervals, roughly two (5%) may not contain the truth. They also learned how the interval shrinks as one gets more and more samples.

Then, Jenny wanted them to understand the issues with sampling. So she sent them off for a second time. This time, the students divided themselves into teams, visiting different boroughs and collecting samples only from that borough.

Today, they are all back with their new samples.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Monday, February 25, 2019

In which the students come back with new samples, borough wise. Jenny explains the traps due to sampling bias and the basic types of sampling strategies.

It is Samantha who is always at the forefront of the class. She was leading seven other students in team Manhattan. Each of these eight students gathered data for 30 trees in Manhattan. So, they create eight different confidence intervals — representing the intervals from Manhattan. Samantha shows her example — the locations of the data that she gathered and the confidence interval of the true mean.

The 95% confidence interval of the true mean diameter $\mu$ is $\bar{x} - t_{0.025,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.025,(n-1)} \frac{s}{\sqrt{n}}$

“My interval may or may not contain the true mean diameter, but I know that there is a 95% probability of selecting a sample whose confidence interval will contain the truth,” she said.

John was heading team Bronx. He followed Samantha to show the places he visited and the confidence interval from his sample. His team also had eight students, and each of them gathered data for 30 trees from the Bronx.

John strongly feels that he may be the unfortunate one whose confidence interval will not contain the truth. He may be one among the two. Let’s see.

The leaders of the other three teams, Justin, Harry, and Susan also prepare their confidence intervals. They all go up to the board and display their intervals.

“From last week, we know that the approximation for the true mean $\mu$ is 21.3575,” said Samantha as she projects the vertical line to show it. As you all know, Jenny showed last week that the sample mean was converging to 21.3575 as the samples increased. The principle of consistency.

As they discussed among themselves, Jenny entered the room and gazed at the display. She knew something that the students did not. But she concealed her amusement and passed along the file to collect the data from all 40 students.

Like last week, the students fill out the file with the data they collected, borough wise. The file will have a total of 1200 row entries. Here is the file. As you have rightly observed, the boro code column is now ordered since the file went from team Manhattan to team Staten Island.

Jenny used the same lines as last week to create a plot to display all the 40 confidence intervals.

Assuming that the file is in the same folder you set your working directory to, you can use this line to read the data into R workspace.

# Reading the tree data file from the students - boro wise #
students_data_boro = read.csv("students_data_borough.csv",header=T)

Use these lines to plot all the confidence intervals.

# Jenny puts it all together and shows #
# 40 students went to collect data from specific boroughs in NYC #
# Each student collects 30 samples #
# Each student develops the confidence intervals #
 
 nstudents = 40
 nsamples = 30
 
 students = 1:nstudents

 alpha = 0.95
 t = alpha + (1-alpha)/2
 
 sample_mean = matrix(NA,nrow=nstudents,ncol=1)
 sample_sd = matrix(NA,nrow=nstudents,ncol=1)
 ci_t = matrix(NA,nrow=nstudents,ncol=2)
 
 for(i in 1:nstudents)
 {
   ind = which(students_data_boro$student_index == i)
   sample_data = students_data_boro$tree_dbh[ind]
   sample_mean[i] = mean(sample_data)
   sample_sd[i] = sd(sample_data)
   # nsamples (30 trees per student) is the sample size n in the formula
   cit_lb = sample_mean[i] - qt(t,df=(nsamples-1))*((sample_sd[i]/sqrt(nsamples)))
   cit_ub = sample_mean[i] + qt(t,df=(nsamples-1))*((sample_sd[i]/sqrt(nsamples)))
   ci_t[i,] = c(cit_lb,cit_ub)
 }

# Plot all the CIs #
 stripchart(ci_t[1,],type="l",col="green",main="",xlab="Diameter (inches)",xlim=c(9,30),ylim=c(1,nstudents))
 
 stripchart(sample_mean[1],add=T,col="green")

 for (i in 2:nstudents)
 {
   stripchart(ci_t[i,],type="l",col="green",main="",add=T,at = i)
   stripchart(sample_mean[i],col="green", add=T, at = i)
 }

Once you execute these lines, you will also see this plot in your plot space.

“It looks somewhat different than the one we got last time,” said John.
“Let me see how many of them will contain the truth,” he added as he typed these lines.

He looks through all the confidence intervals for whether or not they cover the truth using an if statement. He calls them “false_samples.” Then, he plots all the confidence intervals once again, but this time, he uses a red color for the false samples. He also added the borough names to give a reference point.

nyc_random_truth = 21.3575

false_samples = students

 for (i in 1:nstudents)
 {
   if( (ci_t[i,1] > nyc_random_truth) || (ci_t[i,2] < nyc_random_truth) ) 
 {false_samples[i]=1} else 
 {false_samples[i]=0}
 }

false_ind = which(false_samples == 1)

# Plot all the CIs; now show the false samples #
 stripchart(ci_t[1,],type="l",col="green",main="",xlab="Diameter (inches)",xlim=c(9,30),ylim=c(1,nstudents))

 stripchart(sample_mean[1],add=T,col="green")

 for (i in 2:nstudents)
 {
   stripchart(ci_t[i,],type="l",col="green",main="",add=T,at = i)
   stripchart(sample_mean[i],add=T,col="green",at = i)
 }

 abline(v=nyc_random_truth,lwd=3)

 # seq_along() skips the loop cleanly if no interval misses the truth
 for (i in seq_along(false_ind))
 {
   j = false_ind[i]
   stripchart(ci_t[j,],type="l",col="red",lwd=3, main="", add = T, at = j)
   stripchart(sample_mean[j],col="red", lwd=3, add=T, at = j)
 }

 text(18,4,"Manhattan")
 text(24,12,"Bronx")
 text(27,20,"Brooklyn")
 text(28,28,"Queens")
 text(28,37,"Staten Island")

Try it yourself. You will also see this plot.

The students are puzzled. Clearly, there are more than 5% intervals that do not cover the truth. Why?

Jenny explains sampling bias

Jenny now explains to them about sampling bias. She starts with a map of all the data that the students brought.

We will get a more detailed explanation for creating maps in R in some later lessons. For now, you can type these lines that Jenny used to create a map.

### Full Map ###
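 # Only these three packages are called by the lines below;
 # maptools, gpclib, mapdata and fields from the original list are not used here
 library(maps)          # map()
 library(RColorBrewer)  # brewer.pal()
 library(classInt)      # classIntervals(), findColours()
 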
 
 plotvar <- students_data_boro$tree_dbh
 
 nclr <- 6 # Define number of colours to be used in plot
 
 plotclr <- brewer.pal(nclr,"Greens") # Define colour palette to be used
 
# Define colour intervals and colour code variable for plotting
 class <- classIntervals(plotvar, nclr, style = "quantile")
 colcode <- findColours(class, plotclr)
 plot(students_data_boro$longitude,students_data_boro$Latitude,cex=0.55,pch=15, col = colcode, xlab="Longitude",ylab="Latitude",font=2,font.lab=2)

 map("county",regions ="New York",add=T)

 title("London Planetrees")

 legend("topleft", legend = names(attr(colcode, "table")), fill = attr(colcode, "palette"), cex = 1, bty = "n",title="Diameter (inches)")

“Look at this map. I am showing the location of the tree based on the latitude and longitude you all recorded. Then, for each point, I am also showing the diameter of the tree using a color bar. Thick trees, i.e., those with larger diameters are shown in darker green. Likewise, thin trees are shown in lighter green. Do you notice anything?” asked Jenny.

Samantha noticed it right away. “The trees in Manhattan have smaller diameters. Mostly, they have dull green shade,” she said.

“Precisely,” Jenny continues. “The trees are not all randomly distributed. They are somewhat clustered with smaller diameters in Manhattan and Bronx and larger diameters in the other boroughs.

Since you collected all your samples from a specific borough, there is a risk of sampling bias.

We can make good inferences about the population only if the sample is representative of the population as a whole.

In other words, the distribution of the sample must be like the distribution of the population from which it comes. In our case, the trees in Manhattan are not fully representative of all the trees in the city. There was sampling bias, a tendency to collect a sample that is not entirely representative of the population.

For team Manhattan, since the distribution of your sample is dissimilar to that of the population, your statements about the truth are not accurate. You will have a bias — poor inference.

See, the sample mean also does not converge. Even at n=1200, there is still some element of variability.

Last week when you collected data for the trees, I asked you to gather them randomly by making sure that you visit all the boroughs. In other words, I asked you to collect random samples. Randomly visiting all the boroughs will avoid the issues arising from sampling bias. They give a more representative sample and minimize the errors in the inference. That is why we did not see this bias last week.”
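If you do not have the students' files at hand, you can still see the effect Jenny describes with made-up data. The lines below are a minimal sketch, not Jenny's analysis: the city, the borough means (16 to 26 inches), the common standard deviation of 8 inches, and the helper `covers_truth` are all assumptions introduced for illustration.

```r
set.seed(2019)  # arbitrary seed so the run is repeatable

# Hypothetical city: five boroughs whose trees differ in mean diameter.
# The means, sd and sizes are made up for illustration.
boro_names = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")
boro_means = c(16, 19, 22, 24, 26)
trees_per_boro = 20000

city = data.frame(
  boro = rep(boro_names, each = trees_per_boro),
  dbh  = rnorm(5 * trees_per_boro,
               mean = rep(boro_means, each = trees_per_boro), sd = 8)
)
city_mu = mean(city$dbh)   # the "truth" for this simulated city

# TRUE if the 95% t-interval built from sample x contains city_mu
covers_truth = function(x)
{
  half_width = qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
  abs(mean(x) - city_mu) <= half_width
}

# 40 students, 30 trees each, sampled from the whole city
random_cover = replicate(40, covers_truth(sample(city$dbh, 30)))

# 40 students, 8 per borough, each sampling only within their borough
student_boro = rep(boro_names, each = 8)
boro_cover = sapply(student_boro, function(b)
{
  covers_truth(sample(city$dbh[city$boro == b], 30))
})

print(sum(random_cover))   # intervals covering the truth, city-wide sampling
print(sum(boro_cover))     # intervals covering the truth, borough-wise sampling
```

With city-wide sampling, most of the 40 intervals should cover the simulated truth. With borough-wise sampling, the intervals from the boroughs whose mean sits far from the city mean will mostly miss it, no matter how carefully each student sampled within the borough.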

“Are there types of sampling?” asked Justin.

Jenny replied, “At the very basic level, there are the “simple random sampling” method, the “stratified random sampling” method, and the “cluster random sampling” method. One of these days, I will show you how to implement these sampling strategies in R. For now let’s talk about their basics.

What you did in the first week was a simple random sampling method. Your sampling frame was all possible London planetrees in NYC. Each tree in this frame has an equal chance of being selected. From this frame, you randomly selected 30 trees. This is sampling without replacement. You did not take the measurements for the same tree two times. Computationally, what you did was to draw without replacement, a sequence of n random numbers from 1 to N. On average, each borough will show up in your sample roughly in proportion to its share of the trees in the frame.

Then there is the stratified random sampling method. Here, we can divide the population into strata, that is, subpopulations or separate sampling frames. Within each frame or stratum, we can do simple random sampling to collect data. The number of samples taken from each stratum or subpopulation is proportional to the size of the stratum. In other words, if we know the percentage of trees in Manhattan compared to the total number of trees, we can sample approximately that percentage from the Manhattan stratum. One thing I can do is to assume that each of your teams got a simple random sample from a stratum, and combine the five teams to give me a full representative sample. An inference from this combined sample will be more accurate than one from an individual stratum sample.

In the cluster random sampling method, we can first divide the population into clusters and then randomly select some clusters. The data from these clusters will make up the sample. Imagine that we divide the city into zip codes — each zip code is a cluster. Then we can randomly select some zip codes as part of our sampling strategy. All the trees in these zip codes make up our cluster random sample. However, if there is not much variability within each cluster, we run the risk of not representing the entire population, and hence poor inference.

We can also do systematic sampling, like selecting every 10th tree, but again, we should ensure that we cover the range. If not, we might get a biased sample.”
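Until Jenny gets to her own implementation, here is a rough sketch of what the four strategies can look like in R. Everything in it is an assumption made for illustration: the sampling frame `frame`, its borough shares, the 200 zip-code labels, the sample size of 300, and the choice of 6 clusters.

```r
set.seed(25)  # arbitrary seed so the run is repeatable

# Hypothetical sampling frame of N trees; boroughs and zip labels are made up
N = 10000
frame = data.frame(
  tree_id = 1:N,
  boro = sample(c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island"),
                N, replace = TRUE, prob = c(0.1, 0.15, 0.3, 0.35, 0.1)),
  zip = sample(1:200, N, replace = TRUE)   # 200 made-up zip-code clusters
)
n = 300   # target sample size

# 1. Simple random sampling: n row numbers from 1 to N, without replacement
srs = frame[sample(N, n), ]

# 2. Stratified random sampling with proportional allocation
rows_by_boro = split(1:N, frame$boro)
strat_rows = unlist(lapply(rows_by_boro, function(rows)
{
  n_h = round(n * length(rows) / N)    # this stratum's share of the sample
  rows[sample.int(length(rows), n_h)]  # simple random sample within the stratum
}))
strat = frame[strat_rows, ]

# 3. Cluster random sampling: pick 6 zip codes, keep every tree in them
chosen_zips = sample(unique(frame$zip), 6)
clus = frame[frame$zip %in% chosen_zips, ]

# 4. Systematic sampling: random start, then every k-th tree
k = floor(N / n)
start = sample(k, 1)
sys = frame[seq(start, N, by = k), ]

# Borough shares in the frame versus the stratified sample
print(round(prop.table(table(frame$boro)), 2))
print(round(prop.table(table(strat$boro)), 2))
```

The two tables at the end should nearly match, since proportional allocation forces the stratified sample to mirror the borough shares of the frame. Because of rounding within each stratum, the stratified sample may differ from 300 by a tree or two.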

“How did you know that the borough wise sampling would be biased?” asked someone.

Well, for a one-line answer, you can say it was an educated guess. For a one-lesson answer, you should wait till next week.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 81 – Riding with confidence, in R: Week 2

Monday, February 18, 2019

In which the students come back with samples. Jenny shows them sampling distributions, the real meaning of confidence interval, and a few other exciting things using their data. She sends them off on a second task.

Where is the Mean

Samantha, John, and Christina explore

Samantha: These are the London planetrees I have data for.

The $100(1-\alpha)\%$ confidence interval of the population mean ( $\mu$ ) is the interval $[\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}]$ .

I have a sample of 30 trees. For these 30 trees, the sample mean is 20.6 inches and the sample standard deviation is 9.06 inches. Based on this, the confidence interval of $\mu$ is $[20.6 - 2.05\frac{9.06}{\sqrt{30}}, 20.6 + 2.05\frac{9.06}{\sqrt{30}}]$ .

[17.22 inches, 23.98 inches]
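If you want to check Samantha's arithmetic in R, the lines below are a small sketch that plugs her reported summary statistics into the t-interval formula. The variable names are made up for this check.

```r
# Samantha's summary statistics, as stated above
sam_n    = 30      # sample size
sam_xbar = 20.6    # sample mean (inches)
sam_s    = 9.06    # sample standard deviation (inches)

# t-critical value for a 95% interval with n - 1 degrees of freedom
sam_tcrit = qt(0.975, df = sam_n - 1)

# Lower and upper confidence limits
samantha_ci = sam_xbar + c(-1, 1) * sam_tcrit * sam_s / sqrt(sam_n)
print(round(samantha_ci, 2))
```

Rounded to two decimals, this reproduces the interval she reported.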

John: I collected data for 30 trees too. Here is my data.

And here is the confidence interval I came up with; [17.68 inches, 24.05 inches]. For my data, the sample mean is 20.87 inches and the sample standard deviation is 8.52 inches.

Christina: I have a sample of 30 too. Here is where I took them from.

And, here is the 95% confidence interval of the mean; [19.9 inches, 24.9 inches].

Their sample statistics are different. Their confidence intervals are different. They begin to wonder whether their interval contains the truth, $\mu$ , or, whose interval contains the truth?

Jenny puts all the intervals in context

Jenny shares with the students, an empty data file with the following headers.

The students fill out the file with the data they collected last week. Her class has 40 students, so when the file came back to Jenny, it had a total of 40*30 = 1200 row entries.

Jenny uploaded this file here, for the analysis she is going to show the class.

Jenny: I am sure you all had much fun last week visiting different places in the city and collecting data for the analysis. I am hoping that all of you randomly selected the places to visit. Based on what Sam and Christina showed, it looks like the places are fairly spread out — so we would have gotten a random sample from the population of the trees.

Sam, John, and Christina; the three of them computed the 95% confidence interval of the true mean $\mu$ based on their sample statistics. They found their intervals to be different.

Let’s look at all the 40 confidence intervals. Now that I have the data from you all, I will show you how to do it in R. Essentially, I will take each one of your data, compute the sample statistics and use them to compute the respective confidence intervals. We have 40 different samples, so we will get 40 different confidence intervals — each interval is a statement about the truth. If you remember what we have been discussing about the confidence intervals, for a 95% confidence level,

There is a 95% probability of selecting a sample whose confidence interval will contain the true value of $\mu$ .

In other words, approximately 95% of the 40 samples (38 of them) may contain the truth. 5% (2 of them) may not contain the truth.
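To get a feel for this statement before looking at the real data, you can simulate it. The sketch below assumes a normal population with a made-up mean of 21.36 inches and standard deviation of 9 inches, and imagines 2000 students instead of 40 so that the long-run proportion is easier to see.

```r
set.seed(82)  # arbitrary seed so the run is repeatable

# Assumed population: normal with a known mean and sd (made-up values)
sim_mu    = 21.36
sim_sigma = 9
sim_n     = 30      # trees per student
sim_reps  = 2000    # number of imaginary students

covered = logical(sim_reps)
for (r in 1:sim_reps)
{
  x = rnorm(sim_n, mean = sim_mu, sd = sim_sigma)
  half_width = qt(0.975, df = sim_n - 1) * sd(x) / sqrt(sim_n)
  # TRUE when this student's interval contains the known mean
  covered[r] = (mean(x) - half_width <= sim_mu) && (sim_mu <= mean(x) + half_width)
}

coverage = mean(covered)
print(coverage)   # should be close to 0.95
```

Any one run of 40 students can produce zero, two, or four misses. The 95% describes the sampling procedure over many repetitions, not a guarantee for a particular class.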

Let’s see who is covering the truth and who is not!

I want you all to work out the code with me. Here we go.

First, we will read the data file into R workspace.

# Reading the tree data file from the students # 
 students_data = read.csv("students_data.csv",header=T)

Next, use the following lines to compute the confidence interval for each student. I am essentially repeating the computation of sample mean, sample standard deviation and the confidence intervals, in a loop, 40 times, one for each student.

#40 students went to collect data from random locations in NYC # 
#Each student collects 30 samples #
#Each student develops the confidence intervals #
 
 nstudents = 40
 nsamples = 30
 
 students = 1:nstudents

 alpha = 0.95
 t = alpha + (1-alpha)/2
 
 sample_mean = matrix(NA,nrow=nstudents,ncol=1)
 sample_sd = matrix(NA,nrow=nstudents,ncol=1)
 
 ci_t = matrix(NA,nrow=nstudents,ncol=2)
 
 for(i in 1:nstudents)
  {
  ind = which(students_data$student_index == i)
  
  sample_data = students_data$tree_dbh[ind]
  
  sample_mean[i] = mean(sample_data)
  sample_sd[i] = sd(sample_data)
  
  # nsamples (30 trees per student) is the sample size n in the formula
  cit_lb = sample_mean[i] - qt(t,df=(nsamples-1))*((sample_sd[i]/sqrt(nsamples)))
  cit_ub = sample_mean[i] + qt(t,df=(nsamples-1))*((sample_sd[i]/sqrt(nsamples)))
  
  ci_t[i,] = c(cit_lb,cit_ub)
  }

Now, let’s plot the 40 intervals to see them better. A picture is worth a thousand numbers. Use these lines. They explain themselves.

#Plot all the CIs #
stripchart(ci_t[1,],type="l",col="green",main="",xlab="Diameter (inches)",xlim=c(15,30),ylim=c(1,nstudents))
stripchart(sample_mean[1],add=T,col="green")

for (i in 2:nstudents)
 {
   stripchart(ci_t[i,],type="l",col="green",main="",add=T,at = i)
   stripchart(sample_mean[i],col="green", add=T, at = i)
 }

The students are overjoyed looking at the pretty image.

John: This is neat. All our intervals are different, owing to each one of us bringing a different random sample. How we know which one of us contains the truth and which one of us does not is still not clear.

Jenny: We can get a close approximation based on the principle of consistency.

$\displaystyle{\lim_{n\to\infty} P(|T_{n}(\theta)-\theta|>\epsilon)} = 0$

As n approaches infinity, the sample estimate approaches the true parameter. We can take the full data record that you all brought, 1200 data points and compute the overall mean.

true_mu = mean(students_data$tree_dbh)

While it is not exactly the true value, based on the idea of consistency, we can assume that it is converging to the true one.

$\mu$ = 21.3575 inches.

Look at this.

# Truth Convergence Plot #
 true_mean = sample_mean
 for (i in 1:nstudents)
 {
   ind = which(students_data$student_index <= i)
   true_mean[i] = mean(students_data$tree_dbh[ind])
 }

plot(true_mean,type="o",xlab="Increasing Sample",ylab="Convergence of the Sample Mean",font=2,font.lab=2,ylim=c(20,22))

My sixth sense and divine visions also tell me that this is converging to the truth!

Now, we can check which of these intervals contains the true mean $\mu$ , 21.3575 inches.

You can use this simple code to check.

false_samples = students
 for (i in 1:nstudents)
 {
   if( (ci_t[i,1] > true_mu) || (ci_t[i,2] < true_mu) ) 
 {false_samples[i]=1} else    
  {false_samples[i]=0}
 }
 false_ind = which(false_samples == 1)

It looks like number 4 and number 26 are not covering the truth. Harry and Miguel.

Let us point them out. Here, look at another pretty plot. The thick black vertical line is the truth, 21.3575 inches. The brown line is how the sample mean ( $\bar{x}$ ) is converging to the truth ( $\mu$ ) as we collect more samples. Harry and Miguel stand out. That is 2 out of 40; 5%.

# Plot all the CIs; now show the false samples #
 stripchart(ci_t[1,],type="l",col="green",main="",xlab="Diameter (inches)",xlim=c(15,30),ylim=c(1,nstudents))
 stripchart(sample_mean[1],add=T,col="green")
 
for (i in 2:nstudents)
 {
   stripchart(ci_t[i,],type="l",col="green",main="",add=T,at = i)
   stripchart(sample_mean[i],add=T,col="green",at = i)
 }

 abline(v=true_mu,lwd=3)
 lines(true_mean,students,type="o",col="brown",cex=0.5)

 # seq_along() skips the loop cleanly if no interval misses the truth
 for (i in seq_along(false_ind))
 {
   j = false_ind[i]

   stripchart(ci_t[j,],type="l",col="red",lwd=3, main="", add = T, at = j)
   stripchart(sample_mean[j],col="red", lwd=3, add=T, at = j)
 }

 text(16,4.6,"Harry")
 text(26,26.6,"Miguel")

There is a 95% probability of selecting a sample whose confidence interval will contain the true value of $\mu$ .

Miguel: What happens to the confidence interval when we compute it with larger sample sizes? I mean, what is the relationship between the confidence interval and sample size?

Jenny: Look at the equation for the confidence interval of the population mean; $[\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}]$ .

Theoretically, as n tends to infinity, the width of the interval shrinks to 0, since the standard error $\frac{s}{\sqrt{n}}$ goes to 0 while the sample mean converges to the true mean. So, as n gets larger, the interval gets smaller. Let me show you how it works. Take our data, and, just as we computed the sample mean for increasing sample size to see the convergence, compute the confidence interval also at each step.

# How does CI change with more sample size
 sample_mean_update = matrix(NA,nrow=nstudents,ncol=1)
 sample_sd_update = matrix(NA,nrow=nstudents,ncol=1)
 
ci_t_update = matrix(NA,nrow=nstudents,ncol=2)

 for(i in 1:nstudents)
 {
   # pool the data from students 1 to i, so the sample grows by 30 at each step
   ind = which(students_data$student_index <= i)
   ns = length(ind)

   sample_data = students_data$tree_dbh[ind]

   sample_mean_update[i] = mean(sample_data)
   sample_sd_update[i] = sd(sample_data)

   cit_lb = sample_mean_update[i] - qt(t,df=(ns-1))*((sample_sd_update[i]/sqrt(ns)))
   cit_ub = sample_mean_update[i] + qt(t,df=(ns-1))*((sample_sd_update[i]/sqrt(ns)))

   ci_t_update[i,] = c(cit_lb,cit_ub)
 }

Now, we can plot these confidence intervals against increasing sample size to see what happens. Just remember that each interval is based on a larger sample than the previous one. The first interval has 30 data points, the second one has 60 data points, the third one, 90, the fourth, 120, etc.

# Plot all the CIs #
 stripchart(ci_t_update[1,],vertical=T,type="l",col="green",main="",ylab="Diameter (inches)",ylim=c(16,25),xlim=c(1,nstudents),xlab="Increasing Sample Size")
stripchart(sample_mean_update[1],vertical=T,add=T,col="green")

 for (i in 2:nstudents)
 {
   stripchart(ci_t_update[i,],type="l",col="green",main="",add=T,at = i,vertical=T)
   stripchart(sample_mean_update[i],col="green", add=T, at = i,vertical=T)
 }
 
lines(students,true_mean,type="o",col="brown",cex=0.5)

See how the interval gets smaller and smaller as the sample size increases. If we keep adding samples, the interval vanishes, as we will approach the entire population.
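The shrinkage comes straight from the formula: the half-width of the interval is $t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}$. As a small sketch, the lines below hold the sample standard deviation fixed (Samantha's 9.06 inches, used purely as an illustrative value) and vary only n. The sample sizes in `n_grid` are arbitrary choices.

```r
# Fixed standard deviation (illustrative) and a grid of sample sizes
s_fixed = 9.06
n_grid  = c(30, 60, 120, 300, 600, 1200)

# Half-width of the 95% t-interval at each sample size
half_width = qt(0.975, df = n_grid - 1) * s_fixed / sqrt(n_grid)
print(data.frame(n = n_grid, half_width = round(half_width, 2)))
```

Quadrupling the sample size roughly halves the half-width, because of the $\sqrt{n}$ in the denominator. In the real data s also changes a little from step to step, so the plotted intervals do not shrink quite this smoothly.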

T or Z

Christina: If we have a large sample, why can’t we use $[\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}]$ as the $100(1-\alpha)\%$ confidence interval for the true mean $\mu$ ? For a large sample size, we can assume s, the sample standard deviation of the large sample as $\sigma$ .

Jenny: Yes, we can do that. As n increases, the difference between T and Z is minimal. However, it is better to use t-distribution for confidence intervals of $\mu$ , so we don’t fall into traps. I will show those traps next week.
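You can see how small the difference between T and Z becomes by comparing the two critical values directly. This is a short sketch; the sample sizes in `n_values` are arbitrary choices.

```r
# 95% critical values: t with (n - 1) degrees of freedom versus standard normal
n_values = c(5, 10, 30, 100, 1000)
t_crit = qt(0.975, df = n_values - 1)
z_crit = qnorm(0.975)

print(data.frame(n = n_values, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3)))
```

The t-critical value is always larger than the Z-critical value, so the t-interval is a little wider, and the gap closes as n grows.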

Where is the Standard Deviation and Proportion

Jenny: Can you follow the same logic and develop the confidence intervals for $\sigma$ ? How many of you will cover the truth, how many of you will not?

Again, approximately, 95% of the intervals will cover the true $\sigma$ , i.e., roughly 38 of your intervals will have the truth, 2 will not. Check this out.

alpha = 0.95
u = alpha + (1-alpha)/2
l = 1 - u
 
sample_var = matrix(NA,nrow=nstudents,ncol=1)
ci_sd = matrix(NA,nrow=nstudents,ncol=2)

 for(i in 1:nstudents)
 {
  ind = which(students_data$student_index == i)
  ns = length(ind)
 
  sample_data = students_data$tree_dbh[ind]

  sample_var[i] = var(sample_data)

  # limits on the variance; the upper quantile gives the lower limit
  chi_low = ((ns-1)*sample_var[i])/qchisq(u,df=(ns-1))
  chi_up = ((ns-1)*sample_var[i])/qchisq(l,df=(ns-1))

  # square roots give the limits on the standard deviation
  ci_sd[i,] = c(sqrt(chi_low),sqrt(chi_up))
 }

true_sd = sqrt((sum((students_data$tree_dbh-true_mu)^2))/length(students_data$tree_dbh))

# False samples #
false_samples_sd = students
 for (i in 1:nstudents)
 {
   if( (ci_sd[i,1] > true_sd) || (ci_sd[i,2] < true_sd) ) 
 {false_samples_sd[i]=1} else 
 {false_samples_sd[i]=0}
 }

false_sd_ind = which(false_samples_sd == 1)

# Plot all the CIs; now show the false samples #
stripchart(ci_sd[1,],type="l",col="green",main="CI on Standard Deviation",xlab="Diameter (inches)",xlim=c(5,15),ylim=c(1,nstudents))
stripchart(sample_sd[1],add=T,col="green")
 
for (i in 2:nstudents)
 {
   stripchart(ci_sd[i,],type="l",col="green",main="",add=T,at = i)
   stripchart(sample_sd[i],add=T,col="green",at = i)
 }
 
abline(v=true_sd,lwd=3)

Jenny: Here all the intervals cover the truth, but, in the long-run, 95% of the intervals cover the truth.
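The long-run statement can be checked with a simulation. The sketch below assumes a normal population with made-up values (mean 21.36 inches, standard deviation 9 inches) and repeats the Chi-square interval for 2000 imaginary students.

```r
set.seed(81)  # arbitrary seed so the run is repeatable

# Assumed normal population (made-up values)
sd_true = 9
sd_n    = 30      # trees per student
sd_reps = 2000    # number of imaginary students

covered_sd = logical(sd_reps)
for (r in 1:sd_reps)
{
  x = rnorm(sd_n, mean = 21.36, sd = sd_true)
  lower = sqrt((sd_n - 1) * var(x) / qchisq(0.975, df = sd_n - 1))
  upper = sqrt((sd_n - 1) * var(x) / qchisq(0.025, df = sd_n - 1))
  # TRUE when this student's interval contains the known standard deviation
  covered_sd[r] = (lower <= sd_true) && (sd_true <= upper)
}

coverage_sd = mean(covered_sd)
print(coverage_sd)   # should be close to 0.95
```

Keep in mind that the Chi-square interval relies on the data being normally distributed. The simulation satisfies that assumption by construction, which real tree diameters may not.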

John: How about proportion?

Jenny: Yes, can you all develop the confidence intervals for the proportion of damaged trees? I want you to use bootstrap confidence intervals for p instead of the one based on the assumption of normal distribution.

So the students typed away a few lines of code to finally create this.

# Confidence Interval of p #
 alpha = 0.95
 u = alpha + (1-alpha)/2
 l = 1 - u
 
nboot = 1000

 sample_p = matrix(NA,nrow=nstudents,ncol=1)
 ci_p_boot = matrix(NA,nrow=nstudents,ncol=2)

 for(i in 1:nstudents)
 {
   ind = which(students_data$student_index == i)
   ns = length(ind)

   sample_data = students_data$brnch_ligh[ind]

   sample_p[i] = length(which(sample_data=="Yes"))/ns

   bootstrap_replicate_proportion = matrix(NA,nrow=nboot,ncol=1)
   for (j in 1:nboot)
   {
     ind2 = 1:ns
     bootstrap_ind = sample(ind2,ns,replace=T)
     bootstrap_sample_damage = sample_data[bootstrap_ind]
     bootstrap_replicate_proportion[j,1] = length(which(bootstrap_sample_damage=="Yes"))/ns
   }

   ci_p_boot[i,] = quantile(bootstrap_replicate_proportion,c(l,u))
 }
 
true_p = length(which(students_data$brnch_ligh=="Yes"))/length(students_data$brnch_ligh)

# False samples #
 false_samples_p = students
 for (i in 1:nstudents)
 {
   if( (ci_p_boot[i,1] > true_p) || (ci_p_boot[i,2] < true_p) ) 
 {false_samples_p[i]=1} else 
 {false_samples_p[i]=0}
 }
 
false_p_ind = which(false_samples_p == 1)

# Plot all the CIs; now show the false samples #
 stripchart(ci_p_boot[1,],type="l",col="green",main="CI on Proportion",xlab="Proportion",xlim=c(0,0.5),ylim=c(1,nstudents))
 stripchart(sample_p[1],add=T,col="green")

 for (i in 2:nstudents)
 {
   stripchart(ci_p_boot[i,],type="l",col="green",main="",add=T,at = i)
   stripchart(sample_p[i],add=T,col="green",at = i)
 }

 abline(v=true_p,lwd=3)

 # seq_along() skips the loop cleanly if no interval misses the truth
 for (i in seq_along(false_p_ind))
 {
   j = false_p_ind[i]
   stripchart(ci_p_boot[j,],type="l",col="red",lwd=3, main="", add = T, at = j)
   stripchart(sample_p[j],col="red", lwd=3, add=T, at = j)
 }

You can also check out their full code here.

As the class comes to an end, Samantha looks fulfilled. “Wow, that is quite a day,” she said. “We covered many things today, but more importantly the true meaning of the confidence intervals. There is a 95% probability of selecting a sample whose confidence interval will contain the true value.”

“Assuming our data is collected without any bias,” interjected Jenny. The students looked puzzled.

“This week, I want you to go out and collect the data again, but this time, divide yourselves into boroughs. Eight of you will belong to team Manhattan and will collect tree data from Manhattan only. You will still collect data randomly but will do it only from Manhattan. Eight of you are team Queens and you collect data randomly from Queens. The other three teams, team Bronx, team Brooklyn, and team Staten Island will do the same in their boroughs. We will again have 30 trees each, and a total of 1200 trees.”

So, the students go off on another mission. Some of them are still puzzled or rather have a “what difference does it make” face on them. Some of them are as usual excited to hug the trees again. We will all wait and see what they come up with. More fun with data.


Lesson 80 – Riding with confidence, in R: Week 1

Monday, February 11, 2019

In which Jenny shows her students how to compute and visualize confidence intervals in R, and sends them off on a task.

“Did you know that the symbol of the Department of Parks and Recreation is a cross between the leaf of the London plane and a maple leaf?” asked Jenny as she opened her high school data class. She volunteered to help 40 high school kids recently.

The obvious response from the kids is no response.

“It is,” she said. “More than 10% of the trees are London Planetrees. They grow up to 20 – 30 meters, with a trunk circumference of 3 meters.”

She projected a map on the big screen.

“These are the places I randomly selected to visit and record the diameter of the trunk of the London planetrees. You have this data in the handout. We will learn how to develop confidence intervals in R.”

For your analysis in R, the data file can be downloaded from here.

Let’s Start with the Mean

“Can someone tell me how we can get the confidence interval of the mean diameter?” she asked.

John raised his hand to answer. “Yes, John, go ahead,” said Jenny.

“The $100(1-\alpha)\%$ confidence interval for the true mean $\mu$ is $[\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}]$ . If the sample size n is very large, we can substitute the sample standard deviation s in place of the unknown $\sigma$ . However, for small sample sizes, the sample standard deviation s is itself subject to error. It may be far from the true value of $\sigma$ . So it is preferable to use t-distribution for the confidence interval of the mean since $\frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}$ tends to a t-distribution with (n-1) degrees of freedom.

So I would say that the $100(1-\alpha)\%$ confidence interval of the population mean is the interval $[\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}]$ .”

“Excellent. Let’s all compute the 95% confidence interval of the mean diameter of the trunk of the London planetrees from the sample I gave you. The 95% confidence interval of the true mean diameter ( $\mu$ ) is $\bar{x} - t_{0.025,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.025,(n-1)} \frac{s}{\sqrt{n}}$ .”

They all started calculating. Some of them pulled out t-tables to get the t-critical ( $t_{0.025,29}$ ), i.e., the quantile from the t-distribution corresponding to the upper tail probability of 0.025, and 29 degrees of freedom.

Jenny stopped them. “Don’t compute the interval by hand. Let’s do it in R.”

“Read the data file first. I am assuming that you all have the file in the same folder that you set the working directory to.”

# Read the data file -- Jenny's sample data #
jenny_sample_data = read.csv("jenny_data.csv",header=T)
nsamples = nrow(jenny_sample_data)

“Did you notice that the file has seven columns? The diameter of the tree trunk is recorded in the first column, tree_dbh. I am sure you figured out what the other columns are, except maybe the column that reads brnch_ligh that has Yes or No as the inputs. We will get to that in a bit.”

“Let’s compute the sample mean ( $\bar{x}$ ), sample variance ( $s^{2}$ ), and the sample standard deviation ( $s$ ).”

#statistics of the sample #
jenny_samplemu = mean(jenny_sample_data$tree_dbh)
jenny_samplevar = var(jenny_sample_data$tree_dbh)
jenny_samplesd = sd(jenny_sample_data$tree_dbh)

“What you were about to look up from the t-table can be obtained using this command in R.”

qt(0.975,df=29)

[1] 2.04523

“The t-critical value is 2.04523.”

“We can write a few lines to compute the confidence interval for any $\alpha$ level, and show them graphically.”

# Confidence interval of the mean using t-distribution #
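alpha = 0.95 # confidence level: 0.95 gives the 95% interval (this plays the role of 1 - alpha in the formula)
t = alpha + (1-alpha)/2 # cumulative probability passed to qt(); 0.975 for a 95% interval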
n = nsamples

cit_lb = jenny_samplemu - qt(t,df=(n-1))*((jenny_samplesd/sqrt(n))) 
cit_ub = jenny_samplemu + qt(t,df=(n-1))*((jenny_samplesd/sqrt(n)))

ci_t = c(cit_lb,cit_ub)

stripchart(ci_t,type="l",lwd=3,col="green",main="CI on Mean",xlab="Diameter (inches)")
stripchart(jenny_samplemu,add=T,lwd=3,col="green")

“The sample mean ( $\bar{x}$ ) is 22.433 inches and the intervals extend out to 19.67 inches on the left and 25.20 inches on the right.”

“There is a 95% probability of selecting a sample for which the confidence interval will contain the true value of $\mu$ .”

The kids looked happy seeing the image on their screen.
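As a rough cross-check, the interval can be reproduced from the summary statistics Jenny quoted (mean 22.433, standard deviation 7.40, n = 30). The helper name `t_interval` and the simulated vector `fake_dbh` are assumptions introduced here for illustration; they are not part of Jenny's script.

```r
# Sketch: t-based confidence interval from summary statistics.
# t_interval() is a hypothetical helper, not part of the lesson's script.
t_interval = function(xbar, s, n, conf_level = 0.95) {
  tcrit = qt(1 - (1 - conf_level)/2, df = n - 1)  # upper-tail critical value
  half_width = tcrit * s / sqrt(n)
  c(lower = xbar - half_width, upper = xbar + half_width)
}

# Summary statistics quoted in the lesson: mean 22.433, sd 7.40, n = 30
ci_from_summary = t_interval(22.433, 7.40, 30)
print(round(ci_from_summary, 2))

# With raw data, the built-in t.test() reports the same kind of interval.
# The data below are simulated purely for illustration.
set.seed(1)
fake_dbh = rnorm(30, mean = 22, sd = 7)
ci_manual = t_interval(mean(fake_dbh), sd(fake_dbh), length(fake_dbh))
ci_builtin = as.numeric(t.test(fake_dbh, conf.level = 0.95)$conf.int)
print(rbind(manual = ci_manual, builtin = ci_builtin))
```

Changing `conf_level` to 0.90 or 0.99 shows how the interval narrows or widens for the same sample.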

Next, let’s explore the Variance and the Standard Deviation

“Do you remember the equation for the confidence interval of the variance and the standard deviation?” Jenny asked.

Samantha wanted to answer.

“The $100(1-\alpha)\%$ confidence interval of the population variance $\sigma^{2}$ is the interval $[\frac{(n-1)s^{2}}{\chi_{u,n-1}}, \frac{(n-1)s^{2}}{\chi_{l,n-1}}]$ . We can get the square roots of the confidence limits to get the confidence interval on the true standard deviation,” she said.

“The interval $[\sqrt{\frac{(n-1)s^{2}}{\chi_{u,n-1}}}, \sqrt{\frac{(n-1)s^{2}}{\chi_{l,n-1}}}]$ is called the $100(1-\alpha)\%$ confidence interval of the population standard deviation $\sigma$ .”

Jenny was pleased. She has an outstanding cohort. They always come prepared.

“Yes. In our case, $\frac{(n-1)s^{2}}{\sigma^{2}}$ follows a Chi-square distribution with 29 degrees of freedom. The lower and upper critical values at the 95% confidence interval $\chi_{l,29}$ and $\chi_{u,29}$ can be obtained from the Chi-square table, or, as you guessed, can be computed in R using a simple command. Try these,” she said.

qchisq(0.975,df=29)

[1] 45.72229

qchisq(0.025,df=29)

[1] 16.04707

“Of course, we can automate this and make nice graphics for the intervals. Use this code.”

# Confidence interval of the variance and standard deviation using Chisq-distribution #
u = alpha + (1-alpha)/2
l = 1 - u

chi_low = ((n-1)*jenny_samplevar)/qchisq(u,df=(n-1)) 
chi_up = ((n-1)*jenny_samplevar)/qchisq(l,df=(n-1))

ci_var = c(chi_low,chi_up)
ci_sd = c(sqrt(chi_low),sqrt(chi_up))

stripchart(ci_var,type="l",lwd=3,col="green",main="CI on Variance",xlab="Diameter^2 (inches^2)")
stripchart(jenny_samplevar,add=T,lwd=3,col="green")

stripchart(ci_sd,type="l",lwd=3,col="green",main="CI on Standard Deviation",xlab="Diameter (inches)")
stripchart(jenny_samplesd,add=T,lwd=3,col="green")

“The sample variance is 54.74 $inches^{2}$ , with a lower bound of 34.72 $inches^{2}$ and an upper bound of 98.92 $inches^{2}$ . Notice how the intervals are not symmetric. Do you remember why?”

“The sample standard deviation is 7.40 inches. The 95% lower and upper bounds are 5.90, and 9.95 inches respectively.”

The kids spent a few minutes typing up the lines of code to get the intervals and make the graphics.
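For anyone who wants to verify the bounds without the data file, here is a minimal sketch that starts from the quoted sample variance (54.74) and n = 30. The helper name `var_interval` is hypothetical.

```r
# Sketch: chi-square confidence interval for the variance and standard deviation.
# var_interval() is a hypothetical helper name.
var_interval = function(s2, n, conf_level = 0.95) {
  u = 1 - (1 - conf_level)/2   # upper cumulative probability, e.g. 0.975
  l = 1 - u                    # lower cumulative probability, e.g. 0.025
  lower = (n - 1) * s2 / qchisq(u, df = n - 1)  # divide by the larger quantile
  upper = (n - 1) * s2 / qchisq(l, df = n - 1)  # divide by the smaller quantile
  c(lower = lower, upper = upper)
}

ci_var_check = var_interval(54.74, 30)
ci_sd_check = sqrt(ci_var_check)
print(round(ci_var_check, 2))
print(round(ci_sd_check, 2))

# Distance from the sample variance to each limit: the upper side is longer
print(c(below = 54.74 - ci_var_check[["lower"]],
        above = ci_var_check[["upper"]] - 54.74))
```

The last line makes the asymmetry explicit: the Chi-square quantiles are not symmetric around their center, so neither are the limits.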

Jenny continued the lesson.

“Now, look at the third column in the data file, brnch_ligh. When I measured the trunk circumference, I also noticed that, for some trees, there were branch problems caused by lights or wires. I recorded this as a “yes” or a “no” depending on whether or not there was a problem. Who would have thought that nice colorful lights on the trees could be damaging 🙁

Can one of you count through the sample and tell me what is the estimate of the proportion of trees ( $\hat{p}$ ) that are damaged.”

“0.1,” someone answered from the back.

“Great, now, that invisible person, can you tell me how we can compute the confidence interval of the true proportion?”

Christina it was. She answered, “the $100(1-\alpha)\%$ confidence interval for the true proportion p is $[\hat{p} - Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}]$ , assuming that the estimate $\hat{p}$ can be approximated by a normal distribution for a reasonably large sample size.”

“And, here is how you do that in R,” added Jenny, as she typed up these lines.

# Confidence interval of the proportion using z-distribution # 
jenny_sampleproportion = length(which(jenny_sample_data$brnch_ligh=="Yes"))/nsamples

z = alpha + (1-alpha)/2

p_low = jenny_sampleproportion - qnorm(z)*sqrt((jenny_sampleproportion*(1-jenny_sampleproportion))/(n))
p_up = jenny_sampleproportion + qnorm(z)*sqrt((jenny_sampleproportion*(1-jenny_sampleproportion))/(n))

ci_p = c(p_low,p_up)

stripchart(ci_p,type="l",lwd=3,col="green",main="CI on Proportion",xlab="Damage Proportion")
stripchart(jenny_sampleproportion,add=T,lwd=3,col="green")

“The sample estimated damage proportion ( $\hat{p}$ ) is 0.1 since there are 3 damaged trees in a sample of 30 trees. The lower bound and the upper bound of the 95% confidence interval are -0.007 and 0.207, symmetric around 0.1,” Jenny interpreted as she looked at the plot and then at the kids.

Some students were shaking their heads in disagreement. They did not like the fact that the confidence interval produced a negative value. What could have gone wrong?

Christina jumped up at once. “Didn’t we learn last week that when p is at the boundaries, the sampling distribution exhibits skew and is not symmetric? Then it is not appropriate to approximate it by a normal distribution even for large sample sizes. We should perhaps develop the sampling distribution and the confidence intervals of the damage proportion using the idea of the Bootstrap?”

“That is an astute observation, Christina,” said Jenny. “Yes, since we have an estimate close to 0, it is better to develop the bootstrap confidence intervals. That will avoid the errors induced by inappropriate assumptions,” she added.
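The negative lower limit can be reproduced from the counts alone (3 damaged trees out of 30). This is a small sketch; `wald_interval` is a hypothetical helper name, and the 15-out-of-30 case is an invented contrast, not something from Jenny's data.

```r
# Sketch: normal-approximation interval for a proportion.
# wald_interval() is a hypothetical helper name.
wald_interval = function(successes, n, conf_level = 0.95) {
  p_hat = successes / n
  zcrit = qnorm(1 - (1 - conf_level)/2)
  half_width = zcrit * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - half_width, upper = p_hat + half_width)
}

ci_wald = wald_interval(3, 30)   # 3 damaged trees out of 30
print(round(ci_wald, 3))

# The interval is symmetric by construction, so a p_hat near 0
# can push the lower limit below 0
print(ci_wald[["lower"]] < 0)

# Invented contrast: with p_hat = 0.5 the same formula stays inside [0, 1]
ci_wald_half = wald_interval(15, 30)
print(round(ci_wald_half, 3))
```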

Let’s do a Bootstrap in R

“The basis for the bootstrap is that the sample data of 30 trees can be used to approximate the probability distribution function of the population. By putting a probability of 1/n on each data point, we use the discrete empirical distribution $\hat{f}$ as an approximation of the population distribution f. It is easy enough to think of drawing numbers with replacement from these 30 numbers. Since each value is equally likely, the bootstrap sample will consist of numbers from the original data, some may appear more than one time, and some may not appear at all in a random sample.” Jenny explained the core concept of the bootstrap once again before showing them the code to do it.

# Bootstrap confidence intervals ## 
nboot = 1000
bootstrap_replicate_proportion = matrix(NA,nrow=nboot,ncol=1)

bootstrap_replicate_mean = matrix(NA,nrow=nboot,ncol=1)
bootstrap_replicate_var = matrix(NA,nrow=nboot,ncol=1)
bootstrap_replicate_sd = matrix(NA,nrow=nboot,ncol=1)

for (i in 1:nboot)
 {
   ind = 1:nsamples
   bootstrap_ind = sample(ind,nsamples,replace=T)

   bootstrap_sample_damage = jenny_sample_data$brnch_ligh[bootstrap_ind]
   
   bootstrap_replicate_proportion[i,1] = length(which(bootstrap_sample_damage=="Yes"))/nsamples

   bootstrap_sample_diam = jenny_sample_data$tree_dbh[bootstrap_ind]

   bootstrap_replicate_mean[i,1] = mean(bootstrap_sample_diam)
   bootstrap_replicate_var[i,1] = var(bootstrap_sample_diam)
   bootstrap_replicate_sd[i,1] = sd(bootstrap_sample_diam)
 }

“Here, I am drawing the bootstrap sample 1000 times, and, for each bootstrap sample, I am computing the proportion, the mean, the variance, and the standard deviation. So, in the end, I will have 1000 replicates of the damage proportion, the sample mean, sample variance and standard deviation — our sampling distributions.

Drawing a bootstrap in R is very simple. Just use the “sample” command.

ind = 1:nsamples

sample(ind,nsamples,replace=T)

I first create an indicator vector that has numbers 1 to 30. From this vector, I draw samples with replacement to get numbers from 1 to 30, some may appear more than once, some may not appear, depending on the sample. These are the trees that we selected as part of the bootstrap sample. From these trees, we take the diameter and whether or not it is damaged due to lights. The rest of the lines in the loop are just to compute the statistics from the bootstrap sample.” Jenny clearly explained her lines.
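To see the "some appear more than once, some not at all" behaviour directly, one can tabulate a single draw. This is a small sketch assuming a sample of 30 trees; the seed and the variable names `n_trees` and `times_picked` are arbitrary choices made here.

```r
# Sketch: one bootstrap draw of tree indices, assuming a sample of 30 trees.
set.seed(2019)                 # arbitrary seed so the draw is repeatable
n_trees = 30
ind = 1:n_trees
bootstrap_ind = sample(ind, n_trees, replace = TRUE)

# How often each tree was picked (0 = left out of this bootstrap sample)
times_picked = tabulate(bootstrap_ind, nbins = n_trees)
print(times_picked)
print(c(left_out = sum(times_picked == 0), repeated = sum(times_picked > 1)))
```

Rerunning with a different seed gives a different mix of repeated and left-out trees, which is exactly what produces the spread in the bootstrap replicates.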

“Now, let’s plot the distribution of the damage proportion derived from the bootstrap samples. Type these lines.”

# Distribution of the sample proportion #
hist(bootstrap_replicate_proportion,main="Bootstrap Replicates of the Damage Proportion",font=2,font.lab=2,xlab="Sample Proportion")

ci_p_boot = quantile(bootstrap_replicate_proportion,c(0.025,0.975))
abline(v=ci_p_boot,col="red")

“The 95% lower and the upper confidence limits are the 2.5th and the 97.5th percentiles of the sampling distribution. We can show them as red lines on the distribution.”

“Compare this to the previous confidence interval computed by assuming a normal distribution. We can see that the sampling distribution of the proportion is skewed and the interval is asymmetric. It is also not producing any negative values,” said Jenny as she showed a few more lines of tricks.

# compare with previous CI # 
stripchart(ci_p_boot,type="l",lwd=3,col="black",main="CI on Proportion",xlab="Damage Proportion")
stripchart(jenny_sampleproportion,add=T,lwd=3,col="black")
text(0.17,1.1, "Bootstrap Confidence Interval")

stripchart(ci_p,type="l",add=T,at=0.7,lwd=3,col="green",main="CI on Proportion",xlab="Damage Proportion")
stripchart(jenny_sampleproportion,at=0.7,add=T,lwd=3,col="green")
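The loop can also be written more compactly with `replicate()`. The sketch below uses a stand-in vector with 3 "Yes" and 27 "No" entries to match the reported proportion of 0.1; it is not Jenny's actual data column, and the seed is arbitrary.

```r
# Sketch: percentile bootstrap interval for the damage proportion.
# The vector below is a stand-in with 3 "Yes" and 27 "No" (p_hat = 0.1);
# it is not the actual brnch_ligh column.
set.seed(80)
damage = c(rep("Yes", 3), rep("No", 27))
nboot = 1000

boot_prop = replicate(nboot, {
  resample = sample(damage, length(damage), replace = TRUE)
  mean(resample == "Yes")          # proportion of damaged trees in this resample
})

ci_boot = quantile(boot_prop, c(0.025, 0.975))
print(ci_boot)
print(range(boot_prop))            # every replicate is a proportion in [0, 1]
```

Because each replicate is itself a proportion, the percentile limits cannot fall outside the range 0 to 1.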

“I have a few more things in here to compute the bootstrap confidence interval of the mean, variance, and the standard deviation. You can check them out for yourself.

Here is the full code.”

Jenny was preparing to end the class.

“Can we know the true mean, variance and the proportion?” Someone asked.

Jenny was happy to hear this. She had planned an assignment for them exactly for this.

“From this sample alone, we cannot say. We can, however, say that there is a 95% probability of selecting a sample for which the confidence interval will contain the true value of $\mu$ , $\sigma^{2}$ and $p$ . But, I want you to understand it by doing it yourself.

So, here’s the deal. This week, I want you to go out to the city and collect a sample of 30 trees. For each tree, I want you to measure the diameter and whether or not it is damaged. Each one of you will randomly select 30 locations in the city and bring back a sample. Make sure you more or less cover all the boroughs. In the end, we will have 40 different samples, each with 30 data points. We will have much fun analyzing that data, answering your questions, and getting a deeper understanding of the confidence intervals, in R.” Jenny ended her session by saying this.

So the kids left with excitement, to spend their spare time during the week collecting data for the trees in the City. You may encounter them this week. They might be hugging a tree or two. Don’t mistake them for some tree-hugging hippies. They are teaching you how to be confident with data.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 79 – Pull yourself up by your bootstraps

Over the past several lessons we have been learning about estimates, standard error of the estimates and confidence interval of the estimates.

We have been using the ‘sample’ to estimate the true value of the parameter. What we estimate from the sample will enable us to obtain the closest answer in some sense for the true unknown population parameter.

For example, the mean $\bar{x}$ , variance $s^{2}$ , or proportion $\hat{p}$ computed from the sample data are good guesses (estimates or estimators) of the mean $\mu$ , variance $\sigma^{2}$ and proportion $p$ of the population.

We also know that when we think of an estimate, we think of an interval or a probability distribution, instead of a point value. The truth may be in this interval if we have good representative samples, i.e., if the sample distribution is similar to the population distribution.

Assumptions or Approximations

In this inferential journey, to compute the standard error or to derive the confidence interval of the estimates, we have been making some assumptions and approximations that are deemed reasonable.

For example, it is reasonable to assume a normal distribution for the sample mean $\bar{x}$ .

$\bar{x} \sim N(\mu, \frac{\sigma^{2}}{n})$

The sample mean is an unbiased estimate of the true mean, so the expected value of the sample mean is equal to the truth. $E[\bar{x}]=\mu$ .

The standard deviation of the sample mean, or the standard error of the estimate is $\frac{\sigma}{\sqrt{n}}$ .

This visual should be handy.
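A quick simulation can make the same point numerically. In this sketch the population values `mu` and `sigma`, the sample size, and the number of simulated samples are arbitrary illustrative choices.

```r
# Sketch: simulate the sampling distribution of the mean.
# mu, sigma, n and nsim are arbitrary illustrative values.
set.seed(79)
mu = 10
sigma = 2
n = 30
nsim = 5000

# Draw nsim samples of size n and keep the mean of each
sample_means = replicate(nsim, mean(rnorm(n, mean = mu, sd = sigma)))

print(c(mean_of_means = mean(sample_means), true_mean = mu))
print(c(sd_of_means = sd(sample_means), sigma_over_sqrt_n = sigma / sqrt(n)))
```

The standard deviation of the simulated means should land close to $\frac{\sigma}{\sqrt{n}}$ , the standard error.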

—

To derive the confidence interval of the variance and standard deviation, we assumed that $\frac{(n-1)s^{2}}{\sigma^{2}}$ follows a Chi-square distribution with $(n-1)$ degrees of freedom.

$f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n-1}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n-1}{2}-1)!}$

Depending on the degrees of freedom, the distribution of $\frac{(n-1)s^{2}}{\sigma^{2}}$ looks like this.
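The Chi-square assumption can also be checked by simulation when the data are normal. This is a sketch with arbitrary values for `sigma`, `n` and `nsim`; it compares the simulated $\frac{(n-1)s^{2}}{\sigma^{2}}$ with the theoretical mean, variance and a few quantiles.

```r
# Sketch: (n-1)s^2/sigma^2 for normal samples, compared with Chi-square(n-1).
# sigma, n and nsim are arbitrary illustrative values.
set.seed(79)
sigma = 2
n = 30
nsim = 5000

scaled_var = replicate(nsim, (n - 1) * var(rnorm(n, sd = sigma)) / sigma^2)

# A Chi-square distribution with k degrees of freedom has mean k and variance 2k
print(c(sim_mean = mean(scaled_var), theory_mean = n - 1))
print(c(sim_var = var(scaled_var), theory_var = 2 * (n - 1)))

# Compare a few simulated quantiles with qchisq()
probs = c(0.025, 0.5, 0.975)
print(rbind(simulated = quantile(scaled_var, probs),
            theory = qchisq(probs, df = n - 1)))
```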

—

Most recently, we assumed that the estimate for proportion $\hat{p}$ can be approximated by a normal distribution.

$\hat{p} \sim N(p, \frac{p(1-p)}{n})$

We derived the confidence interval of the population proportion as $[\hat{p} - Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}]$ , based on this assumption.

Let’s examine this assumption once again.

In a sample of size n, proportion can be estimated as $\hat{p} = \frac{S_{n}}{n}$ , where $S_{n}$ is the number of favorable instances for the thing we are measuring. $\hat{p}$ can be approximated by a normal distribution since $S_{n}$ can be approximated by a normal distribution.

If we take Bernoulli random variables (0,1) for $X_{1}, X_{2}, X_{3}, …, X_{n}$ , $S_{n} = X_{1} + X_{2} + X_{3} + … + X_{n}$ , the number of successes, follows a Binomial distribution $f(x) = \frac{n!}{(n-x)!x!}p^{x}(1-p)^{n-x}$ .

For a large enough sample size n, the distribution function of $S_{n}$ can be well-approximated by the normal distribution.

Let’s do some experiments and see if this is reasonable.

Look at this animation. I am showing the Binomial probability distribution function for p = 0.5 while n increases from 10 to 100.

It looks like an approximation to a normal distribution is very reasonable.

Now, look at these two animations that show the Binomial probability function for p = 0.1 and p = 0.95, i.e., when p is near the boundaries.

Clearly, the distributions exhibit skew and are not symmetric. An approximation to normal distribution even for large values of n, i.e., a big sample, is not appropriate.
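One way to put a number on the skew is to compute the third standardized moment directly from the Binomial probabilities. This is a sketch; `binom_skew` is a hypothetical helper name and the sample sizes 10, 30 and 100 are illustrative choices.

```r
# Sketch: skewness of the Binomial distribution for different p and n.
# binom_skew() is a hypothetical helper; it works directly from the pmf.
binom_skew = function(n, p) {
  x = 0:n
  prob = dbinom(x, size = n, prob = p)
  mu = sum(x * prob)                      # mean of S_n
  sigma = sqrt(sum((x - mu)^2 * prob))    # standard deviation of S_n
  sum(((x - mu) / sigma)^3 * prob)        # third standardized moment
}

for (p in c(0.5, 0.1, 0.95)) {
  skews = sapply(c(10, 30, 100), binom_skew, p = p)
  cat("p =", p, " skewness at n = 10, 30, 100:", round(skews, 3), "\n")
}
```

For p = 0.5 the skewness is zero at every n. Near the boundaries it shrinks as n grows but is still clearly non-zero at these sample sizes, positive for p = 0.1 and negative for p = 0.95.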

How then can we be confident about the standard error and the confidence intervals?
For that matter, how can we derive the standard error or the intervals of a parameter whose limiting form is not known, or mathematically very complicated?

Enter the Bootstrap

Bradley Efron invented a computer-based method, the bootstrap, for estimating the standard error and the confidence intervals of parameters. There is no need for any theoretical calculations or underlying assumptions regarding the mathematical structure or the type of the distribution of the parameter. Instead, bootstrap samples from the data are used.

What is a bootstrap sample?

Suppose we have a data sample, $x_{1},x_{2}, … ,x_{n}$ . A bootstrap sample is a random sample of size n drawn with replacement from these n data points.

Imagine we have the following data: 28.4, 28.6, 27.5, 28.7, 26.7, 26.3 and 27.7 as the concentration of Ozone measured in seven locations in New York City.

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these seven data points is 1/7, we can randomly draw seven numbers from these seven values.

Imagine that you are playing the game of Bingo and these seven numbers are chips in your rolling machine. The only difference is, each time you get a number, record it and put it back in the roller until you draw seven numbers. Sample with replacement.

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (28.4, 28.6, 27.5, 28.7, 26.7, 26.3 and 27.7), some may appear more than one time, and some may not appear at all in a random sample.

I played this using the roller. Here is a bootstrap sample from the original numbers.

As you can see, 28.4 and 26.7 each appeared one time, 28.6, 27.5 and 28.7 did not appear, 26.3 appeared 2 times and 27.7 appeared 3 times.

Here are two more bootstrap samples like that.
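The Bingo roller can be mimicked with R's `sample` function. This is a minimal sketch using the seven Ozone values; the seed is arbitrary, so the draw will not match the samples shown above.

```r
# Sketch: drawing a bootstrap sample from the seven Ozone values.
ozone = c(28.4, 28.6, 27.5, 28.7, 26.7, 26.3, 27.7)

set.seed(7)   # arbitrary seed; a different seed gives a different sample
boot_sample = sample(ozone, length(ozone), replace = TRUE)
print(boot_sample)

# Count how many times each original value was drawn (0 = not drawn)
print(table(factor(boot_sample, levels = ozone)))
```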

Basis

The basis for bootstrap samples is that the sample data can be used to approximate the probability distribution function of the population. As you saw before, by putting a probability of 1/n on each data point, we use the discrete empirical distribution $\hat{f}$ as an approximation of the population distribution $f$ .

Take a very simple example of rolling two dice in the game of Monopoly. The true probability distribution $f$ of the count (die 1 + die 2) is based on the fact that there are 11 possible outcomes and the likelihood of each outcome is the ratio of the total ways we can get the number to 36. An outcome 2 can only be achieved if we get a (1,1). Hence the probability of getting 2 is 1/36.

If we roll the dice a hundred times and record the total count, we can use the observed frequencies of the outcomes from this sample data to approximate the actual probability distribution.

Look at these 100 counts as outcomes of rolling two dice 100 times.

The frequency plot shown in black lines closely approximates the true frequency shown in red.
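The same comparison can be sketched in a few lines of R. The 100 rolls below are simulated with an arbitrary seed, so the empirical frequencies will differ from the ones in the plot.

```r
# Sketch: true vs. empirical distribution of the sum of two dice.
faces = 1:6
all_sums = outer(faces, faces, "+")        # 36 equally likely (die 1, die 2) pairs
true_f = as.numeric(table(all_sums)) / 36  # true distribution f over the sums 2..12

set.seed(100)                              # arbitrary seed
rolls = sample(faces, 100, replace = TRUE) + sample(faces, 100, replace = TRUE)
emp_f = as.numeric(table(factor(rolls, levels = 2:12))) / 100  # empirical f-hat

comparison = rbind(true = true_f, empirical = emp_f)
colnames(comparison) = 2:12
print(round(comparison, 2))
```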

The empirical distribution $\hat{f}$ is the proportion of times each value in the data sample $x_{1}, x_{2}, …, x_{n}$ occurs. The observed frequency $\hat{f}$ is a sufficient statistic for the true distribution $f$ with an assumption that the data have been generated by random sampling from the true distribution $f$ . All the information about the true distribution $f$ that the data carry is contained in the empirical distribution $\hat{f}$ .

An unknown population distribution $f$ has produced the observed data $x_{1}, x_{2}, …, x_{n}$ . We can use the observed data to approximate $f$ by its empirical distribution $\hat{f}$ and then use the empirical distribution to generate bootstrap replicates of the data. Since $f$ generated x, $\hat{f}$ can be used to generate the bootstrap samples.

$f$ has given $x_{1}, x_{2}, …, x_{n}$ , which can be used to estimate $\hat{f}$ , which will be used to generate a bootstrap sample.

This is the basis.

Bootstrap Replicates

Once we generate enough bootstrap samples, we can use the estimators (formula to estimate the parameter) on these samples. For example, if we want to represent the true population mean $\mu$ , we can apply the equation for the sample mean $\bar{x} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}$ on each of these bootstrap samples to generate bootstrap replicates of the mean.

If we want to represent the true population variance using an interval, we can apply $s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}$ on these bootstrap samples to generate replicates of the variance.

Likewise, if we want an interval for the true proportion, we apply $\hat{p} = \frac{S_{n}}{n}$ on the bootstrap samples to get replicates of the proportion.

Each bootstrap sample will produce a replicate of the parameter. Efron prescribes anywhere between 25 and 200 bootstrap replications for a good estimate of the standard error; percentile confidence limits generally need many more. As the number of bootstrap replicates approaches infinity, the standard error as measured by the standard deviation of these replicates will approach the true standard error.

Let’s look at the bootstrap replicates of the sample mean and the sample standard deviation for the Ozone data for which we used the Bingo machine to generate the bootstrap samples. In a later coding lesson, we will learn how to do it using simple functions in RStudio.

For bootstrap sample 1, the sample mean is 27.26 and the sample standard deviation is 0.82.

For bootstrap sample 2, the sample mean is 27.61 and the sample standard deviation is 0.708.

I do this 200 times. Here is what the distribution of the sample mean ( $\bar{x}$ ) obtained from 200 bootstrap replicates looks like.

Here is the distribution of the sample standard deviation.

Like this, we can develop the intervals of any type of parameters by applying the relevant estimator on the bootstrap sample.
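Here is a minimal sketch of the replication step for the Ozone data, with 200 bootstrap samples as above. The seed is arbitrary, so the replicates will not match the Bingo-machine draws.

```r
# Sketch: 200 bootstrap replicates of the mean and standard deviation.
ozone = c(28.4, 28.6, 27.5, 28.7, 26.7, 26.3, 27.7)

set.seed(200)   # arbitrary seed
nboot = 200
boot_mean = numeric(nboot)
boot_sd = numeric(nboot)

for (i in 1:nboot) {
  boot_sample = sample(ozone, length(ozone), replace = TRUE)
  boot_mean[i] = mean(boot_sample)   # replicate of the sample mean
  boot_sd[i] = sd(boot_sample)       # replicate of the sample standard deviation
}

# Bootstrap standard error = standard deviation of the replicates
print(c(se_mean = sd(boot_mean), se_sd = sd(boot_sd)))
print(summary(boot_mean))
```

Swapping `mean` or `sd` for another estimator gives replicates of that estimator with no other change to the loop.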

Bootstrap confidence interval

Finally, we can use the percentiles of the bootstrap replicates as the confidence limits of the parameter.

Take a 90% confidence interval for instance. From the bootstrap replicates, we can say that there is a 90% probability that the true mean $\mu$ will be between $\bar{x}_{[5]}$ and $\bar{x}_{[95]}$ , the 5th and the 95th percentiles of the bootstrap replicates.

$P(\bar{x}_{[5]} \le \mu \le \bar{x}_{[95]}) = 0.90$

For the Ozone example, the 90% confidence interval of the true mean is [27.114, 28.286] and the 90% confidence interval of the true standard deviation is [0.531, 1.087].

Look at these plots.

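We can define a $100(1-\alpha)\%$ bootstrap confidence interval for the true mean $\mu$ as $[l, u] = [\bar{x}_{[\alpha/2]}, \bar{x}_{[1-\alpha/2]}]$ , the $100(\alpha/2)$th and $100(1-\alpha/2)$th percentiles of the bootstrap replicates. For the 90% interval above, $\alpha = 0.10$ gives the 5th and the 95th percentiles.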
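The percentile interval can be wrapped in a small function that works for any estimator. In line with the 90% example, which uses the 5th and the 95th percentiles, this sketch takes the $\alpha/2$ and $1-\alpha/2$ quantiles of the replicates. The name `boot_percentile_ci` and its arguments are hypothetical, and the seed is arbitrary, so the limits will be close to but not identical with the ones quoted above.

```r
# Sketch: percentile bootstrap confidence interval for any estimator.
# boot_percentile_ci() is a hypothetical helper, not a built-in R function.
boot_percentile_ci = function(x, estimator, alpha = 0.10, nboot = 200) {
  replicates = replicate(nboot, estimator(sample(x, length(x), replace = TRUE)))
  # lower and upper limits are the alpha/2 and 1 - alpha/2 quantiles
  quantile(replicates, c(alpha/2, 1 - alpha/2), names = FALSE)
}

ozone = c(28.4, 28.6, 27.5, 28.7, 26.7, 26.3, 27.7)

set.seed(90)   # arbitrary seed
ci_mean_90 = boot_percentile_ci(ozone, mean)   # 90% interval for the mean
ci_sd_90 = boot_percentile_ci(ozone, sd)       # 90% interval for the standard deviation
print(ci_mean_90)
print(ci_sd_90)
```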

There are many more uses of the bootstrap, and we will have much fun with it in due course.

But for today, let’s end with Efron’s statement on what other names were suggested to him for his method.

“I also wish to thank the many friends who suggested names more colorful than Bootstrap, including Swiss Army Knife, Meat Axe, Swan-Dive, Jack-Rabbit, and my personal favorite, the Shotgun, which, to paraphrase Tukey, “can blow the head off any problem if the statistician can stand the resulting mess.””

