Lesson 99 – The Two-Sample Hypothesis Tests in R

Over the past seven lessons, we equipped ourselves with the necessary theory of the two-sample hypothesis tests.

Lessons 92 and 93 were about the hypothesis test on the difference in proportions. In Lesson 92, we learned Fisher’s Exact Test to verify the difference in proportions. In Lesson 93, we did the same using the normal distribution as a limiting distribution when the sample sizes are large.

Lessons 94, 95, and 96 were about the hypothesis test on the difference in means. In Lesson 94, we learned that under the proposition that the population variances of the two random variables are equal, the test-statistic, which uses the pooled variance, follows a T-distribution with n_{1}+n_{2}-2 degrees of freedom. In Lesson 95, we learned that Welch’s t-Test could be used when we cannot make the assumption of equal population variances. In Lesson 96, we learned Wilcoxon’s Rank-sum Test, which uses a ranking approach to assess the significance of the difference in means. This method is data-driven, needs no assumptions on the limiting distribution, and works well for small sample sizes.

Lesson 97 was about the hypothesis test on the equality of variances. Here we got a sneak peek into a new distribution called the F-distribution, which is the limiting distribution of the ratio of two Chi-square distributions, each divided by its respective degrees of freedom. The test-statistic is the ratio of the sample variances, which we can verify against an F-distribution with n_{1}-1 numerator degrees of freedom and n_{2}-1 denominator degrees of freedom.

All these hypothesis tests can also be done using the bootstrap approach, which we learned in Lesson 98. It uses the data at hand to generate the null distribution of any desired statistic as long as it is computable from the data. It is very flexible. There is no need to make any assumptions on the data’s distributional nature or the limiting distribution for the test-statistic.

It is now time to put all of this learning to practice.


The City of New York Department of Sanitation (DSNY) archives data on the monthly tonnage of trash collected from NYC residences and institutions.

To help us with today’s workout, I dug up our gold mine, the “Open Data for All New Yorkers” page, and found an interesting dataset on the monthly tonnage of trash collected from NYC residences and institutions.

Here is a preview.

The data includes details on when (year and month) DSNY collected the trash, from where (borough and community district), and how much (tonnage of refuse, source-separated recyclable paper, and source-separated metal, glass, plastic, and beverage cartons). They also have other seasonal data such as the tonnage of leaves collected in November and December and tons of Christmas trees collected from curbside in January.

For our workout today, we can use this file named “DSNY_Monthly_Tonnage_Data.csv.” It is a slightly modified version of the file you will find from the open data page.

My interest is in Manhattan’s community district 9 — Morningside Heights, Manhattanville, and Hamilton Heights.

I want to compare the tonnage of trash collected in winter to that in summer. Let’s say we compare the data in Manhattan’s community district 9 for February and August to keep it simple.

Are you ready?

The Basic Steps

Step 1: Get the data

You can get the data from here. The file is named DSNY_Monthly_Tonnage_Data.csv. It is a comma-separated values file which we can conveniently read into the R workspace. 

Step 2: Create a new folder on your computer

And call this folder “lesson99.”
Make sure that the data file “DSNY_Monthly_Tonnage_Data.csv” is in this folder.

Step 3: Create a new code in R

Create a new code for this lesson. “File >> New >> R script”. Save the code in the same folder “lesson99” using the “save” button or by using “Ctrl+S.” Use .R as the extension — “lesson99_code.R”

Step 4: Choose your working directory

“lesson99” is the folder where we stored the code and the input data file. Use setwd("path") to set the path to this folder. Execute the line by clicking the “Run” button on the top right. 

setwd("path to your folder")

Or, if you opened this “lesson99_code.R” file from within your “lesson99” folder, you can also use the following line to set your working directory.

setwd(getwd())

getwd() returns the current working directory. If you opened the file from within your “lesson99” folder, getwd() will already point to the working directory you want to set.

Step 5: Read the data into the R workspace

Execute the following command to read your input .csv file.

# Read the data file
nyc_trash_data = read.csv("DSNY_Monthly_Tonnage_Data.csv",header=T)

Since the input file has a header for each column, we should have header=T in the command when we read the file to ensure that all the rows are correctly read.

Your RStudio interface will look like this when you execute this line and open the file from the Global Environment.

Step 6: Extract a subset of the data

As I said before, for today’s workout, let’s use the data from Manhattan’s community district 9 for February and August. Execute the following lines to extract this subset of the data.

# Extract February refuse data for Manhattan's community district 9 
feb_index = which((nyc_trash_data$BOROUGH=="Manhattan") & (nyc_trash_data$COMMUNITYDISTRICT==9) & (nyc_trash_data$MONTH==2))

feb_refuse_data = nyc_trash_data$REFUSETONSCOLLECTED[feb_index]

# Extract August refuse data for Manhattan's community district 9
aug_index = which((nyc_trash_data$BOROUGH=="Manhattan") & (nyc_trash_data$COMMUNITYDISTRICT==9) & (nyc_trash_data$MONTH==8))

aug_refuse_data = nyc_trash_data$REFUSETONSCOLLECTED[aug_index]

In the first line, we look up the row indices of the data corresponding to Borough = Manhattan, Community District = 9, and Month = 2.

In the second line, we extract the data on refuse tons collected for these rows.

The next two lines repeat this process for August data.

Sample 1, February data for Manhattan’s community district 9 looks like this:
2953.1, 2065.8, 2668.2, 2955.4, 2799.4, 2346.7, 2359.6, 2189.4, 2766.1, 3050.7, 2175.1, 2104.0, 2853.4, 3293.2, 2579.1, 1979.6, 2749.0, 2871.9, 2612.5, 455.9, 1951.8, 2559.7, 2753.8, 2691.0, 2953.1, 3204.7, 2211.6, 2857.9, 2872.2, 1956.4, 1991.3

Sample 2, August data for Manhattan’s community district 9 looks like this:
2419.1, 2343.3, 3032.5, 3056.0, 2800.7, 2699.9, 3322.3, 3674.0, 3112.2, 3345.1, 3194.0, 2709.5, 3662.6, 3282.9, 2577.9, 3179.1, 2460.9, 2377.1, 3077.6, 3332.8, 2233.5, 2722.9, 3087.8, 3353.2, 2992.9, 3212.3, 2590.1, 2978.8, 2486.8, 2348.2

The values are in tons per month, and there are 31 values for sample 1 and 30 values for sample 2. Hence, n_{1}=31 and n_{2}=30.

The Questions for Hypotheses

Let’s ask the following questions.

  1. Is the mean of the February refuse collected from Manhattan’s community district 9 different from August?
  2. Is the variance of the February refuse collected from Manhattan’s community district 9 different from August?
  3. Suppose we set 2500 tons as a benchmark for low trash situations. Is the proportion of February refuse lower than 2500 tons different from the proportion of August? In other words, is the low trash situation in February different from that in August?

We can visualize their distributions. Execute the following lines to create a neat graphic of the boxplots of sample 1 and sample 2.

# Visualizing the distributions of the two samples
boxplot(cbind(feb_refuse_data,aug_refuse_data),horizontal=T, main="Refuse from Manhattan's Community District 9")
text(1500,1,"February Tonnage",font=2)
text(1500,2,"August Tonnage",font=2)

p_threshold = 2500 # tons of refuse
abline(v=p_threshold,lty=2,col="maroon")

You should now see your plot space changing to this.

Next, you can execute the following lines to compute the preliminaries required for our tests.

# Preliminaries
  # sample sizes
  n1 = length(feb_refuse_data)
  n2 = length(aug_refuse_data)
  
  # sample means
  x1bar = mean(feb_refuse_data)
  x2bar = mean(aug_refuse_data)

  # sample variances
  x1var = var(feb_refuse_data)
  x2var = var(aug_refuse_data)

  # sample proportions
  p1 = length(which(feb_refuse_data < p_threshold))/n1
  p2 = length(which(aug_refuse_data < p_threshold))/n2

Hypothesis Test on the Difference in Means

We learned four methods to verify the hypothesis on the difference in means:

  • The two-sample t-Test
  • Welch’s two-sample t-Test
  • Wilcoxon’s Rank-sum Test
  • The Bootstrap Test 

Let’s start with the two-sample t-Test. Assuming a two-sided alternative, the null and alternate hypotheses are:

H_{0}: \mu_{1} - \mu_{2}=0

H_{A}: \mu_{1} - \mu_{2} \neq 0

The alternate hypothesis indicates that substantial positive or negative differences will allow the rejection of the null hypothesis.

We know that for the two-sample t-Test, we assume that the population variances are equal, and the test-statistic is t_{0}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{s^{2}(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}, where s^{2}=(\frac{n_{1}-1}{n_{1}+n_{2}-2})s_{1}^{2}+(\frac{n_{2}-1}{n_{1}+n_{2}-2})s_{2}^{2} is the pooled variance. t_{0} follows a T-distribution with n_{1}+n_{2}-2 degrees of freedom. We can compute the test-statistic and check how likely it is to see such a value in a T-distribution (null distribution) with so many degrees of freedom.

Execute the following lines in R.

pooled_var = ((n1-1)/(n1+n2-2))*x1var + ((n2-1)/(n1+n2-2))*x2var

t0 = (x1bar-x2bar)/sqrt(pooled_var*((1/n1)+(1/n2)))

df = n1+n2-2

pval = pt(t0,df=df)

print(pooled_var)
print(df)
print(t0)
print(pval)

We first computed the pooled variance s^{2} and, using it, the test-statistic t_{0}. Then, based on the degrees of freedom, we compute the p-value.

The p-value is 0.000736. For a two-sided test, we compare this with \frac{\alpha}{2}. If we opt for a rejection rate \alpha of 5%, then, since our p-value of 0.000736 is less than 0.025, we reject the null hypothesis that the means are equal.

All these steps can be implemented using a one-line command in R.

Try this:

t.test(feb_refuse_data,aug_refuse_data,alternative="two.sided",var.equal = TRUE)

You will get the following prompt in the console.

Two Sample t-test
data: feb_refuse_data and aug_refuse_data
t = -3.3366, df = 59, p-value = 0.001472
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -658.2825 -164.7239
sample estimates:
mean of x mean of y 
 2510.697  2922.200

As inputs of the function, we need to provide the data (sample 1 and sample 2). We should also indicate that we evaluate a two-sided alternate hypothesis and that the population variances are equal. This will prompt the function to execute the standard two-sample t-test.

While the function provides additional information, all we need for our test are the pointers from the first line. t = -3.3366 and df = 59 are the test-statistic and the degrees of freedom that we computed earlier. The p-value looks different from what we computed. It is two times the value we estimated. The function doubles the one-sided p-value so that we can compare it to \alpha instead of \frac{\alpha}{2}. Either way, we know that we can reject the null hypothesis.
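If you want to reproduce the p-value reported by t.test() from the quantities we already computed, here is a quick sketch. It simply doubles the one-sided tail probability of the pooled-variance test-statistic.

# Two-sided p-value from the hand-computed test-statistic
2*pt(-abs(t0),df=n1+n2-2)   # ~0.001472, matching the t.test() output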


Next, let’s implement Welch’s t-Test. When the population variances are not equal, i.e., when \sigma_{1}^{2} \neq \sigma_{2}^{2}, the test-statistic is t_{0}^{*}=\frac{\bar{x_{1}}-\bar{x_{2}}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}, and it follows an approximate T-distribution with f degrees of freedom which can be estimated using the Satterthwaite’s equation: f = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2}/n_{1})^{2}}{(n_{1} - 1)}+\frac{(s_{2}^{2}/n_{2})^{2}}{(n_{2}-1)}}

Execute the following lines.

f = (((x1var/n1)+(x2var/n2))^2)/(((x1var/n1)^2/(n1-1))+((x2var/n2)^2/(n2-1)))

t0 = (x1bar-x2bar)/sqrt((x1var/n1)+(x2var/n2))

pval = pt(t0,df=f)

print(f)
print(t0)
print(pval)

The degrees of freedom from the Satterthwaite approximation is about 55.3, the test-statistic is -3.3528, and the p-value is 0.000724. Comparing this with 0.025, we can reject the null hypothesis.

Of course, all this can be done using one line where you indicate that the variances are not equal:

t.test(feb_refuse_data,aug_refuse_data,alternative="two.sided",var.equal = FALSE)

When you execute the above line, you will see the following on your console.

Welch Two Sample t-test
data: feb_refuse_data and aug_refuse_data
t = -3.3528, df = 55.338, p-value = 0.001448
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-657.4360 -165.5705
sample estimates:
mean of x mean of y
2510.697 2922.200

Did you notice that it runs Welch’s two-sample t-Test when we indicate that the variances are not equal? Like before, we can compare the p-value to 0.05 and reject the null hypothesis.

The mean of the February refuse collected from Manhattan’s community district 9 is different from August. Since the test-statistic is negative and the p-value is lower than the chosen level of rejection rate, we can conclude that the mean of February refuse tonnage is significantly lower than the mean of August refuse tonnage beyond a reasonable doubt.


How about Wilcoxon’s Rank-sum Test? The null hypothesis is based on the proposition that if x_{1} and x_{2} are samples from the same distribution, there will be an equal likelihood of one exceeding the other.

H_{0}: P(x_{1} > x_{2}) = 0.5

H_{A}: P(x_{1} > x_{2}) \neq 0.5

We know that this ranking and coming up with the rank-sum table is tedious for larger sample sizes. For larger sample sizes, the null distribution of the test-statistic W, which is the sum of the ranks associated with the variable of smaller sample size in the pooled and ordered data, tends to a normal distribution. This allows the calculation of the p-value by comparing W to a normal distribution. We can use the following command in R to implement this.

wilcox.test(feb_refuse_data,aug_refuse_data,alternative = "two.sided")

You will get the following message in the console when you execute the above line.

Wilcoxon rank sum test with continuity correction
data: feb_refuse_data and aug_refuse_data
W = 248, p-value = 0.001788
alternative hypothesis: true location shift is not equal to 0

Since the p-value of 0.001788 is lower than 0.05, the chosen level of rejection, we reject the null hypothesis that x_{1} and x_{2} are samples from the same distribution.
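If you are curious how the reported W relates to the rank sums and to the normal approximation mentioned above, here is a minimal sketch. It ignores the tie and continuity corrections that wilcox.test() applies, so the approximate p-value will be close to, but not exactly, 0.001788.

# W as reported by wilcox.test(): rank sum of sample 1 minus its minimum possible value
pooled_ranks = rank(c(feb_refuse_data,aug_refuse_data))
W = sum(pooled_ranks[1:n1]) - n1*(n1+1)/2
print(W)   # 248

# Normal approximation for the null distribution of W
mu_W = n1*n2/2
sd_W = sqrt(n1*n2*(n1+n2+1)/12)
2*pnorm(-abs((W-mu_W)/sd_W))   # approximate two-sided p-value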


Finally, let’s take a look at the bootstrap method. The null hypothesis is that there is no difference between the means. 

H_{0}: P(\bar{x}_{Feb}>\bar{x}_{Aug}) = 0.5

H_{A}: P(\bar{x}_{Feb}>\bar{x}_{Aug}) \neq 0.5

We first create a bootstrap replicate of X and Y by randomly drawing with replacement n_{1} values from X and n_{2} values from Y.

For each bootstrap replicate from X and Y, we compute the statistics \bar{x}_{Feb} and \bar{x}_{Aug} and check whether \bar{x}_{Feb}>\bar{x}_{Aug}. If yes, we register S_{i}=1. If not, we register S_{i}=0.

We repeat this process of creating bootstrap replicates of X and Y, computing the statistics \bar{x}_{Feb} and \bar{x}_{Aug}, verifying whether \bar{x}_{Feb}>\bar{x}_{Aug}, and registering S_{i} \in \{0,1\} a large number of times, say N=10,000.

The proportion of times S_{i} = 1 in a set of N bootstrap-replicated statistics is the p-value.

Execute the following lines in R to implement these steps.

#4. Bootstrap
  N = 10000
  null_mean = matrix(0,nrow=N,ncol=1)
  null_mean_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
     null_mean_ratio[i] = mean(xboot)/mean(yboot) 
     if(mean(xboot)>mean(yboot)){null_mean[i]=1} 
   }
 
 pvalue_mean = sum(null_mean)/N
 hist(null_mean_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="Xbar/Ybar",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(0.95,1000,paste("p-value=",pvalue_mean),col="red")

Your RStudio interface will then look like this:

A vertical bar will show up at a ratio of 1 to indicate that the area beyond this value is the proportion of times S_{i} = 1 in 10,000 bootstrap-replicated means.

The p-value is close to 0, well below 0.025, so we reject the null hypothesis. In more than 97.5% of the bootstrap replicates, the mean of February tonnage is less than the mean of August tonnage, which is sufficient evidence that the means are not equal.

In summary, while the standard t-Test, Welch's t-Test, and Wilcoxon's Rank-sum Test can be implemented in R using single-line commands,

t.test(x1,x2,alternative="two.sided",var.equal = TRUE)
t.test(x1,x2,alternative="two.sided",var.equal = FALSE)
wilcox.test(x1,x2,alternative = "two.sided")

the bootstrap test needs a few lines of code.

Hypothesis Test on the Equality of Variances

We learned two methods to verify the hypothesis on the equality of variances: using the F-distribution and using the bootstrap method.

For the F-distribution method, the null and the alternate hypothesis are 

H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2}

H_{A}:\sigma_{1}^{2} \neq \sigma_{2}^{2}

The test-statistic is the ratio of the sample variances, f_{0}=\frac{s_{1}^{2}}{s_{2}^{2}}

We evaluate the hypothesis based on where this test-statistic lies in the null distribution or how likely it is to find a value as large as this test-statistic f_{0} in the null distribution.

Execute the following lines.

# 1. F-Test
  f0 = x1var/x2var
  
  df_numerator = n1-1
  
  df_denominator = n2-1
  
  pval = 1-pf(f0,df1=df_numerator,df2=df_denominator)
  
  print(f0)
  print(df_numerator)
  print(df_denominator)
  print(pval)

The test-statistic, i.e., the ratio of the sample variances, is 1.814.

The p-value, which, in this case, is the probability of finding a value greater than the test-statistic (P(F > f_{0})), is 0.0562. When compared to a one-sided rejection rate (\frac{\alpha}{2}) of 0.025, we cannot reject the null hypothesis.

We do have a one-liner command.

var.test(feb_refuse_data,aug_refuse_data,alternative = "two.sided")

You will find the following message when you execute the var.test() command.

F test to compare two variances
data: feb_refuse_data and aug_refuse_data
F = 1.814, num df = 30, denom df = 29, p-value = 0.1125
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.8669621 3.7778616
sample estimates:
ratio of variances
1.813959

As usual, with the standard R function, we need to compare the p-value (which is two times what we computed based on the right-tail probability) with \alpha=0.05. Since 0.1125 > 0.05, we cannot reject the null hypothesis that the variances are equal.
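To see where the reported p-value of 0.1125 comes from, here is a quick check. Because f_{0} > 1 here, doubling the right-tail probability we computed gives the value var.test() reports.

# Two-sided p-value as reported by var.test()
2*(1-pf(f0,df1=n1-1,df2=n2-1))   # ~0.1125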


Can we come to the same conclusion using the bootstrap test? Let’s try. Execute the following lines in R.

# 2. Bootstrap
 N = 10000
   null_var = matrix(0,nrow=N,ncol=1)
   null_var_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
     null_var_ratio[i] = var(xboot)/var(yboot) 
     if(var(xboot)>var(yboot)){null_var[i]=1} 
   }
 
 pvalue_var = sum(null_var)/N
 hist(null_var_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="XVar/YVar",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(2,500,paste("p-value=",pvalue_var),col="red")

Your RStudio interface should look like this if you correctly coded up and executed the lines. The code provides a way to visualize the null distribution.

The p-value is 0.7945. 7945 out of the 10,000 bootstrap replicates had s^{2}_{Feb}>s^{2}_{Aug}. For a 5% rate of error (\alpha=5\%), we cannot reject the null hypothesis since the p-value lies between 0.025 and 0.975. The evidence that the February tonnage variance is less than the August tonnage variance (20.55% of the time) is not strong enough to reject equality.

In summary, use var.test(x1,x2,alternative = "two.sided") if you want to verify based on F-distribution, or code up a few lines if you want to verify using the bootstrap method.

Hypothesis Test on the Difference in Proportions

For the hypothesis test on the difference in proportions, we can employ Fisher’s Exact Test, use the normal approximation under the large-sample assumption, or the bootstrap method.

For Fisher’s Exact Test, the test-statistic follows a hypergeometric distribution when H_{0} is true. We could assume that the number of successes is fixed at t=x_{1}+x_{2}, and, for a fixed value of t, we reject H_{0}:p_{1}=p_{2} for the alternate hypothesis H_{A}:p_{1}>p_{2} if there are more successes in random variable X_{1} compared to X_{2}.

The p-value can be derived under the assumption that the number of successes X=k in the first sample X_{1} has a hypergeometric distribution when H_{0} is true and conditional on a total number of t successes that can come from any of the two random variables X_{1} and X_{2}.

P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}

Execute the following lines to implement this process, plot the null distribution and compute the p-value.

#Fisher's Exact Test
  x1 = length(which(feb_refuse_data < p_threshold))
  p1 = x1/n1
  
  x2 = length(which(aug_refuse_data < p_threshold))
  p2 = x2/n2
  
  N = n1+n2
  t = x1+x2
  
  k = seq(from=0,to=n1,by=1)
  p = k
  for(i in 1:length(k)) 
   {
    p[i] = (choose(t,k[i])*choose((N-t),(n1-k[i])))/choose(N,n1)
   }

  plot(k,p,type="h",xlab="Number of successes in X1",ylab="P(X=k)",font=2,font.lab=2)
  points(k,p,type="o",lty=2,col="grey50")
  points(k[13:length(k)],p[13:length(k)],type="o",col="red",lwd=2)
  points(k[13:length(k)],p[13:length(k)],type="h",col="red",lwd=2)
  pvalue = sum(p[13:length(k)])
  print(pvalue)

Your RStudio should look like this once you run these lines.

The p-value is 0.1539. Since it is greater than the rejection rate, we cannot reject the null hypothesis that the proportions are equal.
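Before moving to the built-in test function, here is a quick cross-check using R's hypergeometric functions (a small sketch; dhyper() and phyper() are parameterized by the number of successes t, the number of failures N - t, and the number of draws n1).

# P(X = k) from the built-in hypergeometric density; same values as the loop above
p_check = dhyper(k,t,(N-t),n1)

# P(X >= 12), the p-value
1 - phyper((x1-1),t,(N-t),n1)   # ~0.1539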

We can also leverage an in-built R function that does the same. Try this.

fisher_data = cbind(c(x1,x2),c((n1-x1),(n2-x2)))
fisher.test(fisher_data,alternative="greater")

In the first line, we create a 2×2 contingency table to indicate the number of successes and failures in each sample. Sample 1 has 12 successes, i.e., 12 times the tonnage in February was less than 2500 tons. Sample 2 has seven successes. So the contingency table looks like this:

    [,1]  [,2]
[1,] 12    19
[2,]  7    23

Then, the fisher.test() function implements the hypergeometric distribution calculations.

You will find the following prompt in your console.

Fisher’s Exact Test for Count Data
data: fisher_data
p-value = 0.1539
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
0.712854 Inf
sample estimates:
odds ratio
2.050285

We get the same p-value as before — we cannot reject the null hypothesis.


Under the normal approximation for large sample sizes, the test-statistic z = \frac{\hat{p_{1}}-\hat{p_{2}}}{\sqrt{p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})}} \sim N(0,1). p is the pooled proportion, which can be computed as p = \frac{x_{1}+x_{2}}{n_{1}+n_{2}}.

We reject the null hypothesis when the p-value P(Z \ge z) is less than the rate of rejection \alpha.

Execute the following lines to set this up in R.

# Z-approximation
  p = (x1+x2)/(n1+n2)
  z = (p1-p2)/sqrt(p*(1-p)*((1/n1)+(1/n2)))
  pval = 1-pnorm(z)

The p-value is 0.0974. Since it is greater than 0.05, we cannot reject the null hypothesis.
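R also has a built-in function for comparing two proportions. With the continuity correction turned off, prop.test() uses the same pooled-proportion statistic, so it should reproduce this p-value up to rounding (a sketch, assuming the counts x1 and x2 from the Fisher's Exact Test section are still in the workspace).

# Built-in test on the difference in proportions (no continuity correction)
prop.test(c(x1,x2),c(n1,n2),alternative="greater",correct=FALSE)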


Finally, we can use the bootstrap method as follows.

# Bootstrap
 N = 10000
 null_prop = matrix(0,nrow=N,ncol=1)
 null_prop_ratio = matrix(0,nrow=N,ncol=1)
 
 for(i in 1:N)
   {
     xboot = sample(feb_refuse_data,replace=T)
     yboot = sample(aug_refuse_data,replace=T)
 
     p1boot = length(which(xboot < p_threshold))/n1 
     p2boot = length(which(yboot < p_threshold))/n2 

     null_prop_ratio[i] = p1boot/p2boot 
    if(p1boot>p2boot){null_prop[i]=1} 
  }
 
 pvalue_prop = sum(null_prop)/N
 hist(null_prop_ratio,font=2,main="Null Distribution Assuming H0 is True",xlab="P1/P2",font.lab=2)
 abline(v=1,lwd=2,lty=2)
 text(2,250,paste("p-value=",pvalue_prop),col="red")

We evaluated whether the low trash situation in February is different than August.

H_{0}: P(p_{Feb}>p_{Aug}) = 0.5

H_{A}: P(p_{Feb}>p_{Aug}) > 0.5

The p-value is 0.8941. 8941 out of the 10,000 bootstrap replicates had p_{Feb}>p_{Aug}. For a 5% rate of error (\alpha=5\%), we cannot reject the null hypothesis since the p-value is not greater than 0.95. The evidence (89.41% of the time) that the low trash proportion in February exceeds that of August is insufficient to reject equality.

Your RStudio interface will look like this.

In summary, 
use fisher.test(contingency_table,alternative="greater") 
if you want to verify based on Fisher's Exact Test; 

use p = (x1+x2)/(n1+n2); 
    z = (p1-p2)/sqrt(p*(1-p)*((1/n1)+(1/n2))); 
    pval = 1-pnorm(z) 
if you want to verify based on the normal approximation; 

code up a few lines if you want to verify using the bootstrap method.

  1. Is the mean of the February refuse collected from Manhattan’s community district 9 different from August?
    • Reject the null hypothesis. They are different.
  2. Is the variance of the February refuse collected from Manhattan’s community district 9 different from August?
    • Cannot reject the null hypothesis. They are probably the same.
  3. Suppose we set 2500 tons as a benchmark for low trash situations. Is the proportion of February refuse lower than 2500 tons different from the proportion of August? In other words, is the low trash situation in February different from that in August?
    • Cannot reject the null hypothesis. They are probably the same.

Please take the next two weeks to digest all the two-sample hypothesis tests, including how to execute them in R.

Here is the full code for today’s lesson.

In a little over four years, we reached 99 lessons 😎

I will have something different for the 100th. Stay tuned …

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 91 – The One-Sample Hypothesis Tests in R

The last time we did something in R is in Lesson 83. Since then, we have been learning the hypothesis testing framework. In Lesson 84, we first made friends with the hypothesis tests — the concept of needing evidence beyond a reasonable doubt. In Lesson 85, we learned the framework — the elements of hypothesis testing that provide us with a systematic way of setting up the test and using data to verify the hypothesis.

In Lessons 86, 87, and 88, we learned the one-sample hypothesis test on proportion, mean, and the variance, respectively. Then, in Lesson 89, we learned the one-sample sign test as a complement to the one-sample hypothesis test on the mean, which can account for outliers as well. Last week (Lesson 90), we dived into the bootstrap-based method where we relaxed the assumptions on the null distribution. The bootstrap-based process provides us the flexibility to run the test on various statistics of interest.

Today, let’s pause and look at how to perform these hypothesis tests in R. Brace yourself for an R drive.

To help us with this lesson today, we will make use of the eight data points on 2003 Ford Focus vehicle mileage that Joe used in Lesson 89.

The Basic Steps

Step 1: Get the data

You can get the data from here.

Step 2: Create a new folder on your computer

Let us call this folder “lesson91”. 

Step 3: Create a new code in R

Create a new code for this lesson. “File >> New >> R script”. Save the code in the same folder “lesson91” using the “save” button or by using “Ctrl+S”. Use .R as the extension — “lesson91_code.R”

Step 4: Choose your working directory

“lesson91” is the folder where we stored the code. Use “setwd(“path”)” to set the path to this folder. Execute the line by clicking the “Run” button on the top right.

setwd("path to your folder")

Step 5: Read the data into the R workspace

Since the sample data is small enough, instead of saving the data in a text file and reading it into R workspace, we can directly input the data into the workspace as follows:

## Data Entry ##
# FORD MPG - reported # 
  x = c(21.7, 29, 28.1, 31.5, 24, 21.5, 28.7, 29)

The Baseline for Null Hypotheses

Let’s run the hypothesis tests on the mean, standard deviation, and the proportion. For this, we have to make some modest assumptions.

For the mean, we will use the EPA specified rating of 26.2 MPG. \mu = 26.2 MPG.

For comparing the standard deviation, we are unsure of the baseline standard deviation from the EPA rating. However, they provide us with a range that is between 22 MPG and 32 MPG. If we can assume that the range is covered by six standard deviations, i.e., 32 - 22 = 6 \sigma, we can get an estimate for baseline \sigma. In this case, \sigma = 1.67 MPG.

Further, let’s also assume that one in eight cars will usually have a mileage less than the minimum rating of 22 MPG. p = \frac{1}{8}

# Compare to Baseline
  mu = 26
 
  rangex = c(22,32)
  sigma = diff(rangex)/6

  n = length(x)
  p = 1/8 # baseline proportion: one in eight cars

Hypothesis Test on the Mean

For the hypothesis test on the mean, we learned three methods, the t-test, the sign-test, and the bootstrap-based test.

Let’s start with the t-test. For the t-test, the null and the alternate hypothesis are:

H_{0}: \mu \ge 26.2 MPG

H_{A}: \mu < 26.2 MPG

The null distribution is a T-distribution with (n-1) degrees of freedom. The test statistic is t_{0}=\frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}. Since the alternative is one-sided (less than), we have to verify how likely it is to see a value as small as t_{0} in the null distribution.

Let’s compute the test statistic.

#T-Test #
   xbar = mean(x)

   s = sd(x)

   to = (xbar-mu)/(s/sqrt(n))
> to
   [1] 0.5173082

The test statistic is 0.517.

For a selected rejection rate of \alpha, we have to compute the p-value corresponding to this test statistic, or the critical value on the T-distribution.

   alpha = 0.05
   
   pvalue_ttest = pt(to,df=(n-1))   
   
   tcritical = qt(alpha,df=(n-1))
   

When you execute these lines, you will see that the p-value is 0.69, and the critical value from the t-distribution is -1.89. The “pt” function computed P(T \le t_{0}) from a T-distribution with user-specified degrees of freedom. For our case, df = 7. The “qt” function computes the quantile corresponding to a probability of 5%, i.e., \alpha=0.05 from a T-distribution with user-specified degrees of freedom.

> pvalue_ttest
   [1] 0.6895594

> tcritical
   [1] -1.894579 

Since the p-value is greater than 0.05, or since t_{0}>t_{critical}, we cannot reject the null hypothesis.

In R, there is a function to run this test — “t.test”.

We could have used the function directly on the sample data x.

# using t-Test function in R #
   t.test(x,alternative = "less", mu = mu, conf.level = 0.95)
One Sample t-test
 data:  x
 t = 0.51731, df = 7, p-value = 0.6896
 alternative hypothesis: true mean is less than 26
 95 percent confidence interval:
      -Inf 29.20539
 sample estimates:
 mean of x 
   26.6875 

It provides t_{0} and the p-value as the outputs from which we can decide on the null hypothesis.


Next, let’s look at how to conduct the sign-test in R. The null and the alternative hypothesis for the sign-test are

H_{0}: P(X > 26.2) = 0.5

H_{A}: P(X > 26.2) < 0.5

If H_{0} is true, about half of the values of sample X will be greater than 26.2 MPG, and about half of them will be less. If H_{A} is true, more than half of the values of sample X will be less than 26.2 MPG, i.e., the sample shows low mileages — significantly less than 26.2 MPG.

To run this test, we should compute s^{+}, the test statistic that determines the number of values exceeding 26.2 MPG.

#Sign-Test
 splus = length(which(x>mu))
 > splus
   [1] 5 

s^{+} is 5.

Under the null hypothesis, s^{+} follows a binomial distribution with a probability of 0.5. Using this assumption, we compute the p-value using the “binom.test” function in R.

# Using Binom Test function in R 
   binom.test(splus,n,p=0.5,alternative = "less")
Exact binomial test
 data:  splus and n
 number of successes = 5, number of trials = 8, p-value = 0.8555
 alternative hypothesis: true probability of success is less than 0.5
 95 percent confidence interval:
  0.0000000 0.8888873
 sample estimates:
 probability of success 
                  0.625 

The test provides the p-value, i.e., P(S^{+} \le 5) from a binomial distribution with n = 8 and p = 0.5.

P(S^{+} \le 5)=\sum_{k=0}^{k=5}{8 \choose k} p^{k}(1-p)^{8-k}=0.8555

Since the p-value is greater than 0.05, the rejection rate, we cannot reject the null hypothesis.
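If you want to verify that tail sum directly, a one-line sketch with the binomial CDF in base R does it.

# Cross-check of P(S+ <= 5) with n = 8 and p = 0.5
pbinom(splus,n,prob=0.5)   # ~0.8555, matching the binom.test p-value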


Lastly, we will check out the bootstrap-based test on the mean. For the bootstrap-based one-sided test, the null hypothesis and the alternate hypothesis are

H_{0}: P(\bar{x} > 26.2) = 0.5

H_{A}: P(\bar{x} > 26.2) < 0.5

Use the following lines to set up and run the test.

#Bootstrap Test
 N = 10000

 xmean_null = matrix(NA,nrow=N,ncol=1)
  for (i in 1:N)
   {
     xboot = sample(x,n,replace=T)
     xmean_null[i,1] = mean(xboot)
   }
 
hist(xmean_null,col="pink",xlab="Mean of the Distribution",font=2,font.lab=2,main="Null Distribution Assuming Ho is True")

points(mu,10,pch=15,cex=2,col="blue")

abline(v=c(quantile(xmean_null,0.95)),col="black",lwd=2,lty=2)

pvalue_bootstrap = length(which(xmean_null>mu))/N
> pvalue_bootstrap
   [1] 0.7159 

In the loop, we are executing the “sample” function to draw a bootstrap-replicate from the original data. Then, from this bootstrap-replicate, we compute the mean. Repeated such sampling and computation of the bootstrap-replicated mean statistic forms the null distribution. From this null distribution, we calculate the p-value, i.e., the proportion of the null distribution that exceeds the baseline of 26.2 MPG.

Since the p-value (0.716) is greater than a 5% rejection rate, we cannot reject the null hypothesis.

We can also observe that the code provides a way to visualize the null distribution and the basis value of 26.2 MPG. If the basis value \mu=26.2 is so far out on the null distribution of \bar{x} that less than 5% of the bootstrap-replicated test statistics are greater than \mu, we would have rejected the null hypothesis.

Hypothesis Test on the Variance

For the hypothesis test on the variance, we learned two methods, the Chi-square test, and the bootstrap-based test.

Let’s first look at the Chi-square test. The null and alternate hypotheses are as follows:

H_{0}: \sigma^{2}=1.67^{2}

H_{A}: \sigma^{2} \neq 1.67^{2}

The alternate hypothesis is two-sided. Deviation in either direction (less than or greater than) will reject the null hypothesis.

If you can recall from Lesson 88, \frac{(n-1)s^{2}}{\sigma^{2}} is our test statistic, \chi^{2}_{0}, which we will verify against a Chi-square distribution with (n-1) degrees of freedom.

# Chi-square Test
chi0 = ((n-1)*s^2)/sigma^2

pvalue_chi0 = pchisq(chi0,df=(n-1))

chilimit_right = qchisq(0.975,df=(n-1))

chilimit_left = qchisq(0.025,df=(n-1))

The “pchisq” function computes P(\chi^{2} \le \chi^{2}_{0}) from a Chi-square distribution with user-specified degrees of freedom, df = 7. The “qchisq” function computes the quantile corresponding to a specified rate of rejection. Since we are conducting a two-sided test, we calculate the limiting value on the left tail and the right tail.

 > pvalue_chi0
   [1] 0.9999914 

The p-value is 0.999. Since it is greater than 0.975, we reject the null hypothesis based on the two-sided test.

> chi0
   [1] 35.60715

> chilimit_left
   [1] 1.689869

> chilimit_right
   [1] 16.01276 

The lower and the upper bounds from the null distribution are 1.69 and 16.01, respectively, whereas the test statistic \chi^{2}_{0} is 35.60, well beyond the upper acceptable value. Hence we reject the null hypothesis.


Now, let’s look at how to run a bootstrap-based test for the standard deviation. For the bootstrap-based two-sided test, the null hypothesis and the alternate hypothesis are

H_{0}: P(\sigma > 1.67) = 0.5

H_{A}: P(\sigma > 1.67) \neq 0.5

Use the following lines to set up and run the test.

# Bootstrap Test for Standard Deviation
 N = 10000
 
xsd_null = matrix(NA,nrow=N,ncol=1)
 for (i in 1:N)
   {
     xboot = sample(x,n,replace=T)
     xsd_null[i,1] = sd(xboot)
   }

hist(xsd_null,col="pink",xlab="Standard Deviation of the Distribution",font=2,font.lab=2,main="Null Distribution Assuming Ho is True")

points(sigma,10,pch=15,cex=2,col="blue")
   abline(v=c(quantile(xsd_null,0.025),quantile(xsd_null,0.975)),col="black",lwd=2,lty=2)

pvalue_bootstrap = length(which(xsd_null>sigma))/N
> pvalue_bootstrap
   [1] 0.9747 

As before, in the loop, we are executing the “sample” function to draw a bootstrap-replicate from the original data. Then, from this bootstrap-replicate, we compute the standard deviation. Repeated such sampling and computation of the bootstrap-replicated standard deviation statistic forms the null distribution for \sigma. From this null distribution, we calculate the p-value, i.e., the proportion of the null distribution that exceeds the baseline of 1.67 MPG.

Since the p-value (0.975) is greater than or equal to 0.975, (\frac{\alpha}{2} rejection rate), we reject the null hypothesis.

The code provides a way to visualize the null distribution and the basis value of 1.67 MPG. The basis value is far left on the null distribution. Almost 97.5% of the bootstrap-replicated test statistics are greater than \sigma = 1.67 — we reject the null hypothesis.

Hypothesis Test on the Proportion

For the hypothesis test on the proportion, we can employ the binomial distribution (parametric) as the null distribution, or use the bootstrap-based method (non-parametric) to generate the null distribution. For the parametric approach, the null and the alternative hypothesis are

H_{0}: p \le \frac{1}{8}

H_{A}: p > \frac{1}{8}

We are considering a one-sided alternative since the departure in one direction (greater than) is sufficient to reject the null hypothesis.

From a sample of eight cars, the null distribution is the probability distribution of observing any number of cars having a mileage of less than 22 MPG. In other words, out of the eight cars, we could have 0, 1, 2, 3, …, 8 cars to have a mileage less than 22 MPG, the lower bound specified by EPA.

The null distribution is the distribution of the probability of observing these outcomes. In a Binomial null distribution with n = 8 and p = 1/8, what is the probability of getting 0, 1, 2, …, 8 such cars? If more than an acceptable number (as judged from the null distribution) is observed, we reject the null hypothesis.

We can generate and plot that null distribution in R using the following lines.

## Hypothesis Test on the Proportion
 
ncars = length(which(x<22))

plot(0:n,dbinom(0:n,n,prob=p),type="o",xlab="Number of cars with MPG less than the range",ylab="P(X=x)",font=2,font.lab=2)

x_at_alpha = qbinom(0.95,n,p) # approx quantile for 5% rejection 

cord.x <- c(x_at_alpha,seq(x_at_alpha,n),x_at_alpha) 

cord.y <- c(0, dbinom(x_at_alpha:n,n,prob=p), 0) 

polygon(cord.x,cord.y,col="pink")

points(ncars,0.002,pch=15,col="blue",cex=2)

pval = sum(dbinom(ncars:n,n,prob=p))

In the first line, we are computing the number of cars that have a mileage of less than 22 MPG, which is the test statistic. We observe this to be two cars.

> ncars
   [1] 2 

In the next few lines, we are plotting the null distribution as computed from a binomial distribution with n = 8 and p = 1/8, and visually showing the region of rejection using the “polygon” function.

Then, we compute the p-value as the probability of observing two or more cars having a mileage less than 22 MPG.

P(Y \ge 2) = \sum_{k=2}^{k=8} {8 \choose k} p^{k}(1-p)^{8-k}=0.2636

> pval
   [1] 0.2636952 

Since the p-value is greater than 0.05, we cannot reject the null hypothesis (based on the limited sample) that one in eight cars (2003 Ford Focus) will have a mileage of less than 22 MPG.
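As a quick cross-check of that tail sum, the binomial CDF gives the same number (a small sketch using the ncars, n, and p already in the workspace).

# Cross-check of P(Y >= 2) with n = 8 and p = 1/8
1 - pbinom((ncars-1),n,prob=p)   # ~0.2637, matching pval above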


Let’s wrap up the day by running the bootstrap test on the proportion. For the bootstrap-based hypothesis test on the proportion, the null and the alternate hypothesis are

H_{0}: P(p > \frac{1}{8}) = 0.5

H_{A}: P(p > \frac{1}{8}) > 0.5

If we get a large proportion of the null distribution to be greater than 1/8, we will reject the null hypothesis.

Use the following lines to execute the test.

# Bootstrap Test on Proportion 
 N = 10000

xprop_null = matrix(NA,nrow=N,ncol=1)
 for (i in 1:N)
   {
     xboot = sample(x,n,replace=T)
     xprop_null[i,1] = length(which(xboot<22))/n
   }

hist(xprop_null,col="pink",xlab="Proportion of the Distribution",font=2,font.lab=2,main="Null Distribution Assuming Ho is True")

points(p,10,pch=15,cex=2,col="blue")

abline(v=c(quantile(xprop_null,0.95)),col="black",lwd=2,lty=2)

pval=length(which(xprop_null>p))/N
> pval
   [1] 0.631 

We are executing the “sample” function N = 10,000 times to draw bootstrap-replicates from the original data. Then, for each bootstrap-replicate, we compute the proportion that is less than 22 MPG. The 10,000 bootstrap-replicated proportions form the null distribution for p. From this null distribution, we calculate the p-value, i.e., the proportion of the null distribution that exceeds the baseline of \frac{1}{8}.

Since the p-value (0.631) is less than 0.95, (1-\alpha rejection rate), we cannot reject the null hypothesis. 63% of the bootstrap-replicated proportion is greater than the baseline. If more than 95% of the bootstrap-replicated proportion is greater than the baseline, we would have rejected the null hypothesis. This scenario would have unfolded if the original sample had several cars with less than a mileage of 22 MPG. The bootstrap replicates would show them, in which case the proportion would be much greater than 1/8 in many replicates, and overall, the p-value would exceed 0.95.

Take the next two weeks to digest all the one-sample hypothesis tests, including how to execute the tests in R. In two weeks, we will move on to the two-sample hypothesis tests.

Here is the full code for today’s lesson.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 83 – The trees of New York

Prequel to “Riding with confidence, in R”

Saturday, February 9, 2019

In which Jenny reads Robert Frost and stumbles upon the trees of New York. She designs her next three lessons to demonstrate confidence intervals and various sampling strategies using R and the trees of New York.

Her head swayed towards her shoulder and she looked out of the window, when, for the first time, she paid attention to the tree outside her apartment.

Blessing Lilies and Classic Dendrobium, all she had from Michael’s Emporium. Oaks and Pears and Crabapples, never once she heeds in the Big Apple.

“Alas, the trees are ubiquitous in this concrete jungle,” she thought, as she immediately resorted to NYC Open Data.

“A place that hosts our building footprints to the T should definitely have something on the trees,” she thought.

Sure enough. The 2015 Street Tree Census, the Tree Data for the Big Apple is available. It was conducted by volunteers and staff organized by NYC Parks and Recreations and partner organizations. Tree species, its diameter, and its perceived health are available along with a suite of accompanying data.

She immediately downloaded the master file and started playing around with it.

“This is a very big file and will need a lot of memory to load,” she thought. Hence, with the most important details, she created an abridged version and uploaded it here for you.

She fires up RStudio on her computer and immediately reads the file into her workspace.

# Reading the tree data file #
nyc_trees = read.csv("2015StreetTreesCensus_TREES_abridged.csv",header=T)

“I wonder how many trees there are in the City.” The answer is 683,788.

nrow(nyc_trees)

In a city with 8.54 million people, that is a street tree for roughly every twelve residents.
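A quick back-of-the-envelope check of that ratio, taking 8.54 million as the assumed population figure:

# Residents per street tree, assuming a population of 8.54 million
8.54e6/nrow(nyc_trees)   # roughly 12.5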

“What would be their types?” she asked.

# Types of trees #
types = table(nyc_trees$spc_common)
pie(types,radius=1,cex=0.75,font=2)
sort(types)

133 different species with the top 5 being London planetree (87014), Honeylocust (64264), Callery pear (58931), Pin oak (53185) and Norway maple (34189).

“Wow! There are 87014 London planetrees in the City. Let me check out the population characteristics,” she thought as she typed a few lines.

## London planetree (87014) ## 

# locating London planetrees (lpt) in the data file #
lpt_index = which(nyc_trees$spc_common == "London planetree")

#create a separate data frame for london planetrees #
lpt = nyc_trees[lpt_index,]

# London planetree Population #
N = nrow(lpt)
lpt_true_distribution_diam = lpt$tree_dbh

# True Mean #
lpt_true_mu = mean(lpt_true_distribution_diam)

# True Variance #
lpt_true_var = mean((lpt_true_distribution_diam - lpt_true_mu)^2)

# True Standard Deviation
lpt_true_sd = sqrt(lpt_true_var)

# True Proportion of Light Damaged lpts #
lpt_damaged = which(lpt$brnch_ligh=="Yes")
lpt_true_damage_proportion = length(lpt_damaged)/N

# Plot the full distribution
boxplot(lpt_true_distribution_diam,horizontal=T,font=2,font.lab=2,boxwex=0.25,col="green",main="Diameter of London planetrees (inches)")

She first identified the row index for London planetree, created a separate data frame “lpt” for these trees using these indices and then computed the true mean, true variance and standard deviation of the tree diameters.

\mu = 21.56 inches

\sigma^{2} = 81.96 inches^{2}

\sigma = 9.05 inches

She also noticed that there is a column for whether or not the tree is damaged due to lighting. She computed the true proportion of this.

p = 0.14

Then, as she always does, created a boxplot to check out the full distribution of the data.

“What about Manhattan,” she thought. You are not living in the city if you are not from Manhattan. So she counts the number of trees in each borough and their percentages.

## Count the number of trees in each borough ##
manhattan_index = which(lpt$borocode==1)
bronx_index = which(lpt$borocode==2)
brooklyn_index = which(lpt$borocode==3)
queens_index = which(lpt$borocode==4)
staten_index = which(lpt$borocode==5)

n_manhattan = length(manhattan_index)
n_bronx = length(bronx_index)
n_brooklyn = length(brooklyn_index)
n_queens = length(queens_index)
n_staten = length(staten_index)

n_boro = c(n_manhattan,n_bronx,n_brooklyn,n_queens,n_staten)
barplot(n_boro)

p_boro = (n_boro/N)
barplot(p_boro,names.arg = c("Manhattan","Bronx", "Brooklyn", "Queens", "Staten Island"),font=2)

“Hmm 🙁 Let Manhattan at least have the old trees,” she prays and types the following lines to create a map of the trees and their diameters. She also wants to check where the damaged trees are.

There are some libraries required for making maps in R. If you don’t have them, you should install first using install.packages() command.
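If any of them are missing, something along these lines installs them in one go (the package names are taken from the library() calls below; gpclib in particular may no longer be available on newer R installations).

# Install the mapping packages used below (run once, if needed)
install.packages(c("maps","maptools","RColorBrewer","classInt","gpclib","mapdata","fields"))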

Jenny is plotting the diameter (plotvar). The other lines are cosmetics: liners, lipsticks, and blush.

## Plot of London planetrees ##
library(maps)
library(maptools)
library(RColorBrewer)
library(classInt)
library(gpclib)
library(mapdata)
library(fields)

par(mfrow=c(1,2))
# plot 1: diameter
plotvar <- lpt_true_distribution_diam

nclr <- 6 # Define number of colours to be used in plot

plotclr <- brewer.pal(nclr,"Greens") # Define colour palette to be used

# Define colour intervals and colour code variable for plotting
class <- classIntervals(plotvar, nclr, style = "quantile")

colcode <- findColours(class, plotclr)

plot(lpt$longitude,lpt$Latitude,cex=0.15,pch=15, col = colcode, xlab="",ylab="",axes=F,font=2,font.lab=2)

map("county",region="New York",add=T)
box()

title("London Planetrees")
legend("topleft", legend = names(attr(colcode, "table")), fill = attr(colcode, "palette"), cex = 1, bty = "n",title="Diameter (inches)")

# plot 2: show damaged trees
plotvar <- lpt_true_distribution_diam
ind_1 = which(lpt$brnch_ligh=="Yes")

nclr <- 6 # Define number of colours to be used in plot

plotclr <- brewer.pal(nclr,"Greens") # Define colour palette to be used

# Define colour intervals and colour code variable for plotting
class <- classIntervals(plotvar, nclr, style = "quantile")

colcode <- findColours(class, plotclr)

plot(lpt$longitude,lpt$Latitude,cex=0,pch=15, col = colcode, xlab="",ylab="",axes=F,font=2,font.lab=2)

points(lpt$longitude[ind_1],lpt$Latitude[ind_1],pch=20,cex=0.1)

map("county",region="New York",add=T)
box()

title("London Planetrees - Damaged")

The older trees belong mostly to Queens and Brooklyn; so do the damaged trees.

“This will make for a nice example of sampling bias, confidence intervals, and sampling strategies,” she thought as she noticed that the trees with smaller diameters are in Manhattan.

“If we only sample from Manhattan, it will clearly result in a bias in estimation of the population characteristics. Manhattan sample is not fully representative of the population or the true distribution,” she thought.

“Let me compare the population to the sample distributions from the five boroughs.”

She first separates the data for each borough and plots them against the true distribution.

## Breakdown by Borough
lpt_manhattan = lpt[manhattan_index,]
lpt_bronx = lpt[bronx_index,]
lpt_brooklyn = lpt[brooklyn_index,]
lpt_queens = lpt[queens_index,]
lpt_staten = lpt[staten_index,]

# Plotting the population and borough wise samples ##
boxplot(lpt$tree_dbh,horizontal=T,xlim=c(0,7),ylim=c(-100,350),col="green",xlab="Diameter (inches)")

boxplot(lpt_manhattan$tree_dbh,horizontal = T,add=T,at=2,col="red")
text(-50,2,"Manhattan")

boxplot(lpt_bronx$tree_dbh,horizontal = T,add=T,at=3,col="pink")
text(-50,3,"Bronx")

boxplot(lpt_brooklyn$tree_dbh,horizontal = T,add=T,at=4)
text(-50,4,"Brooklyn")

boxplot(lpt_queens$tree_dbh,horizontal = T,add=T,at=5)
text(-50,5,"Queens")

boxplot(lpt_staten$tree_dbh,horizontal = T,add=T,at=6)
text(-60,6,"Staten Island")

abline(v=lpt_true_mu,lty=2)
text(30,0,"True Mean = 21.56 inches")

“Brooklyn, Queens and Staten Island resemble the population distribution. Manhattan and the Bronx are biased. If we want to understand the population characteristics, we need a random sample that covers all the five boroughs. The data seems to be stratified. Why don’t I sample data using different strategies and see how closely they represent the population,” she thought.

Simple Random Sampling

She starts with simple random sampling.

“I will use all five boroughs as my sampling frame. Each of these 87014 trees has an equal chance of being selected. Let me randomly select 30 trees from this frame using sampling without replacement. To show the variety, I will repeat this 10 times.”

She types up a few lines to execute this strategy.

# 1: Simple Random Sample
population_index = 1:N

ntimes = 10
nsamples = 30

simple_random_samples = matrix(NA,nrow=nsamples,ncol=ntimes)
for (i in 1:ntimes)
{
# a random sample of 30 #
sample_index = sample(population_index,nsamples,replace=F)
simple_random_samples[,i] = lpt$tree_dbh[sample_index]
}

boxplot(lpt$tree_dbh,horizontal=T,xlim=c(0,11),ylim=c(-100,350),col="green",xlab="Diameter (inches)",main="Simple Random Sampling")

boxplot(simple_random_samples,horizontal = T,boxwex=0.5,add=T,at=c(2:11),col="pink")

abline(v=lpt_true_mu,lty=2,col="red")
text(30,0,"True Mean = 21.56 inches")

The sampling is done ten times. These ten samples are shown in the pink boxes against the true distribution, the green box.

Most of the samples cover the true distribution. It is safe to say that the simple random sampling method is providing a reasonable representation of the population.

Stratified Random Sampling

“Let me now divide the population into strata or groups. I will have separate sampling frames for the five boroughs and will do a simple random sampling from each stratum. Since I know the percentages of the trees in each borough, I can roughly sample in that proportion and combine all the samples from the strata into a full representative sample. An inference from this combined samples is what has to be done, not on individual strata samples.”

# 2: Stratified Random Sample
population_index = 1:N

ntimes = 10
nsamples = 100

stratified_random_samples = matrix(NA,nrow=nsamples,ncol=ntimes)
for (i in 1:ntimes)
{
# Manhattan #
ns_manhattan = round(nsamples*p_boro[1])
ns_bronx = round(nsamples*p_boro[2])
ns_brooklyn = round(nsamples*p_boro[3])
ns_queens = round(nsamples*p_boro[4])
ns_staten = nsamples - sum(ns_manhattan,ns_bronx,ns_brooklyn,ns_queens)

sample_manhattan = sample(manhattan_index,ns_manhattan,replace=F)
sample_bronx = sample(bronx_index,ns_bronx,replace=F)
sample_brooklyn = sample(brooklyn_index,ns_brooklyn,replace=F)
sample_queens = sample(queens_index,ns_queens,replace=F)
sample_staten = sample(staten_index,ns_staten,replace=F)

full_sample = c(lpt$tree_dbh[sample_manhattan],lpt$tree_dbh[sample_bronx],lpt$tree_dbh[sample_brooklyn],lpt$tree_dbh[sample_queens],lpt$tree_dbh[sample_staten])

stratified_random_samples[,i] = full_sample
}

boxplot(lpt$tree_dbh,horizontal=T,xlim=c(0,11),ylim=c(-100,350),col="green",xlab="Diameter (inches)",main="Stratified Random Sampling")

boxplot(stratified_random_samples,horizontal = T,boxwex=0.5,add=T,at=c(2:11),col="pink")

abline(v=lpt_true_mu,lty=2,col="red")
text(30,0,"True Mean = 21.56 inches")

Again, pretty good representation.

She was curious as to where Manhattan is in these samples. So she types a few lines to create this simple animation.

## Animation ##
for (i in 1:10)
{
boxplot(lpt$tree_dbh,boxwex=0.3,horizontal=T,xlim=c(0,2),ylim=c(0,350),col="green",xlab="Diameter (inches)",main="Stratified Random Sampling")

abline(v=lpt_true_mu,lty=2,col="red")
text(50,0,"True Mean = 21.56 inches")
legend(250,2,cex=0.76,pch=0,col="red","Manhattan")

stripchart(stratified_random_samples[6:100,i],add=T,at=1.5,cex=0.6,col="blue")

stripchart(stratified_random_samples[1:5,i],add=T,at=1.5,cex=0.75,col="red")

Sys.sleep(1)
}

The animation is showing the samples each time. Manhattan samples are shown in red. Did you notice that the Manhattan samples are mostly from the left tail? Unless it is combined with the other strata, we will not get a full representative sample.

Cluster Random Sampling

“Both of these methods seem to give good representative samples. Let me now check the cluster random sampling method. We have the zip codes for each tree. So I will imagine that each zip code is a cluster and randomly select some zip codes using the simple random sampling method. All the trees in these zip codes will then be my sample.”

This is her code for the cluster sampling method. She first identifies all the zip codes and then randomly samples from them.

# 3: Cluster Random Sample
zips = table(lpt$zipcode)
list_zips = as.numeric(names(zips))
list_zips = list_zips[-1]

ntimes = 10
nsamples = 10

cluster_random_samples = NULL
for (i in 1:ntimes)
{
cluster_index = sample(list_zips,nsamples,replace=F)
cluster_sample = NULL
for (j in 1:length(cluster_index))
{
ind = which(lpt$zipcode==cluster_index[j])
cluster_sample = c(cluster_sample,lpt$tree_dbh[ind])
}
cluster_random_samples = cbind(cluster_random_samples,cluster_sample)
}
boxplot(lpt$tree_dbh,horizontal=T,xlim=c(0,11),ylim=c(-100,350),col="green",xlab="Diameter (inches)",main="Cluster Random Sampling")

boxplot(cluster_random_samples,horizontal = T,boxwex=0.5,add=T,at=c(2:11),col="pink",axes=F)

abline(v=lpt_true_mu,lty=2,col="red")
text(30,0,"True Mean = 21.56 inches")

“Hmm. There seems to be one sample which is biased. It is possible. We could have selected most of the zip codes from Manhattan or the Bronx. Since there is not much variability within these clusters, we could not represent the entire population. There is a risk of running poor inferences if we use this sample. Nevertheless, most times we get a good representative sample.”

Systematic Random Sampling

“Let me also try the systematic random sampling method just for completion. I will select every 870th tree in the line-up of 87014 trees. I will do this 10 times, moving the starting point up by one each time, i.e., for the second try, I will start from the second tree and again select every 870th tree, and so forth. Since I am covering the full range of the data, we should get a good representation.”

She types up a few simple lines.

# 4: Systematic Random Sample
ntimes = 10
nsamples = 100

systematic_random_samples = NULL
for (i in 1:ntimes)
{
# a systematic sample covering the full range #
systematic_index = seq(from = i, to = N, by = round((N/nsamples)))
systematic_random_samples = cbind(systematic_random_samples,lpt$tree_dbh[systematic_index])
}

boxplot(lpt$tree_dbh,horizontal=T,xlim=c(0,11),ylim=c(-100,350),col="green",xlab="Diameter (inches)",main="Systematic Random Sampling")

boxplot(systematic_random_samples,horizontal = T,boxwex=0.5,add=T,at=c(2:11),col="pink",axes=F)

abline(v=lpt_true_mu,lty=2,col="red")
text(30,0,"True Mean = 21.56 inches")

The representation looks good for this method too.

“This dataset is an excellent resource to demonstrate so many concepts. I will use it to create a situational analysis for my class. I think I can show this over three weeks. Each week, I will introduce one concept. This Monday, I will take a small sample obtained from the simple random sampling method to the class. With this sample data, I will explain the basics of how to compute and visualize confidence intervals in R. I can also show how confidence intervals are derived using the bootstrap method.

Then, I will send them off on a task, asking them to collect data at random from the city. It will be fun hugging trees and collecting data. I should make sure I tell them that the data has to be collected at random from the entire City, not just a few boroughs. Then, all their samples will be representative, and I can show them the probabilistic interpretation of the confidence intervals, the real meaning.

After this, I will ask them to go out and take the samples again, only this time, I will emphasize that they should collect the data from their respective boroughs. This means the folks who collect data from Manhattan will bring biased samples, and those samples will stand out in the confidence interval test. This will confuse them, but it will give me a good platform to explain sampling bias and the various strategies of sampling.”

Someday, I shall be gone; yet, for the little I had to say, the world might bear, forever its tone.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 82 – Riding with confidence, in R: Week 3

On the eleventh day of February 2019, Jenny showed her students how to compute and visualize confidence intervals in R. Her demo included the confidence interval on the mean, variance/standard deviation, and proportions. She also presented the code to develop bootstrap confidence intervals for any parameter. All this was based on a random sample that she collected.

But she wanted the students to have hands-on experience of data gathering and know the real meaning of confidence intervals, in that, for a 95% level, there is a 95% probability of selecting a sample for which the confidence interval will contain the true parameter value, \mu, \sigma^{2} or p. So she sent them out to collect data through random sampling. The 40 students each brought back samples. 40 different samples of 30 trees each.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On the eighteenth day of February 2019, the students came back with the samples. They all developed confidence intervals from their samples. Jenny showed them how the sample mean was converging to the true mean, and that over 40 confidence intervals, roughly two (5%) may not contain the truth. They also learned how the interval shrinks as one gets more and more samples.

Then, Jenny wanted them to understand the issues with sampling. So she sent them off for a second time. This time, the students divided themselves into teams, visiting different boroughs and collecting samples only from that borough.

Today, they are all back with their new samples.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Monday, February 25, 2019

In which the students come back with new samples, borough wise. Jenny explains the traps due to sampling bias and the basic types of sampling strategies.

It is Samantha who is always at the forefront of the class. She was leading seven other students in team Manhattan. Each of these eight students gathered data for 30 trees in Manhattan. So, they create eight different confidence intervals — representing the intervals from Manhattan. Samantha shows her example — the locations of the data that she gathered and the confidence interval of the true mean.

The 95% confidence interval of the true mean diameter \mu is \bar{x} - t_{0.025,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.025,(n-1)} \frac{s}{\sqrt{n}}

“This sample may or may not contain the true mean diameter, but I know that there is a 95% probability of selecting a sample which, when used to develop the confidence interval, will contain the truth,” she said.

John was heading team Bronx. He followed Samantha to show the places he visited and the confidence interval from his sample. His team also had eight students, and each of them gathered data for 30 trees from the Bronx.

John strongly feels that he may be the unfortunate one whose confidence interval will not contain the truth. He may be one among the two. Let’s see.

The leaders of the other three teams, Justin, Harry, and Susan also prepare their confidence intervals. They all go up to the board and display their intervals.

“From last week, we know that the approximation for the true mean \mu is 21.3575,” said Samantha as she projects the vertical line to show it. As you all know, Jenny showed last week that the sample mean was converging to 21.3575 as the samples increased. The principle of consistency.

As they discussed among themselves, Jenny entered the room and gazed at the display. She knew something that the students did not know. But she concealed her amusement and passed along the file to collect the data from all 40 students.

Like last week, the students fill out the file with the data they collected, borough wise. The file will have a total of 1200 row entries. Here is the file. As you have rightly observed, the boro code column is now ordered since the file went from team Manhattan to team Staten Island.

Jenny used the same lines as last week to create a plot to display all the 40 confidence intervals.

Assuming that the file is in the same folder you set your working directory to, you can use this line to read the data into R workspace.

# Reading the tree data file from the students - boro wise #
students_data_boro = read.csv("students_data_borough.csv",header=T)

Use these lines to plot all the confidence intervals.

# Jenny puts it all together and shows #
# 40 students went to collect data from specific boroughs in NYC #
# Each student collects 30 samples #
# Each student develops the confidence intervals #

nstudents = 40
nsamples = 30

students = 1:nstudents

alpha = 0.95
t = alpha + (1-alpha)/2

sample_mean = matrix(NA,nrow=nstudents,ncol=1)
sample_sd = matrix(NA,nrow=nstudents,ncol=1)
ci_t = matrix(NA,nrow=nstudents,ncol=2)

for(i in 1:nstudents)
{
ind = which(students_data_boro$student_index == i)
sample_data = students_data_boro$tree_dbh[ind]
sample_mean[i] = mean(sample_data)
sample_sd[i] = sd(sample_data)
cit_lb = sample_mean[i] - qt(t,df=(nsamples-1))*(sample_sd[i]/sqrt(nsamples))
cit_ub = sample_mean[i] + qt(t,df=(nsamples-1))*(sample_sd[i]/sqrt(nsamples))
ci_t[i,] = c(cit_lb,cit_ub)
}

# Plot all the CIs #
stripchart(ci_t[1,],type="l",col="green",main="",xlab="Diameter (inches)",xlim=c(9,30),ylim=c(1,nstudents))

stripchart(sample_mean[1],add=T,col="green")

for (i in 2:nstudents)
{
stripchart(ci_t[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_mean[i],col="green", add=T, at = i)
}

Once you execute these lines, you will also see this plot in your plot space.

“It looks somewhat different than the one we got last time,” said John.
“Let me see how many of them will contain the truth,” he added as he typed these lines.

He looks through all the confidence intervals for whether or not they cover the truth using an if statement. He calls them “false_samples.” Then, he plots all the confidence intervals once again, but this time, he uses a red color for the false samples. He also added the borough names to give a reference point.

nyc_random_truth = 21.3575

false_samples = students

for (i in 1:nstudents)
{
if( (ci_t[i,1] > nyc_random_truth) || (ci_t[i,2] < nyc_random_truth) )
{false_samples[i]=1} else
{false_samples[i]=0}
}

false_ind = which(false_samples == 1)

# Plot all the CIs; now show the false samples #
stripchart(ci_t[1,],type="l",col="green",main="",xlab="Diameter (inches)",xlim=c(9,30),ylim=c(1,nstudents))

stripchart(sample_mean[1],add=T,col="green")

for (i in 2:nstudents)
{
stripchart(ci_t[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_mean[i],add=T,col="green",at = i)
}

abline(v=nyc_random_truth,lwd=3)

for (i in 1:length(false_ind))
{
j = false_ind[i]
stripchart(ci_t[j,],type="l",col="red",lwd=3, main="", add = T, at = j)
stripchart(sample_mean[j],col="red", lwd=3, add=T, at = j)
}

text(18,4,"Manhattan")
text(24,12,"Bronx")
text(27,20,"Brooklyn")
text(28,28,"Queens")
text(28,37,"Staten Island")

Try it yourself. You will also see this plot.

The students are puzzled. Clearly, more than 5% of the intervals do not cover the truth. Why?

Jenny explains sampling bias

Jenny now explains to them about sampling bias. She starts with a map of all the data that the students brought.

We will get a more detailed explanation for creating maps in R in some later lessons. For now, you can type these lines that Jenny used to create a map.

### Full Map ###
library(maps)
library(maptools)
library(RColorBrewer)
library(classInt)
library(gpclib)
library(mapdata)
library(fields)

plotvar <- students_data_boro$tree_dbh

nclr <- 6 # Define number of colours to be used in plot

plotclr <- brewer.pal(nclr,"Greens") # Define colour palette to be used

# Define colour intervals and colour code variable for plotting
class <- classIntervals(plotvar, nclr, style = "quantile")
colcode <- findColours(class, plotclr)
plot(students_data_boro$longitude,students_data_boro$latitude,cex=0.55,pch=15, col = colcode, xlab="Longitude",ylab="Latitude",font=2,font.lab=2)

map("county",regions ="New York",add=T)

title("London Planetrees")

legend("topleft", legend = names(attr(colcode, "table")), fill = attr(colcode, "palette"), cex = 1, bty = "n",title="Diameter (inches)")

“Look at this map. I am showing the location of the tree based on the latitude and longitude you all recorded. Then, for each point, I am also showing the diameter of the tree using a color bar. Thick trees, i.e., those with larger diameters are shown in darker green. Likewise, thin trees are shown in lighter green. Do you notice anything?” asked Jenny.

Samantha noticed it right away. “The trees in Manhattan have smaller diameters. Mostly, they have dull green shade,” she said.

“Precisely,” Jenny continues. “The trees are not all randomly distributed. They are somewhat clustered with smaller diameters in Manhattan and Bronx and larger diameters in the other boroughs.

Since you collected all your samples from a specific borough, there is a risk of sampling bias.

We can make good inferences about the population only if the sample is representative of the population as a whole.

In other words, the distribution of the sample must be like the distribution of the population from which it comes. In our case, the trees in Manhattan are not fully representative of the entire trees in the City. There was sampling bias, a tendency to collect a sample that is not entirely representative of the population.

For team Manhattan, since the distribution of your sample is dissimilar to that of the population, your statements about the truth are not accurate. You will have a bias — poor inference.

See, the sample mean also does not converge. Even at n=1200, there is still some element of variability.

Last week when you collected data for the trees, I asked you to gather them randomly by making sure that you visit all the boroughs. In other words, I asked you to collect random samples. Randomly visiting all the boroughs will avoid the issues arising from sampling bias. They give a more representative sample and minimize the errors in the inference. That is why we did not see this bias last week.”

“Are there types of sampling?” asked Justin.

Jenny replied. “At the very basic level, there are the simple random sampling, stratified random sampling, and cluster random sampling methods. One of these days, I will show you how to implement these sampling strategies in R. For now, let’s talk about their basics.

What you did in the first week was the simple random sampling method. Your sampling frame was all possible London planetrees in NYC. Each tree in this frame has an equal chance of being selected. From this frame, you randomly selected 30 trees. This is sampling without replacement; you did not take the measurements for the same tree two times. Computationally, what you did was to draw, without replacement, a sequence of n random numbers from 1 to N. Mostly, you will get a representative share of trees from each borough.
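
For readers following along in R, here is a minimal sketch of that computation (it is not part of Jenny's class discussion). It assumes lpt, the full tree data, and N = nrow(lpt) are already in the workspace, as in the earlier code.

# A minimal sketch of one simple random sample of 30 trees #
srs_index = sample(1:N, 30, replace = F)   # 30 random row numbers from 1 to N, without replacement
srs_sample = lpt$tree_dbh[srs_index]       # diameters of the selected trees
mean(srs_sample)                           # sample mean from this simple random sample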

Then there is the stratified random sampling method. Here, we can divide the population into strata — subpopulations or separate sampling frames. Within each frame or stratum, we can do simple random sampling to collect data. The number of samples taken from each stratum or subpopulation is proportional to the size of the stratum. In other words, if we know the percentage of trees in Manhattan compared to the total number of trees, we can approximately sample that percentage from the Manhattan stratum. One thing I can do is to assume that each of your teams got a simple random sample from a stratum, and combine the five teams to give me a full representative sample. An inference from this combined sample will be more accurate than one from the individual strata samples.
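
Likewise, a rough sketch of proportional allocation (again, not part of Jenny's class discussion). It assumes lpt has a borough column, here called boroname, which is an assumed name; adjust it to whatever your file uses.

# Sketch of proportional stratified sampling (boroname is an assumed column name) #
n_total = 100
strata = split(lpt$tree_dbh, lpt$boroname)                    # one sampling frame per borough
n_strata = round(n_total*sapply(strata,length)/nrow(lpt))     # allocation proportional to stratum size
stratified_sample = unlist(mapply(sample, strata, n_strata))  # simple random sample within each stratum
mean(stratified_sample)                                       # estimate from the combined sample

Because of the rounding, the combined sample may differ from n_total by a tree or two.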

In the cluster random sampling method, we can first divide the population into clusters and then randomly select some clusters. The data from these clusters will make up the sample. Imagine that we divide the city into zip codes — each zip code is a cluster. Then we can randomly select some zip codes as part of our sampling strategy. All the trees in these zip codes make up our cluster random sample. However, if there is not much variability within each cluster, we run the risk of not representing the entire population, and hence poor inference.

We can also do systematic sampling, like selecting every 10th tree, but again, we should ensure that we cover the range. If not, we might get a biased sample.”

“How did you know that the borough wise sampling would be biased?” asked someone.

Well, for a one-line answer, you can say it was an educated guess. For a one-lesson answer, you should wait till next week.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 81 – Riding with confidence, in R: Week 2

Monday, February 18, 2019

In which the students come back with samples. Jenny shows them sampling distributions, the real meaning of confidence interval, and a few other exciting things using their data. She sends them off on a second task.

Where is the Mean

Samantha, John, and Christina explore

Samantha: These are the London planetrees I have data for.

The 100(1-\alpha)\% confidence interval of the population mean (\mu) is the interval [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}].

I have a sample of 30 trees. For these 30 trees, the sample mean is 20.6 inches and the sample standard deviation is 9.06 inches. Based on this, the confidence interval of \mu is [20.6 - 2.05\frac{9.06}{\sqrt{30}}, 20.6 + 2.05\frac{9.06}{\sqrt{30}}].

[17.22 inches, 23.98 inches]

John: I collected data for 30 trees too. Here is my data.

And here is the confidence interval I came up with; [17.68 inches, 24.05 inches]. For my data, the sample mean is 20.87 inches and the sample standard deviation is 8.52 inches.

Christina: I have a sample of 30 too. Here is where I took them from.

And, here is the 95% confidence interval of the mean; [19.9 inches, 24.9 inches].

Their sample statistics are different. Their confidence intervals are different. They begin to wonder whether their interval contains the truth, \mu, or, whose interval contains the truth?

Jenny puts all the intervals in context

Jenny shares with the students, an empty data file with the following headers.

The students fill out the file with the data they collected last week. Her class has 40 students, so when the file came back to Jenny, it had a total of 40*30 = 1200 row entries.

Jenny uploaded this file here, for the analysis she is going to show the class.

Jenny: I am sure you all had much fun last week visiting different places in the city and collecting data for the analysis. I am hoping that all of you randomly selected the places to visit. Based on what Sam and Christina showed, it looks like the places are fairly spread out — so we would have gotten a random sample from the population of the trees.

Sam, John, and Christina; the three of them computed the 95% confidence interval of the true mean \mu based on their sample statistics. They found their intervals to be different.

Let’s look at all the 40 confidence intervals. Now that I have the data from you all, I will show you how to do it in R. Essentially, I will take each one of your data, compute the sample statistics and use them to compute the respective confidence intervals. We have 40 different samples, so we will get 40 different confidence intervals — each interval is a statement about the truth. If you remember what we have been discussing about the confidence intervals, for a 95% confidence level,

There is a 95% probability of selecting a sample whose confidence interval will contain the true value of \mu.

In other words, approximately 95% of the 40 samples (38 of them) may contain the truth. 5% (2 of them) may not contain the truth.

Let’s see who is covering the truth and who is not!

I want you all to work out the code with me. Here we go.

First, we will read the data file into R workspace.

# Reading the tree data file from the students # 
students_data = read.csv("students_data.csv",header=T)

Next, use the following lines to compute the confidence interval for each student. I am essentially repeating the computation of sample mean, sample standard deviation and the confidence intervals, in a loop, 40 times, one for each student.

#40 students went to collect data from random locations in NYC # 
#Each student collects 30 samples #
#Each student develops the confidence intervals #

nstudents = 40
nsamples = 30

students = 1:nstudents

alpha = 0.95
t = alpha + (1-alpha)/2

sample_mean = matrix(NA,nrow=nstudents,ncol=1)
sample_sd = matrix(NA,nrow=nstudents,ncol=1)

ci_t = matrix(NA,nrow=nstudents,ncol=2)

for(i in 1:nstudents)
{
ind = which(students_data$student_index == i)

sample_data = students_data$tree_dbh[ind]

sample_mean[i] = mean(sample_data)
sample_sd[i] = sd(sample_data)

cit_lb = sample_mean[i] - qt(t,df=(nsamples-1))*(sample_sd[i]/sqrt(nsamples))
cit_ub = sample_mean[i] + qt(t,df=(nsamples-1))*(sample_sd[i]/sqrt(nsamples))

ci_t[i,] = c(cit_lb,cit_ub)
}

Now, let’s plot the 40 intervals to see them better. A picture is worth a thousand numbers. Use these lines. They explain themselves.

#Plot all the CIs #
stripchart(ci_t[1,],type="l",col="green",main="",xlab="Diameter (inches)",xlim=c(15,30),ylim=c(1,nstudents))
stripchart(sample_mean[1],add=T,col="green")

for (i in 2:nstudents)
{
stripchart(ci_t[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_mean[i],col="green", add=T, at = i)
}

The students are overjoyed looking at the pretty image.

John: This is neat. All our intervals are different, owing to each one of us bringing a different random sample. But how to know which of our intervals contains the truth and which does not is still not clear.

Jenny: We can get a close approximation based on the principle of consistency.

\displaystyle{\lim_{n\to\infty} P(|T_{n}(\theta)-\theta|>\epsilon)} \to 0

As n approaches infinity, the sample estimate approaches the true parameter. We can take the full data record that you all brought, 1200 data points and compute the overall mean.

true_mu = mean(students_data$tree_dbh)

While it is not exactly the true value, based on the idea of consistency, we can assume that it is converging to the true one.

\mu = 21.3575 inches.

Look at this.

# Truth Convergence Plot #
true_mean = sample_mean
for (i in 1:nstudents)
{
ind = which(students_data$student_index <= i)
true_mean[i] = mean(students_data$tree_dbh[ind])
}

plot(true_mean,type="o",xlab="Increasing Sample",ylab="Convergence of the Sample Mean",font=2,font.lab=2,ylim=c(20,22))

My sixth sense and divine visions also tell me that this is converging to the truth!

Now, we can check which of these intervals contains the true mean \mu, 21.3575 inches.

You can use this simple code to check.

false_samples = students
for (i in 1:nstudents)
{
if( (ci_t[i,1] > true_mu) || (ci_t[i,2] < true_mu) )
{false_samples[i]=1} else
  {false_samples[i]=0}
}
false_ind = which(false_samples == 1)

It looks like number 4 and number 26 are not covering the truth. Harry and Miguel.

Let us point them out. Here, look at another pretty plot. The thick black vertical line is the truth, 21.3575 inches. The brown line is how the sample mean (\bar{x}) is converging to the truth (\mu) as we collect more samples. Harry and Miguel stand out. That is 2 out of 40; 5%.

# Plot all the CIs; now show the false samples #
stripchart(ci_t[1,],type="l",col="green",main="",xlab="Diameter (inches)",xlim=c(15,30),ylim=c(1,nstudents))
stripchart(sample_mean[1],add=T,col="green")

for (i in 2:nstudents)
{
stripchart(ci_t[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_mean[i],add=T,col="green",at = i)
}

abline(v=true_mu,lwd=3)
lines(true_mean,students,type="o",col="brown",cex=0.5)

for (i in 1:length(false_ind))
{
j = false_ind[i]

stripchart(ci_t[j,],type="l",col="red",lwd=3, main="", add = T, at = j)
stripchart(sample_mean[j],col="red", lwd=3, add=T, at = j)
}

text(16,4.6,"Harry")
text(26,26.6,"Miguel")

There is a 95% probability of selecting a sample whose confidence interval will contain the true value of \mu.

Miguel: What happens to the confidence interval when we compute it with larger sample sizes? I mean, what is the relationship between the confidence interval and sample size?

Jenny: Look at the equation for the confidence interval of the population mean; [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}].

Theoretically, as n tends to infinity, the interval shrinks to 0 since the sample mean converges to the true mean. So, as n gets larger, the interval gets smaller. Let me show you how it works. Take our data, and, just as we computed the sample mean for increasing sample size to see the convergence, compute the confidence interval also at each step.

# How does CI change with more sample size
sample_mean_update = matrix(NA,nrow=nstudents,ncol=1)
sample_sd_update = matrix(NA,nrow=nstudents,ncol=1)

ci_t_update = matrix(NA,nrow=nstudents,ncol=2)

for(i in 1:nstudents)
{
ind = which(students_data$student_index <= i)
ns = length(ind)

sample_data = students_data$tree_dbh[ind]

sample_mean_update[i] = mean(sample_data)
sample_sd_update[i] = sd(sample_data)

cit_lb = sample_mean_update[i] - qt(t,df=(ns-1))*(sample_sd_update[i]/sqrt(ns))
cit_ub = sample_mean_update[i] + qt(t,df=(ns-1))*(sample_sd_update[i]/sqrt(ns))

ci_t_update[i,] = c(cit_lb,cit_ub)
}

Now, we can plot these confidence intervals against increasing sample size to see what happens. Just remember that each interval is an interval that is based on an increased sample size than the previous. The first interval has 30 data points, the second one has 60 data points, the third one, 90, the fourth, 120 etc.

# Plot all the CIs #
stripchart(ci_t_update[1,],vertical=T,type="l",col="green",main="",ylab="Diameter (inches)",ylim=c(16,25),xlim=c(1,nstudents),xlab="Increasing Sample Size")
stripchart(sample_mean_update[1],vertical=T,add=T,col="green")

for (i in 2:nstudents)
{
stripchart(ci_t_update[i,],type="l",col="green",main="",add=T,at = i,vertical=T)
stripchart(sample_mean_update[i],col="green", add=T, at = i,vertical=T)
}

lines(students,true_mean,type="o",col="brown",cex=0.5)

See how the interval gets smaller and smaller as the sample size increases. If we do this enough times, the interval vanishes, as we approach the entire population.

T or Z

Christina: If we have a large sample, why can’t we use [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}] as the 100(1-\alpha)\% confidence interval for the true mean \mu? For a large sample size, we can assume s, the sample standard deviation of the large sample as \sigma.

Jenny: Yes, we can do that. As n increases, the difference between T and Z is minimal. However, it is better to use t-distribution for confidence intervals of \mu, so we don’t fall into traps. I will show those traps next week.
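
If you want to see how small that difference is, a quick check (not part of Jenny's class code) is to compare the t and z critical values at the 95% level for a few sample sizes.

# t versus z critical values at the 95% level #
n_check = c(10, 30, 100, 1000)
qt(0.975, df = n_check - 1)   # t-critical values; they approach the z value as n grows
qnorm(0.975)                  # about 1.96

Even at n = 30, the t-critical value of 2.04523 is already close to 1.96.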

Where is the Standard Deviation and Proportion

Jenny: Can you follow the same logic and develop the confidence intervals for \sigma? How many of you will cover the truth, how many of you will not?

Again, approximately, 95% of the intervals will cover the true \sigma, i.e., roughly 38 of your intervals will have the truth, 2 will not. Check this out.

alpha = 0.95
u = alpha + (1-alpha)/2
l = 1 - u

sample_var = matrix(NA,nrow=nstudents,ncol=1)
ci_sd = matrix(NA,nrow=nstudents,ncol=2)

for(i in 1:nstudents)
{
ind = which(students_data$student_index == i)
ns = length(ind)

sample_data = students_data$tree_dbh[ind]

sample_var[i] = var(sample_data)

chi_low = ((ns-1)*sample_var[i])/qchisq(u,df=(ns-1))
chi_up = ((ns-1)*sample_var[i])/qchisq(l,df=(ns-1))

ci_sd[i,] = c(sqrt(chi_low),sqrt(chi_up))
}

true_sd = sqrt((sum((students_data$tree_dbh-true_mu)^2))/length(students_data$tree_dbh))

# False samples #
false_samples_sd = students
for (i in 1:nstudents)
{
if( (ci_sd[i,1] > true_sd) || (ci_sd[i,2] < true_sd) )
{false_samples_sd[i]=1} else
{false_samples_sd[i]=0}
}

false_sd_ind = which(false_samples_sd == 1)

# Plot all the CIs; now show the false samples #
stripchart(ci_sd[1,],type="l",col="green",main="CI on Standard Deviation",xlab="Diameter (inches)",xlim=c(5,15),ylim=c(1,nstudents))
stripchart(sample_sd[1],add=T,col="green")

for (i in 2:nstudents)
{
stripchart(ci_sd[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_sd[i],add=T,col="green",at = i)
}

abline(v=true_sd,lwd=3)

Jenny: Here, all the intervals cover the truth, but in the long run, only about 95% of such intervals will cover the truth.

John: How about proportion?

Jenny: Yes, can you all develop the confidence intervals for the proportion of damaged trees? I want you to use bootstrap confidence intervals for p instead of the one based on the assumption of normal distribution.

So the students typed away a few lines of code to finally create this.

# Confidence Interval of p #
alpha = 0.95
u = alpha + (1-alpha)/2
l = 1 - u

nboot = 1000

sample_p = matrix(NA,nrow=nstudents,ncol=1)
ci_p_boot = matrix(NA,nrow=nstudents,ncol=2)

for(i in 1:nstudents)
{
ind = which(students_data$student_index == i)
ns = length(ind)

sample_data = students_data$brnch_ligh[ind]

sample_p[i] = length(which(sample_data=="Yes"))/ns

bootstrap_replicate_proportion = matrix(NA,nrow=nboot,ncol=1)
for (j in 1:nboot)
{
ind2 = 1:ns
bootstrap_ind = sample(ind2,ns,replace=T)
bootstrap_sample_damage = sample_data[bootstrap_ind]
bootstrap_replicate_proportion[j,1] = length(which(bootstrap_sample_damage=="Yes"))/ns
}

ci_p_boot[i,] = quantile(bootstrap_replicate_proportion,c(l,u))
}

true_p = length(which(students_data$brnch_ligh=="Yes"))/length(students_data$brnch_ligh)

# False samples #
false_samples_p = students
for (i in 1:nstudents)
{
if( (ci_p_boot[i,1] > true_p) || (ci_p_boot[i,2] < true_p) )
{false_samples_p[i]=1} else
{false_samples_p[i]=0}
}

false_p_ind = which(false_samples_p == 1)

# Plot all the CIs; now show the false samples #
stripchart(ci_p_boot[1,],type="l",col="green",main="CI on Proportion",xlab="Proportion",xlim=c(0,0.5),ylim=c(1,nstudents))
stripchart(sample_p[1],add=T,col="green")

for (i in 2:nstudents)
{
stripchart(ci_p_boot[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_p[i],add=T,col="green",at = i)
}

abline(v=true_p,lwd=3)

for (i in 1:length(false_p_ind))
{
j = false_p_ind[i]
stripchart(ci_p_boot[j,],type="l",col="red",lwd=3, main="", add = T, at = j)
stripchart(sample_p[j],col="red", lwd=3, add=T, at = j)
}

You can also check out their full code here.

As the class comes to an end, Samantha looks fulfilled. “Wow, that is quite a day,” she said. “We covered many things today, but more importantly the true meaning of the confidence intervals. There is a 95% probability of selecting a sample whose confidence interval will contain the true value.”

“Assuming our data is collected without any bias,” interjected Jenny. The students looked puzzled.

“This week, I want you to go out and collect the data again, but this time, divide yourselves into boroughs. Eight of you will belong to team Manhattan and will collect tree data from Manhattan only. You will still collect data randomly but will do it only from Manhattan. Eight of you are team Queens and you collect data randomly from Queens. The other three teams, team Bronx, team Brooklyn, and team Staten Island will do the same in their boroughs. We will again have 30 trees each, and a total of 1200 trees.”

So, the students go off on another mission. Some of them are still puzzled or rather have a “what difference does it make” face on them. Some of them are as usual excited to hug the trees again. We will all wait and see what they come up with. More fun with data.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 80 – Riding with confidence, in R: Week 1

Monday, February 11, 2019

In which Jenny shows her students how to compute and visualize confidence intervals in R, and sends them off on a task.

“Did you know that the symbol of the department of parks and recreations is a cross between the leaf of the London plane and a maple leaf?” asked Jenny as she opened her high school data class. She volunteered to help 40 high school kids recently.

The obvious response from the kids is no response.

“It is,” she said. “More than 10% of the trees are London Planetrees. They grow up to 20 – 30 meters, with a trunk circumference of 3 meters.”

She projected a map on the big screen.

“These are the places I randomly selected to visit and record the diameter of the trunk of the London planetrees. You have this data in the handout. We will learn how to develop confidence intervals in R.”

For your analysis in R, the data file can be downloaded from here.”

Let’s Start with the Mean

“Can someone tell me how we can get the confidence interval of the mean diameter?” she asked.

John raised his hand to answer. “Yes, John, go ahead,” said Jenny.

“The 100(1-\alpha)\% confidence interval for the true mean \mu is [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}]. If the sample size n is very large, we can substitute the sample standard deviation s in place of the unknown \sigma. However, for small sample sizes, the sample standard deviation s is itself subject to error. It may be far from the true value of \sigma. So it is preferable to use t-distribution for the confidence interval of the mean since \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} tends to a t-distribution with (n-1) degrees of freedom.

So I would say that the 100(1-\alpha)\% confidence interval of the population mean is the interval [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}].”

“Excellent. Let’s all compute the 95% confidence interval of the mean diameter of the trunk of the London planetrees from the sample I gave you. The 95% confidence interval of the true mean diameter (\mu) is \bar{x} - t_{0.025,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.025,(n-1)} \frac{s}{\sqrt{n}}.”

They all started calculating. Some of them pulled out t-tables to get the t-critical (t_{0.025,29}), i.e., the quantile from the t-distribution corresponding to the upper tail probability of 0.025, and 29 degrees of freedom.

Jenny stopped them. “Don’t compute the interval by hand. Let’s do it in R.

“Read the data file first. I am assuming that you all have the file in the same folder that you set the working directory to.”

# Read the data file -- Jenny's sample data #
jenny_sample_data = read.csv("jenny_data.csv",header=T)
nsamples = nrow(jenny_sample_data)

“Did you notice that the file has seven columns? The diameter of the tree trunk is recorded in the first column, tree_dbh. I am sure you figured out what the other columns are, except maybe the column that reads brnch_ligh that has Yes or No as the inputs. We will get to that in a bit.”

“Let’s compute the sample mean (\bar{x}), sample variance (s^{2}), and the sample standard deviation (s).”

#statistics of the sample #
jenny_samplemu = mean(jenny_sample_data$tree_dbh)
jenny_samplevar = var(jenny_sample_data$tree_dbh)
jenny_samplesd = sd(jenny_sample_data$tree_dbh)

“What you were about to look up from the t-table can be obtained using this command in R.”

qt(0.975,df=29)
[1] 2.04523

“The t-critical value is 2.04523.”

“We can write a few lines to compute the confidence interval for any \alpha level, and show them graphically.”

# Confidence interval of the mean using t-distribution #
alpha = 0.95 # 95% confidence interval
t = alpha + (1-alpha)/2
n = nsamples

cit_lb = jenny_samplemu - qt(t,df=(n-1))*((jenny_samplesd/sqrt(n)))
cit_ub = jenny_samplemu + qt(t,df=(n-1))*((jenny_samplesd/sqrt(n)))

ci_t = c(cit_lb,cit_ub)

stripchart(ci_t,type="l",lwd=3,col="green",main="CI on Mean",xlab="Diameter (inches)")
stripchart(jenny_samplemu,add=T,lwd=3,col="green")

“The sample mean (\bar{x}) is 22.433 inches and the intervals extend out to 19.67 inches on the left and 25.20 inches on the right.”

“There is a 95% probability of selecting this sample for which the confidence interval will contain the true value of \mu.”

The kids looked happy seeing the image on their screen.

Next, let’s explore the Variance and the Standard Deviation

“Do you remember the equation for the confidence interval of the variance and the standard deviation?” Jenny asked.

Samantha wanted to answer.

“The 100(1-\alpha)\% confidence interval of the population variance \sigma^{2} is the interval [\frac{(n-1)s^{2}}{\chi_{u,n-1}}, \frac{(n-1)s^{2}}{\chi_{l,n-1}}]. We can get the square roots of the confidence limits to get the confidence interval on the true standard deviation,” she said.

“The interval [\sqrt{\frac{(n-1)s^{2}}{\chi_{u,n-1}}}, \sqrt{\frac{(n-1)s^{2}}{\chi_{l,n-1}}}] is called the 100(1-\alpha)\% confidence interval of the population standard deviation \sigma.”

Jenny was pleased. She has an outstanding cohort. They always come prepared.

“Yes. In our case, \frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with 29 degrees of freedom. The lower and upper critical values at the 95% confidence interval \chi_{l,29} and \chi_{u,29} can be obtained from the Chi-square table, or, as you guessed, can be computed in R using a simple command. Try these,” she said.

qchisq(0.975,df=29)
[1] 45.72229
qchisq(0.025,df=29)
[1] 16.04707

“Of course, we can automate this and make nice graphics for the intervals. Use this code.”

# Confidence interval of the variance and standard deviation using Chisq-distribution #
u = alpha + (1-alpha)/2
l = 1 - u

chi_low = ((n-1)*jenny_samplevar)/qchisq(u,df=(n-1))
chi_up = ((n-1)*jenny_samplevar)/qchisq(l,df=(n-1))

ci_var = c(chi_low,chi_up)
ci_sd = c(sqrt(chi_low),sqrt(chi_up))

stripchart(ci_var,type="l",lwd=3,col="green",main="CI on Variance",xlab="Diameter^2 (inches^2)")
stripchart(jenny_samplevar,add=T,lwd=3,col="green")

stripchart(ci_sd,type="l",lwd=3,col="green",main="CI on Standard Deviation",xlab="Diameter (inches)")
stripchart(jenny_samplesd,add=T,lwd=3,col="green")

“The sample variance is 54.74 inches^{2}, with a lower bound of 34.72 inches^{2} and an upper bound of 98.92 inches^{2}. Notice how the intervals are not symmetric. Do you remember why?”

“The sample standard deviation is 7.40 inches. The 95% lower and upper bounds are 5.90, and 9.95 inches respectively.”

The kids spent a few minutes typing up the lines of code to get the intervals and make the graphics.

Jenny continued the lesson.

“Now, look at the third column in the data file, brnch_ligh. When I measured the trunk circumference, I also noticed that, for some trees, there were branch problems caused by lights or wires. I recorded this as a “yes” or a “no” depending on whether or not there was a problem. Who would have thought that putting nice colorful lights on the trees could be damaging 🙁

Can one of you count through the sample and tell me the estimate of the proportion of trees (\hat{p}) that are damaged?”

“0.1,” someone answered from the back.

“Great, now, that invisible person, can you tell me how we can compute the confidence interval of the true proportion?”

Christina it was. She answered, “the 100(1-\alpha)% confidence interval for the true proportion p is [\hat{p} - Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}], assuming that the estimate \hat{p} can be approximated by a normal distribution for a reasonably large sample size.”

“And, here is how you do that in R,” added Jenny, as she typed up these lines.

# Confidence interval of the proportion using z-distribution # 
jenny_sampleproportion = length(which(jenny_sample_data$brnch_ligh=="Yes"))/nsamples

z = alpha + (1-alpha)/2

p_low = jenny_sampleproportion - qnorm(z)*sqrt((jenny_sampleproportion*(1-jenny_sampleproportion))/(n))
p_up = jenny_sampleproportion + qnorm(z)*sqrt((jenny_sampleproportion*(1-jenny_sampleproportion))/(n))

ci_p = c(p_low,p_up)

stripchart(ci_p,type="l",lwd=3,col="green",main="CI on Proportion",xlab="Damage Proportion")
stripchart(jenny_sampleproportion,add=T,lwd=3,col="green")

“The sample estimated damage proportion (\hat{p}) is 0.1 since there are 3 damaged trees in a sample of 30 trees. The lower bound and the upper bound of the 95% confidence interval are -0.007 and 0.207, symmetric around 0.1,” Jenny interpreted as she looked at the plot and then at the kids.

Some students were shaking their head in disagreement. They don’t like the fact that the confidence interval produced a negative value. What could have gone wrong?

Christina jumped up at once. “Didn’t we learn last week that when p is near the boundaries, the sampling distribution exhibits skew and is not symmetric? Then it is not appropriate to approximate it with a normal distribution, even for large sample sizes. Should we perhaps develop the sampling distribution and the confidence intervals of the damage proportion using the idea of the Bootstrap?”

“That is an astute observation Christina,” said Jenny. “Yes, since we have an estimate close to 0, it is better to develop the bootstrap confidence intervals. That will eliminate the errors induced due to inappropriate assumptions,” she added.

Let’s do a Bootstrap in R

“The basis for the bootstrap is that the sample data of 30 trees can be used to approximate the probability distribution function of the population. By putting a probability of 1/n on each data point, we use the discrete empirical distribution \hat{f} as an approximation of the population distribution f. It is easy enough to think of drawing numbers with replacement from these 30 numbers. Since each value is equally likely, the bootstrap sample will consist of numbers from the original data, some may appear more than one time, and some may not appear at all in a random sample.” Jenny explained the core concept of the bootstrap once again before showing them the code to do it.

# Bootstrap confidence intervals ## 
nboot = 1000
bootstrap_replicate_proportion = matrix(NA,nrow=nboot,ncol=1)

bootstrap_replicate_mean = matrix(NA,nrow=nboot,ncol=1)
bootstrap_replicate_var = matrix(NA,nrow=nboot,ncol=1)
bootstrap_replicate_sd = matrix(NA,nrow=nboot,ncol=1)

for (i in 1:nboot)
{
ind = 1:nsamples
bootstrap_ind = sample(ind,nsamples,replace=T)

bootstrap_sample_damage = jenny_sample_data$brnch_ligh[bootstrap_ind]

bootstrap_replicate_proportion[i,1] = length(which(bootstrap_sample_damage=="Yes"))/nsamples

bootstrap_sample_diam = jenny_sample_data$tree_dbh[bootstrap_ind]

bootstrap_replicate_mean[i,1] = mean(bootstrap_sample_diam)
bootstrap_replicate_var[i,1] = var(bootstrap_sample_diam)
bootstrap_replicate_sd[i,1] = sd(bootstrap_sample_diam)
}

“Here, I am drawing the bootstrap sample 1000 times, and, for each bootstrap sample, I am computing the proportion, the mean, the variance, and the standard deviation. So, in the end, I will have 1000 replicates of the damage proportion, the sample mean, sample variance and standard deviation — our sampling distributions.

Drawing a bootstrap in R is very simple. Just use the “sample” command.

ind = 1:nsamples
sample(ind,nsamples,replace=T)

I first create an indicator vector that has numbers 1 to 30. From this vector, I draw samples with replacement to get numbers from 1 to 30; some may appear more than once, some may not appear, depending on the sample. These are the trees that we selected as part of the bootstrap sample. From these trees, we take the diameter and whether or not each is damaged due to lights. The rest of the lines in the loop just compute the statistics from the bootstrap sample.” Jenny clearly explained her lines.

“Now, let’s plot the distribution of the damage proportion derived from the bootstrap samples. Type these lines.”

# Distribution of the sample proportion #
hist(bootstrap_replicate_proportion,main="Bootstrap Replicates of the Damage Proportion",font=2,font.lab=2,xlab="Sample Proportion")

ci_p_boot = quantile(bootstrap_replicate_proportion,c(0.025,0.975))
abline(v=ci_p_boot,col="red")

“The 95% lower and the upper confidence limits are the 2.5th and the 97.5th percentiles of the sampling distribution. We can show them as red lines on the distribution.”

“Compare this to the previous confidence interval computed by assuming a normal distribution. We can see that the sampling distribution of the proportion is skewed and the interval is asymmetric. It is also not producing any negative values,” said Jenny as she showed a few more lines of tricks.

# compare with previous CI # 
stripchart(ci_p_boot,type="l",lwd=3,col="black",main="CI on Proportion",xlab="Damage Proportion")
stripchart(jenny_sampleproportion,add=T,lwd=3,col="black")
text(0.17,1.1, "Bootstrap Confidence Interval")

stripchart(ci_p,type="l",add=T,at=0.7,lwd=3,col="green",main="CI on Proportion",xlab="Damage Proportion")
stripchart(jenny_sampleproportion,at=0.7,add=T,lwd=3,col="green")

“I have a few more things in here to compute the bootstrap confidence interval of the mean, variance, and the standard deviation. You can check them out for yourself.

Here is the full code.

Jenny was preparing to end the class.

“Can we know the true mean, variance, and the proportion?” someone asked.

Jenny was happy to hear this. She had planned an assignment for them exactly for this.

“From this sample alone, we cannot say. We can, however, say that there is a 95% probability of selecting this sample for which the confidence interval will contain the true value of \mu, \sigma^{2} and p. But, I want you to understand it by doing it yourself.

So, here’s the deal. This week, I want you to go out to the city and collect a sample of 30 trees. For each tree, I want you to measure the diameter and whether or not it is damaged. Each one of you will randomly select 30 locations in the city and bring back a sample. Make sure you more or less cover all the boroughs. In the end, we will have 40 different samples, each with 30 data points. We will have much fun analyzing that data, answering your questions, and getting a deeper understanding of the confidence intervals, in R.” Jenny ended her session by saying this.

So the kids left with excitement, to spend their spare time during the week collecting data for the trees in the City. You may encounter them this week. They might be hugging a tree or two. Don’t mistake them for some tree-hugging hippies. They are teaching you how to be confident with data.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 70 – The quest for truth: Learning estimators in R, part II

T-2 days

Wilfred Jr. was typing away on his Macintosh. Keneth is writing something in his notebook.

“Do you have an estimate yet?” asked Wilfred. “In a minute.”

Wilfred Jr. and Keneth were busy preparing for the meeting in a couple of days. They are systematically analyzing the data they collected over the past few weeks. Their workplace was well lit with a strong beam of sunlight through the window. They both looked out; it was deceivingly pleasant. It is one of those newsworthy summer days. The local weatherman has already given it a rank 2 in his “new normals” show.

The wall behind the big screen computer is a full corkboard that has sticky notes and pins.

“We have an answer for the first two questions,” said Ken. “Last week, we checked this and found that the true proportion of dummies is around 2%, or 0.02024, to be precise.”

“There you go. Look at this.” Wilfred hit Command + Enter on his keyboard. “I like this version better,” he said.

Keneth looked at the code and wondered if Willy would explain to him what was going on.

“So, we are showing how the estimate for the proportion of dummies \hat{p}=\frac{r}{n} changes as we add more data from the stores. The more data we add (i.e., the more stores we visit), the closer the estimate gets to the truth. After a while, the change in the value becomes small and the estimate stabilizes and converges to the truth. It is the consistency criterion of the estimators,” said Wilfred as he points to the blue line on the graph.

“What are the red dots and the boxplot that shows up at the end?” asked Keneth.

“Good questions. Each red dot that appears vertically at the end is the estimated proportion for a given store. So, if a store has 2% dummies, the estimate for that store is 0.02, and a red dot appears at 0.02. Like this, we have 1000 different sample estimates for the proportion of dummies. The boxplot at the end shows this distribution, the distribution of the sample proportions. Did you notice anything?” asked Wilfred.

Ken looked at the plot again, this time with more attention.

“The blue line meets the big red circle. I can see that the big red circle is the mean of the red dots, E[\hat{p}]. The boxplot also looks asymmetric. The expected value of the sample proportions and the long-run estimate are the same.”

E[\hat{p}]=p=0.02024

Wilfred nods in acceptance. “We discussed this last week. Since the estimate \frac{r}{n} is an unbiased estimate of the true proportion, the mean of the sample distribution will be the true proportion. The long-run converged estimate is also the same value.”
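
A one-line check of that unbiasedness: if we treat r, the number of dummies in a pack of n candies, as a Binomial(n, p) random variable, then E[r] = np, and

E[\hat{p}] = E[\frac{r}{n}] = \frac{1}{n}E[r] = \frac{np}{n} = p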

“Can you share the code with me? I want to take some time and learn how to make these visuals.” Keneth is imagining how cool he would look in front of his friends when he presents his summer project.

“Sure. Here it is. I wrote comments for each line. It should be self-explanatory.”

# Graphic analysis for proportion #

# For each store, compute the sample proportion #
# The expected value of these estimates is the true parameter # 
# With each store visited, we can recompute the sample proportion #
# This will converge and stabilize at the true parameter value #

# Reading the candy data file #
candy_data = read.table("candy_samples.txt",header=F)

nstores = ncol(candy_data) # number of stores 
ncandies = nrow(candy_data) # number of candies in a pack

candies = NULL # an empty vector that will be filled with actual candies data from each store
dummies = NULL # an empty vector that will be filled with dummy candies data from each store

sample_proportion = NULL # as the sample increases, we re-compute the sample proportion of dummy candies

sample_p = matrix(NA,nrow=nstores,ncol=1) # saving the estimate from each store. The expected value of this vector = truth

plot(0,0,col="white",xlim=c(0,1300),ylim=c(0,0.08),xlab="Stores",ylab="Proportion of Dummies",font=2,font.lab=2)

for (i in 1:nstores)
{
# sample pack of candies and the index of dummies in the pack
candy_pack = candy_data[,i]
dummy_index = which(candy_pack < 1.5)

# separating candies from the dummies in each pack 
sample_dummies = candy_pack[dummy_index]

if(sum(dummy_index)>0){sample_candies = candy_pack[-dummy_index]}else{sample_candies = candy_pack}

# recording the weights on candies and dummies -- this vector will grow as we visit more stores, i.e., collect more samples 
dummies = c(dummies,sample_dummies)
candies = c(candies,sample_candies)

# computing the proportion of dummies from the overall incremental sample -- each incremental value is a better estimate with greater sample size.
p = length(dummies)/(ncandies*i)
sample_proportion = c(sample_proportion,p)

# storing the sample proportion for the store in a matrix
sample_p[i,1] = length(dummy_index)/ncandies

# plot showing the trace of the sample proportion -- converges to truth
points(i,sample_proportion[i],cex=0.075,col="blue")
points(1000,sample_p[i],cex=0.1,col="red")
Sys.sleep(0.05)
#print(i)
}

Sys.sleep(1)
points(1000,mean(sample_p),pch=19,col="red",cex=2)
Sys.sleep(1)
boxplot(sample_p,add=T,at=1050,boxwex=25,range=0)
Sys.sleep(1)
text(1200,mean(sample_p),expression(E(hat(p))==0.02),col="red",font=2)

What are the population characteristics?

Their attention now shifted to the final question. When the day started, they discussed some ideas and decided they will present the analysis that investigated for the true mean and the true variance of the population.

If the weight of all the Quality Candies is the population, then, it can be represented using the mean (\mu) and the variance (\sigma^{2}) of the population.

Willy and Ken’s task is to find these values. Estimate the parameters.

Here is what they did.

They visited the first shop and found that the estimated proportion of dummy candies is 0.02. As Ken separated the candies from the dummies, he also put them on a weighing scale and measured the weight of each candy.

The average weight of the 98 candies from the first store is 3.00748 grams.

Ken estimated this using the formula \hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}. This is the maximum likelihood estimator for the true mean.

They record this as the estimated sample mean for the first store.

They went to the second store. Here they did not find any dummies. The estimated sample mean weight of the 100 candies from this store is 3.0073 grams. This is another estimate of the true mean.

Overall, they now have 198 candies (a bigger sample). Ken also computed the sample mean for these 198 candies. This comes out as 3.007389 grams.

Like this, as they visited more and more stores, they re-computed the average weight of the candies using a larger sample. They also computed the individual estimate of the sample mean for that store.

Like in the case of the proportion of dummies, they want to check the convergence of the estimated mean.

“I think you can use the same code as before, but change the lines where you computed the proportions to computing the sample means,” said Ken as he is seen knee-deep into the lines of the code.

“You are correct! All I did was that and here we go. I changed the color of the converging line.”

“I will interpret this now,” said Ken. “The sample mean  \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is an unbiased estimate of the population mean  \mu . I know it from Lesson 67. The big red circle is the expected value of the sample means, the center of the distribution. So this is E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]. This is equal to the true mean \mu. The orange line, the estimate with increasing sample size also converges to this number.”

E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}] = \mu = 3

“Excellent. What else do you observe? Anything from the boxplot?”

“Hmm, it does look symmetric. Is the distribution of the sample means normally distributed?” asked Ken.

“I will let you mull over that. You do know about the Central Limit Theorem, right?” Wilfred finished writing some comments in his code.

# Graphic analysis for mean #

# For each store, compute the sample mean #
# The expected value of these estimates is the true parameter # 
# With each store visited, we can recompute the sample mean #
# This will converge and stabilize at the true parameter value #

nstores = ncol(candy_data) # number of stores 
ncandies = nrow(candy_data) # number of candies in a pack

candies = NULL # an empty vector that will be filled with actual candies data from each store
dummies = NULL # an empty vector that will be filled with dummy candies data from each store

sample_meanweight = NULL # as the sample increases, we re-compute the sample mean of the candies

sample_mean = matrix(NA,nrow=nstores,ncol=1) # save the value of sample mean for each store

plot(0,0,xlim=c(0,1200),ylim=c(2.945,3.055),xlab="Stores",ylab="Average Candy Weight",font=2,font.lab=2)

for (i in 1:nstores)
{
# sample pack of candies and the index of dummies in the pack
candy_pack = candy_data[,i]
dummy_index = which(candy_pack < 1.5)

# separating candies from the dummies in each pack 
sample_dummies = candy_pack[dummy_index]

if(sum(dummy_index)>0){sample_candies = candy_pack[-dummy_index]}else{sample_candies = candy_pack}

# recording the weights on candies and dummies -- this vector will grow as we visit more stores, i.e., collect more samples 
dummies = c(dummies,sample_dummies)
candies = c(candies,sample_candies)

# computing the mean weight of the candies -- this vector will grow .. each incremental value is a better estimate with greater sample size. 
xbar = mean(candies)
sample_meanweight = c(sample_meanweight,xbar)

# storing the sample mean weight for the store in a matrix
sample_mean[i,1] = mean(sample_candies)

# plot showing the trace of the sample mean -- converges to truth
points(i,sample_meanweight[i],cex=0.15,col="orange")
points(1000,sample_mean[i],cex=0.15,col="red")
Sys.sleep(0.05)
print(i)
}

Sys.sleep(1)
points(1000,mean(sample_mean),pch=19,col="red",cex=2)
Sys.sleep(1)
boxplot(sample_mean,add=T,range=0,boxwex=25,at=1050)
Sys.sleep(1)
text(1150,mean(sample_mean),expression(E(hat(mu))==3),col="red",font=2)

“What about Variance?” asked Wilfred.

“Same procedure. In fact, when I computed the sample mean for each store and the overall incremental sample, I also computed the sample variances using the maximum likelihood estimator \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}. I used this formula for each store and for the overall incremental sample.”

“Give me a few minutes, I will code this up and show.” Ken looked excited. He finally got to write his code and look at the outputs. He typed up a few lines, mostly editing the previous versions.
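
The exact lines are in the full code linked at the end; a minimal sketch of Ken's edit, reusing candy_data and nstores from the earlier code and leaving out the plotting, might look like this.

# Sketch: MLE of the variance for each store and for the growing overall sample #
# (candy_data and nstores are from the earlier code; plotting is left out) #
candies = NULL                                 # weights of actual candies seen so far
sample_varweight = NULL                        # MLE variance of the growing overall sample
sample_var = matrix(NA,nrow=nstores,ncol=1)    # MLE variance for each store

for (i in 1:nstores)
{
# separate this store's candies from the dummies, as before
candy_pack = candy_data[,i]
dummy_index = which(candy_pack < 1.5)
if(sum(dummy_index)>0){sample_candies = candy_pack[-dummy_index]}else{sample_candies = candy_pack}

candies = c(candies,sample_candies)

# divide by n, not (n-1): the maximum likelihood estimator
sample_varweight = c(sample_varweight, sum((candies-mean(candies))^2)/length(candies))
sample_var[i,1] = sum((sample_candies-mean(sample_candies))^2)/length(sample_candies)
}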

Here is what they found.

Ken looked puzzled. He expected to see the same pattern, however, much to his surprise, he found that the expected value of the sample variance E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}] and the long-run converged estimate for the variance are not the same.

E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}] \neq \sigma^{2}

“Is there a bias?” he asked.

Wilfred was expecting this. He had done his homework from Lesson 67!

“Yes, there is.”

“The maximum likelihood estimator \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}} is a biased estimator.”

“Let me show you why.”

“I will rewrite the estimator as \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}} by using \bar{x} in place of \mu.”

\hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}

=\frac{1}{n}\sum(x_{i}^{2} + \bar{x}^{2} -2\bar{x}x_{i})

=\frac{1}{n}\sum x_{i}^{2} + \frac{1}{n} \sum \bar{x}^{2} - \frac{2}{n}\sum x_{i}\bar{x}

=\frac{1}{n}\sum x_{i}^{2} + \frac{1}{n} n \bar{x}^{2} - 2 \bar{x} \bar{x}

=\frac{1}{n}\sum x_{i}^{2} + \bar{x}^{2} - 2 \bar{x}^{2}

=\frac{1}{n}\sum x_{i}^{2} - \bar{x}^{2}

\hat{\sigma^{2}}=\frac{1}{n}(\sum x_{i}^{2} - n\bar{x}^{2})

“Now, let’s apply the expectation operator on this formula.”

E[\hat{\sigma^{2}}]=E[\frac{1}{n}(\sum x_{i}^{2} - n\bar{x}^{2})]

=\frac{1}{n}(\sum E[x_{i}^{2}] - n E[\bar{x}^{2}])

=\frac{1}{n}(\sum (V[x_{i}]+(E[x_{i}])^{2}) - n (V[\bar{x}]+(E[\bar{x}])^{2}))

“I am sure you know why I wrote that step. It is from the property of variance operator.”

“Now, since x_{1}, x_{2}, ..., x_{n} are independent and identically distributed random samples from the population with true parameters \mu and \sigma^{2}, they will have the same expected value and variance as the population. E[x_{i}]=\mu and V[x_{i}]=\sigma^{2}. I will use this simple idea to reduce our equation.”

E[\hat{\sigma^{2}}] = \frac{1}{n}(\sum (\sigma^{2}+\mu^{2}) - n (V[\bar{x}]+(E[\bar{x}])^{2}))

“If you go back and read Lesson 68 (Bias joins Variance), you will see that V[\bar{x}]=\frac{\sigma^{2}}{n}. You already know that E[\bar{x}]=\mu. Let’s substitute those here.”

E[\hat{\sigma^{2}}] = \frac{1}{n}(\sum (\sigma^{2}+\mu^{2}) - n (\frac{\sigma^{2}}{n}+\mu^{2}))

E[\hat{\sigma^{2}}] = \frac{1}{n}(n\sigma^{2} + n\mu^{2} - \sigma^{2} -n\mu^{2})

E[\hat{\sigma^{2}}] = \frac{n-1}{n}\sigma^{2}

“So, \hat{\sigma^{2}} is biased by -\frac{\sigma^{2}}{n}.”

“That is why the expected value of the per-store sample variances is not the same as the value where the overall incremental estimate (the population variance) converges. The estimator with n in the denominator underestimates the true variance \sigma^{2} because of the negative bias.”

If you want an unbiased estimator, you should use \frac{1}{n-1} in the denominator instead of n.

The unbiased estimator for variance is \hat{\sigma^{2}}=\frac{1}{n-1}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}
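
A minimal sketch of that adjustment, again assuming candy_data, nstores, and the 1.5-gram cutoff from the earlier code. Note that R’s built-in var() already divides by n-1.

# biased (1/n) versus bias-adjusted (1/(n-1)) variance for each store
var_mle = rep(NA,nstores)
var_adjusted = rep(NA,nstores)
for (i in 1:nstores)
{
 sample_candies = candy_data[candy_data[,i] >= 1.5, i]
 n = length(sample_candies)
 var_adjusted[i] = var(sample_candies) # var() divides by n-1
 var_mle[i] = ((n-1)/n)*var(sample_candies) # rescale to recover the 1/n estimator
}
mean(var_mle) # falls below the true variance
mean(var_adjusted) # centered on the true variance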

Wilfred added a few lines in the code and showed this plot to Ken.

“Look at this. I have added another boxplot to your figure. This boxplot is the distribution of the bias-adjusted (unbiased) variance estimates. The green dot is now the expected value of this distribution, which matches with the true variance.”

Ken looks at it with a happy face, but something seems to bother him.

“What is it?” asked Jr.

“Is the distribution of variance asymmetric?” he asked.

“Yes. Is that all?”

“Well, if the variance is bias-adjusted to match the truth, does it also mean the standard deviation, which is \sqrt{\hat{\sigma^{2}}}, will get adjusted?”

“That is a good question.”

No, although \hat{\sigma^{2}}=\frac{1}{n-1}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}} is an unbiased estimator for \sigma^{2}, \hat{\sigma}=\sqrt{\frac{1}{n-1}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}} is not an unbiased estimator for \sigma.

“Here, I re-ran the code with some changes to show this. The adjusted estimate still has some bias.”
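
If you want a quick stand-alone check, a small simulation with made-up normal weights (using the true values quoted just below, a mean of 3 grams and a standard deviation of 0.2 grams) shows the same thing.

# the square root of the bias-adjusted variance still sits a little below sigma
set.seed(1)
sd_estimates = replicate(10000,sd(rnorm(100,mean=3,sd=0.2)))
mean(sd_estimates) # slightly less than 0.2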

“Very interesting. I guess we are ready for the meeting. The true mean weight of the candies is 3 grams, and the true standard deviation of the weight is 0.2 grams. Can we have the code in the archives?” Ken starts packing his bag as he says this. He looks exhausted too.

“Here is the full code.”

T-0 days

Mr. Alfred Quality and Wilfred Quality Sr. were disappointed to know that there is a systematic packing error, although it is only 2%. They deployed their engineering team to fix this immediately.

They were nevertheless very impressed with the quality work of Wilfred Quality Jr. He owned the problem and came up with the answers. All along, he had collected a large sample that can be used for future tests, trained an enthusiastic high-school senior and showed that he is ready to be the next Mr. Quality for Quality Candies.

As he was leaving the boardroom, Mr. Alfred Quality asked him:
“Junior, great work, but how confident are you that the true value lies within two standard deviations?”

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 69 – The quest for truth: Learning estimators in R, part I

For the fifteenth time, Mr. Alfred Quality looks at the two candies in front of him and shakes his head. He is visibly disgruntled. He spent 50 years selling quality candies in North Dakota. The kids love it. Quality Candies are local, and classic North Dakotan. Now, this new “Classic Series Pineapple Candy” seems to have deviated from his expectations.

There is a knock on the door. “That must be Wilfred,” thinks Mr. Quality as he speaks in an elevated voice, “Come on in.”

Wilfred Quality, son of Alfred Quality, gently opens the doors and walks in. He is well dressed and looks around 50 years old. He notices his father’s displeasure as he steps closer.

“Son, I have noticed a lapse in quality in this Classic Series Candy. There is a candy with no candy. What would our customers think? We need immediate action on this. I want to know what proportion of our candies in the stores are dummies. We have to fix this before the customers accuse us of cheating by filling in empty candies. Is this a packing machine error, or is it a trick to adjust for inflation?”

“I will look into it. It must be a machine error. There would be no such inflation adjustment tricks, I assure you,” said Wilfred, the current CEO of Quality Candies. He has been running the business very efficiently since his father started taking golf seriously.

“I will ask Junior to collect the data and give us the answers,” he said.

∞∞∞

Wilfred Quality Jr. is an enthusiastic lad. He is always in the quest for the truth. This summer, he is interning at one of his father’s offices in Hillsboro. He was unhappy at the quality issue and was determined to find the answers. In his mind, he is also thinking this would make a neat little data analysis project.

His father and grandfather gave him three questions to investigate.

What proportion of their candies is wrongly packed – dummies?

Is it a one-off error or is it systematic?

What is the true nature of our candies, i.e., what are the population characteristics?

He had 30 days, enough monetary resources (it’s his company after all), and Rstudio at his disposal.

∞∞∞

For the next 25 days, he spent his time visiting around 40 local stores a day. From each store, he purchased a pack (100 candies) of “Classic Series Pineapple Candy.”

He also hired Keneth, the kid next door to help him with some experiments. Keneth’s job was to take the candies out of the candy pack and separate the candies from the no candies, like this.

He would also weigh them and enter the readings into a data table. Some of those candies become his prizes after he is done with the experiment.

Wilfred Jr. then analyzes them in RStudio.

All in all, they visited 1000 stores and prepared this data file. The data file is a matrix with 100 rows and 1000 columns. Each column is the data of the weight of the candies from each store.

They take the data from the first store and make a simple dot chart.

“See, the two dummies I found,” said Ken after looking at the plot on Wilfred’s screen.

“So that is like 2 in 100, a 2% error. Let’s take the second store and see what comes up,” said Wilfred.

“Hmm. There are no dummies in this sample. That’s 2 in 200 then – 1% error.”

“Should we plot a few stores and see if there are variations?” said Ken.

“Yes, we can do that. Let me take the first 12 stores and make the plots.” Wilfred types the following lines in RStudio and gets the plot.

# increasing number of stores # 
par(mfrow=c(4,3))
for (i in 1:12)
{
store_sample = candy_samples[,i]
stripchart(store_sample,font=2,cex=1.25,pch=21,cex.axis=1.5)
}

They see some variation. Some candy samples have dummies, some do not. “It looks like the dummies are not a one-off incident. It is a pattern. There is some proportion that are dummies,” thought Wilfred.

They decided to dig deeper.

They also decided to use around 1.5 grams as a cutoff weight to screen for dummies.

If the weight is less than 1.5 grams in a sample, they will call it a dummy and compute the proportion of samples that are dummies.

Wilfred Jr. knows that the maximum likelihood estimator of p, the true proportion of dummies, is \hat{p}=\frac{r}{n}.

r is the number of dummies in a sequence of n independent and identically distributed random samples. He is using this knowledge to estimate the proportion of dummies.

The more the samples, the better his estimate will be.

Each store he visits gives him a new sample. The total number of candies keeps increasing. The more samples he gets, the closer he gets to the true population.

As the samples increase, the estimate gets closer to the truth.

“Let’s compute two estimates. \hat{p_{1}} as something that keeps changing as we add more samples; getting better as n increases. \hat{p_{2}} as the estimate for the store. Since we are visiting 1000 stores, \hat{p_{2}} is a distribution of size 1000. It represents the overall distribution of the proportion of dummies across all our stores.”

Wilfred seems to be speaking to himself. Ken is only interested in counting the candies.

Seeing him indifferent, Wilfred showed him the following video on his screen.

Ken’s face lit up. “It is wobbly at the beginning, but seems to stabilize at 0.02,” he said.

“Yes, as the sample size increases, the estimate improves and gets closer to the true proportion. It looks like our true proportion of dummies is somewhere around 2%, 0.02024 to be precise.”

This is the consistency criterion of the estimates.

\displaystyle{\lim_{n\to\infty} P(|T_{n}(\theta)-\theta|>\epsilon)} \to 0

Estimators are a function of sample size. As n approaches infinity, the sample estimate approaches the true parameter.

Fisher put it this way.

Wilfred also showed another graphic, the distribution of \hat{p_{2}}.

This is the frequency distribution of the estimated proportion from each sample. For example, \hat{p_{2}} for the first store is 0.02, for the second store is 0 and so on.

Wilfred pointed out that the estimate \frac{r}{n} is an unbiased estimate of the true proportion of dummies. The distribution of the estimate \hat{p_{2}} will be centered on the true proportion p.

He is showing the expected value of the distribution, 0.02024, as the red triangle in the plot.

E[\hat{p_{2}}] = 0.02024, the same value we got from the large sample (population).

While there is variability from store to store, the average proportion of dummies is 0.02024.

Keneth is now curious about how to do these calculations and make such visuals.

Wilfred showed him his code.

# Proportions #

nstores = ncol(candy_samples) # number of stores 
ncandies = nrow(candy_samples) # number of candies per store

increase_sample = NULL # an empty vector that will grow as more and more samples are added --> approaches the population. 
n = NULL # an empty vector that will be filled with the growing sample size.

phat1_bad = NULL # proportion changes over samples... and stabilizes: Consistency or Ergodicity
phat2_bad = matrix(NA,nrow=nstores,ncol=1) # proportion value for each store. The expected value of this vector should be the population proportion E[phat]

for (i in 1:nstores)
{
store_sample = candy_samples[,i]
increase_sample = c(increase_sample,store_sample)
n = c(n,length(increase_sample))

p1 = length(which(increase_sample<1.5))/n[i]
phat1_bad = c(phat1_bad,p1)


p2 = length(which(store_sample<1.5))/ncandies
phat2_bad[i,1] = p2
}

# plot showing the consistency property #
sam = seq(from=1,to=1000,by=20)
for(i in 1:length(sam))
{
plot(0,0,col="white",xlim=c(0,100000),ylim=c(0.005,0.03),font=2,cex.lab=1.25,font.lab=2,xlab="Samples Size",ylab="Proportion")
lines(n[1:sam[i]],phat1_bad[1:sam[i]])
txt = paste("n = ",n[sam[i]])
text(10000,0.03,txt,col="red")
Sys.sleep(0.5)
}

# Histogram of phat2_bad # 
hist(phat2_bad,font=2,main="",xlab="Proportion",font.lab=2,col="grey")
points(mean(phat2_bad),0.0001,pch=17,cex=2.5,col="red")

While Ken and Wilfred start exploring the candy characteristics, their true nature, you can set up today’s experiment yourself and understand the properties of consistency and bias more closely using the data. See you in one week. Happy coding.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 60 – Extreme value distributions in R

“In theory, there is no difference between theory and practice. In practice, there is.” — ??

By now, you recognize the pattern in this classroom. We follow up theory with practice. A set of lessons with a common theme will culminate with some experience in R. Being true to our trend, today, we leap into the world of extRemes. Put on your hard hat and bring your tools and machinery. Don’t forget your drink of the day.

On the seventeenth day of March, two thousand eighteen, we met Maximus Extremus Distributus. He was taking different forms depending on the origin (parent) distribution. You can create him too using a few simple lines of code in R.

Here is the creation code for normal origins. We generate N = 1000 normally distributed random variables with a zero mean and unit standard deviation, select the maximum value out of these 1000 values, and repeat the process 1000 times to get 1000 maximum values. These maximum values converge to the Type I extreme value distribution – Gumbel (e^{-e^{-y}}). The code runs like an animation. You can control the speed by changing the number in Sys.sleep(). There is a plot that shows the convergence of the average values to a Normal distribution. Do you know why?

library(locfit)
library(logspline)

# Normal Origins #
par(mfrow=c(3,1))
x = rnorm(10000,0,1)
plot(logspline(x),xlim=c(-5,5),ylim=c(0,0.5),font=2,cex.lab=2,font.lab=2,xlab="Normal Distribution")

N = 1000
ave = matrix(NA,nrow=N,ncol=1)
max1 = matrix(NA,nrow=N,ncol=1)

for (i in 1:N)
{
 x = rnorm(N,0,1)
 
 lines(locfit(~x),col="grey")
 points(mean(x),0,col="blue",pch=17)
 points(max(x),0,col="red",pch=17)
 ave[i] = mean(x)
 max1[i] = max(x)
 Sys.sleep(0.01) 
}

Sys.sleep(1)
plot(locfit(~ave),xlim=c(-5,5),ylim=c(0,9),ylab="",cex.lab=2,font=2,font.lab=2,xlab="Normal Distribution",col="white")
lines(locfit(~ave),lwd=2,col="blue")

Sys.sleep(1)
plot(locfit(~max1),xlim=c(-5,5),font=2,font.lab=2,ylab="",xlab="Extreme Value Distribution (Gumbel)",cex.lab=2,col="white")
lines(locfit(~max1),lwd=2,col="red")



The creation code for exponential origins has the same procedure. We generate N = 1000 exponentially distributed random variables with \lambda = 0.5 as the parent. The maximum values of an exponential distribution again converge to the Gumbel distribution.

## EVD from Exponential## 
par(mfrow=c(3,1))
x = rexp(10000,0.5)
hist(x,prob=T,xlim=c(0,25),ylab="",main="",font=2,cex.lab=2,font.lab=2,xlab="Exponential Distribution")
plot(logspline(x),add=T)

N = 1000

ave = matrix(NA,nrow=N,ncol=1)
max1 = matrix(NA,nrow=N,ncol=1)

for (i in 1:N)
{
 x = rexp(N,0.5)
 
 lines(locfit(~x),col="grey")
 points(mean(x),0,col="blue",pch=17)
 points(max(x),0,col="red",pch=17)
 ave[i] = mean(x)
 max1[i] = max(x)
 Sys.sleep(0) 
}

Sys.sleep(1)
plot(locfit(~ave),xlim=c(0,25),ylim=c(0,4),ylab="",cex.lab=2,font=2,font.lab=2,xlab="Normal Distribution",col="white")
lines(locfit(~ave),lwd=2,col="blue")

Sys.sleep(1)
plot(locfit(~max1),xlim=c(0,25),font=2,font.lab=2,ylab="",xlab="Extreme Value Distribution (Gumbel)",cex.lab=2,col="white")
lines(locfit(~max1),lwd=2,col="red")


Here is the creation code for uniform origins. This time the maximum values from uniform distribution converge to a different type of extreme value distribution, the Type III Weibull distribution (e^{-(-y)^{\gamma}}).

# Uniform Origins #
x = runif(100000,0,1)
#hist(x,prob=T,xlim=c(0,1),ylab="",main="",font=2,cex.lab=2,font.lab=2,xlab="Uniform Distribution")
plot(density(x),xlim=c(0,1),ylab="",main="",font=2,cex.lab=2,font.lab=2,xlab="Uniform Distribution")

N = 200

ave = matrix(NA,nrow=N,ncol=1)
max1 = matrix(NA,nrow=N,ncol=1)

for (i in 1:N)
{
 x = runif(N,0,1)
 
 # lines(locfit(~x),col="grey")
 points(mean(x),0,col="blue",pch=17)
 points(max(x),0,col="red",pch=17)
 ave[i] = mean(x)
 max1[i] = max(x)
 Sys.sleep(0) 
}

Sys.sleep(1)
plot(locfit(~ave),xlim=c(0,1),ylim=c(0,30),ylab="",cex.lab=2,font=2,font.lab=2,xlab="Normal Distribution",col="white")
lines(locfit(~ave),lwd=2,col="blue")

Sys.sleep(1)
plot(locfit(~max1),xlim=c(0,1),ylim=c(0,65),font=2,font.lab=2,ylab="",xlab="Extreme Value Distribution (Weibull)",cex.lab=2,col="white")
lines(density(max1),lwd=2,col="red")

I want you to experiment with other types of origin distributions.

The three Types (Gumbel, Frechet, Weibull) and the GENERALIZED EXTREME VALUE DISTRIBUTION (GEV)

The three types of extreme value distributions have double exponential and single exponential forms. The maxima of independent random variables converge (in the limit when n \to \infty) to one of the three types, Gumbel (e^{-e^{-y}}), Frechet (e^{-y^{-\gamma}}) or Weibull (e^{-(-y)^{\gamma}}) depending on the parent distribution.

We saw last week that these three types could be combined into a single function called the generalized extreme value distribution (GEV).

G(z) = e^{-[1+\xi(\frac{z-\mu}{\sigma})]^{-1/\xi}_{+}}

\mu is the location parameter. \sigma > 0 is the scale parameter. \xi controls the shape of the distribution (shape parameter).

When \xi \to 0, GEV tends to a Gumbel distribution.

When \xi > 0, GEV tends to the Frechet distribution.

When \xi < 0, GEV tends to the Weibull distribution.

It’s okay if you don’t know the origin distribution for an extreme dataset. GEV folds all three types into one form, and the parameters \mu, \sigma, and \xi can be estimated from the data. The function has a closed form solution to compute the quantiles and probabilities.
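
Before we bring in the package, here is a minimal sketch of that closed form, written straight from G(z) above for the case \xi \neq 0. The helper names are mine and are not part of any package.

# GEV CDF and its inverse (quantile function), for xi != 0
gev_cdf = function(z,mu,sigma,xi)
{
 t = 1 + xi*(z-mu)/sigma
 ifelse(t > 0, exp(-t^(-1/xi)), ifelse(xi > 0, 0, 1)) # outside the support, G is 0 or 1
}

gev_quantile = function(p,mu,sigma,xi)
{
 mu + (sigma/xi)*((-log(p))^(-xi) - 1)
}

# sanity check: the quantile function inverts the CDF
gev_cdf(gev_quantile(0.9,mu=0,sigma=1,xi=0.2),mu=0,sigma=1,xi=0.2) # returns 0.9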

GEV IN R

Let’s play with some data and use GEV in R. We will use two datasets, NYC temperature, and cycles to fatigue of steel. The NYC temperature data is the same file we used in lesson 56. The cycles to fatigue is the data from our labs where we measured the maximum number of cycles before failure due to fatigue for ten steel specimens.

Read the files into the workspace using the following lines.

# fatigue data
fatigue = read.table("cycles_fatigue.txt",header=F)

# temperature data 
temperature_data = read.csv("cp_temperature.csv",header=T)

ExtRemes Package

We need to install a package in R called extRemes. Type the following lines in your code.

install.packages("extRemes")
library(extRemes)

This package has functions built for GEV distribution. Let’s try a few simple things first by generating random variables of the three types. Type the following lines in your code.

x = revd(10000,loc=0,scale=1,shape=0)

This command (revd) will generate 10000 GEV random variables with a location of 0, scale of 1 and shape of 0. So, this is a Gumbel distribution.

Type these lines and see what you get.

# Generate Gumbel Random numbers
x = revd(10000,loc=0,scale=1,shape=0)
hist(x,prob=T,xlab="Random Variables from Gumbel (location = 0,scale = 1, shape =0)",main="Gumbel Distribution",ylab="f(x)",font=2,family="serif",font.lab=2,cex.lab=1.5)
plot(logspline(x),add=T)

Now hold the shape parameter constant at 0 and alter the location and scale parameters. A change in the location parameter will shift the distribution; a change in the scale parameter will stretch or shrink the distribution.
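
For instance, something like this (the numbers are arbitrary) shows the shift and the stretch on top of the Gumbel sample x we just generated.

# shift with loc, stretch with scale; shape held at 0
x_shift = revd(10000,loc=2,scale=1,shape=0)
x_stretch = revd(10000,loc=0,scale=3,shape=0)
plot(logspline(x),xlim=c(-5,20),font=2,font.lab=2,xlab="x")
plot(logspline(x_shift),add=T,col="blue") # shifted to the right
plot(logspline(x_stretch),add=T,col="green") # stretched wider and flatter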

I guess you know the chores now.

Yes, we experiment with positive and negative shape parameters to generate the Frechet and Weibull distributions.

Generate Frechet Random numbers

# Frechet distribution plot
y = revd(10000,loc=0,scale=1,shape=0.2)
hist(y,prob=T,ylim=c(0,0.4),xlab="Random Variables from Frechet (location = 0, scale = 1, shape = 0.2)",main="Frechet Distribution",ylab="f(x)",font=2,family="serif",font.lab=2,cex.lab=1.5)
plot(logspline(y),add=T,col="red")
plot(logspline(x),add=T)

Generate Weibull Random numbers

## Weibull Distribution Plot 
z = revd(10000,loc=0,scale=1,shape=-0.6)
hist(z,prob=T,ylim=c(0,0.5),xlab="Random Variables from Weibull (location = 0, scale = 1, shape = -0.6)",main="Weibull Distribution",ylab="f(x)",font=2,family="serif",font.lab=2,cex.lab=1.5)
plot(logspline(z),add=T,col="red")

If you are comfortable with this, it is time to get your hands dirty with real data.

Fitting GEV distribution to data

Let’s examine the maximum cycles to fatigue data. We do not know which extreme value distribution it follows. If we fit a GEV and observe the shape parameter, we can say with certain confidence that the data follows Type I, Type II or Type III distribution.

For this, we can use the fevd command. Let’s use the fevd function to fit a GEV to the cycles to fatigue data.

fatigue_cycles = as.matrix(fatigue)

fit = fevd(fatigue_cycles,type="GEV")
summary(fit)

[1] "Estimation Method used: MLE"

Negative Log-Likelihood Value: 132.8045

Estimated parameters:
 location scale shape 
 5.765016e+05 1.680629e+05 -5.231015e-01

Standard Error Estimates:
 location scale shape 
6.276363e+04 6.709826e+04 4.199849e-01

Estimated parameter covariance matrix.
 location scale shape
location 3.939273e+09 -527662537.18 -9.654182e+03
scale -5.276625e+08 4502177071.08 -2.306339e+04
shape -9.654182e+03 -23063.39 1.763873e-01

AIC = 271.6089

BIC = 272.5167

Pay attention to the estimated parameters for now. The shape parameter is -0.52 (\xi < 0). So, a Weibull distribution fits the data with high likelihood.

Block Maxima

Let’s examine the temperature data. Assume we are interested in analyzing the data for maximum temperature each year. Remember we only care about the extremes. For this, we should first extract the annual maximum temperature values. This idea is called the block maxima. Each year is a block, and we get the maximum for each year.

Type the following lines in your code to get the annual maximum temperature values from 1951 to 2017.

# Annual Maximum Ave Temperature #
yr1 = 1951
yr2 = 2017
n = yr2 - yr1 + 1

annmax = matrix(NA,nrow=n,ncol=1)
for (i in 1:n)
{
 yr = 1950 + i
 index = which((temperature_data[,1] == yr))
 temperature_yr = temperature_data[index,4]
 annmax[i,1] = max(temperature_yr,na.rm=T)
}

hist(annmax,xlab="Annual Maximum Temperature",font=2,main="",font.lab=2)

There is one value very different and far away from all the other values. This is a characteristic feature of extreme values.

Rare events never seen before can occur. Using a model (e.g., GEV function) for these “unknowns” comes with uncertainty.

Now let’s fit GEV to the temperature data and look at a few things.

fit_temperature = fevd(annmax,type="GEV")

The fevd function will fit a GEV distribution to the data. The location, scale and shape parameters of the function are estimated based on the data.

Now type the following line in your code.

plot(fit_temperature)

You should see the following figure appear in the plot window.

For now, focus on the bottom two images. The one in the bottom left is showing how well the GEV function (blue line) is matching the observed data (black line). Pretty reasonable.

Can you tell me what is the image on the bottom right?

Okay, let me ask you this. What is the probability that the annual maximum temperature will be less than or equal to 92 degrees F?

Yes, you can compute this from the cumulative distribution function of the GEV distribution.

G(z) = P(Z \le z) = e^{-[1+\xi(\frac{z-\mu}{\sigma})]^{-1/\xi}_{+}}

Substitute z = 92, \mu = 86.15, \sigma = 2.65, \xi = -0.03 in the equation and you get P(Z \le z) = 0.903.

I am sure you know why I substituted  \mu = 86.15, \sigma = 2.65, \xi = -0.03. These are the estimated parameters you get from the summary of the fit.

So, the probability that the annual maximum temperature will be less than or equal to 92 degrees F is 0.9.

Can I say that the probability of exceeding 92 degrees F is 0.1?

P(Z > 92) = 0.1

A temperature of 92 degrees F is exceeded 10% of the time.

What do you know about exceedance probability and return period? If a random variable is exceeded with 10% probability, what is the frequency of its occurrence? Lesson 34? The language of return period?

Return Period T = \frac{1}{P(exceedance)}

So if 92 degrees F is the temperature that is exceeded 10% of the time, its return period is ten years.

On average, a daily temperature as high as 92 degrees F will occur once every ten years in New York City.

In equation form, Return Period of a quantile z is T = \frac{1}{1 - G(z)}.

Now, can you answer this?

What temperature (z) occurs once in 50 years?

I know you are thinking of inverting the return period equation. That is the way.

G(z) = 1 - \frac{1}{T}

e^{-[1+\xi(\frac{z-\mu}{\sigma})]^{-1/\xi}} = 1 - \frac{1}{T}

[1+\xi(\frac{z-\mu}{\sigma})]^{-1/\xi} = -ln(1 - \frac{1}{T})

1+\xi(\frac{z-\mu}{\sigma}) = (-ln(1 - \frac{1}{T}))^{-\xi}

z = \mu + \frac{\sigma}{\xi}((-ln(1 - \frac{1}{T}))^{-\xi} - 1)

Now if you substitute T = 50 years and  \mu = 86.15, \sigma = 2.65, \xi = -0.03, you get z = 95.95.

The 50 year return period temperature is 95.95 degrees F.

This is what is being computed internally within the extRemes package to create the plot you see. For a quantile z, the extRemes package has the qevd() function, where you input the probability p and the other parameters. For the probability P(Z \le z), it is pevd(), where you input the quantile z and the other parameters.
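
For example, here is a quick check of the hand calculations above, using the fitted values \mu = 86.15, \sigma = 2.65, \xi = -0.03.

# probability of staying at or below 92 deg F, and the 50-year return level
pevd(92,loc=86.15,scale=2.65,shape=-0.03,type="GEV") # about 0.90
qevd(1 - 1/50,loc=86.15,scale=2.65,shape=-0.03,type="GEV") # about 95.95 deg F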

You can also read these relations directly off the plot.

What are those dashed lines in the return period plot?

The 500-year return period temperature is 101.3 degrees F. What does it even mean when they say the return period is 500 years?

How can something happen once in 500 years when we do not have 500 years of data?

Do you want to believe it? How confident are you in this estimate?

If this does not spin your head, let me add more. Look at the summary of the model once again.

summary(fit_temperature)

fevd(x = annmax, type = "GEV")

[1] "Estimation Method used: MLE"

 Negative Log-Likelihood Value: 169.1602

 Estimated parameters:
 location scale shape 
86.14530247 2.64886498 -0.02706872

Standard Error Estimates:
 location scale shape 
0.3552756 0.2500867 0.0663774

Estimated parameter covariance matrix.
 location scale shape
location 0.126220730 0.028413268 -0.006875094
scale 0.028413268 0.062543342 -0.003859995
shape -0.006875094 -0.003859995 0.004405960

AIC = 344.3204

BIC = 350.9344

The shape parameter is negative but very close to 0. Can we say that the GEV is a Gumbel distribution? Why not Weibull?

Did you ask yourself how these parameters are estimated from the data? How much data is required for this?

Did you ask yourself whether the shape parameter value is significantly different from 0 to say it is not Gumbel?

You can see so many new terms, “Estimation Method used: MLE,” standard error, and so on. What are these? What about the other two images in the fevd plot? Aren’t you curious about them? Well, hold on to that suspense for a few weeks. Very soon, we will start a new journey of inference.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 56 – Continuous distributions in R: Part II

The sun breaks, the clock rolls, and the temperature warms, marking the end of the hibernation.

Last year, when we rolled back into the winter, we were learning discrete distributions in R using November temperature data.

It is only fitting that today, inviting the warmth, we continue learning continuous distributions using March temperature.

It’s 40 degrees Fahrenheit with 40% humidity today in New York. Pleasant enough to take a nice walk without much effort. It comes after a stormy week.

Last week, before leaving, I asked you whether we will see a gusty event before today. Indeed, midweek, we had another windy Wednesday. The expected wait time is 19 days. It does not happen every 19 days. It may happen earlier, or we may have to wait longer. They average out to 19 days.

Since it is warm and relatively humid, we should analyze the data for March temperature.

Get the data from our source. Go through the chores of setting up the code. Coffee is a must.

FIRST STEPS

Step 1: You can download the filtered data file here.

Step 2: Create a new folder, “lesson56,” and save the data file in this folder.

Step 3: Create a new code for this lesson, “lesson56_code.R,” and save it in the same folder.

Step 4: Choose your working directory. “lesson56” is the folder where we stored the data file. Use “setwd(“path”)” to set the path to this folder.

setwd("path to your folder")

Step 5: Read the data into R workspace

I have the data in the file named “cp_temperature.csv.”

There are six columns in this data file. Year, Month, Day, TAVG (the average temperature for the day), TMAX (the maximum temperature for the day) and TMIN (the minimum temperature for the day).

We have data from 1869 onwards. Type the following line in your code and execute it. Use header=TRUE in the command.

# Read the file #
cp_data = read.csv("cp_temperature.csv",header=TRUE)

Let’s Understand the Central Limit Theorem

I want to take the data for TAVG in March and check its distribution. Since we are only nine days into March 2018, we will take the 31 days of data for March 2017. You can use the which command to extract the data.

# Extract March 2017 Daily Temperature Data #
index = which((cp_data[,2]==3) & (cp_data[,1] == 2017))
march_2017 = cp_data[index,4]

Through this command, you are looking for the days when the second column (Month) is 3 (March) and the first column (Year) is 2017. So “index” will be the row index in the data matrix. Then, we extract the fourth column (TAVG) of these rows.

This is how the probability distribution of these 31 values looks. Use the following lines to make this plot.

library(locfit)
hist(march_2017,prob=T,main="",font=2,font.lab=2,xlab="March 2017 Daily Temperature")
lines(locfit(~march_2017))

Doesn’t it look crazy? Let’s look at 2016 March daily data also.

These values are much more skewed.

Recall central limit theorem from lesson 48.

If S_{n} is the sum or average of n independent random variables, then the distribution function of S_{n} can be well-approximated by a continuous function known as the normal density function.

Let’s develop a simple code to see this manifest. These are the steps.

  1. Take the 31 days TAVG data from March of each year. For simplicity, I will do it from 1951 to 2017. 67 years.
  2. Plot the distribution of these 31 values. They may be skewed.
  3. Compute the average monthly temperature for March. AVG is the average of the 31 values in March in that year. AVG = \frac{1}{n}\sum_{i=1}^{i=31}TAVG_{i}.
  4. Analyze the distribution of these averages (AVG). Because AVG is the average of n random variables, in the limit, it should converge to a normal distribution. In our case, n = 31; we create 67 values of AVG (1951 to 2017) and verify the probability distribution.

Here is the view of the manifestation of the central limit theorem. I set it up like an animation to run in the plot domain.

In the top panel, you will see the distribution of TAVG for March for each year unfolding. Each time I am marking the average of these 31 values with a red color triangle. Then, in the second panel, these red triangles are plotted as a distribution.

Here is the code for it. You can try it yourself.

## Code to demonstrate the Central Limit Theorem ##
yr1 = 1951
yr2 = 2017
n = yr2 - yr1 + 1

index = which((cp_data[,2]==3) & (cp_data[,1] == yr1)) 
march_daily_temperature = cp_data[index,4]

par(mfrow=c(2,1))
plot(locfit(~march_daily_temperature),main="Average Daily Temperature",col="grey",xlim=c(10,80),ylim=c(0,0.1),font=2,font.lab=2,xlab="Temperature (deg F)")
points(mean(march_daily_temperature,na.rm=T),0,col="red",pch=25)

avg = matrix(NA,nrow=n,ncol=1)
for (i in 1: n)
{
 yr = 1950 + i
 index2 = which((cp_data[,1]==yr) & (cp_data[,2]==3))
 daily = cp_data[index2,4]

lines(locfit(~daily),col="grey")
 
 avg[i,1] = mean(cp_data[index2,4],na.rm=T)
 points(avg[i,1],0,col="red",pch=25)
 Sys.sleep(0.5) 
}
plot(locfit(~avg),main="Average Monthly Temperature",col="white",lwd=2,xlim=c(10,80),font=2,font.lab=2,xlab="Temperature (deg F)")
lines(locfit(~avg),col="red",lwd=2)

In lesson 48, I also showed how the normal distribution emerges from the convolution of Uniform and Poisson distribution. Here is the code for it.

# Uniform Dist #
x1 = runif(20000)
x2 = runif(20000)
x3 = runif(20000)
x4 = runif(20000)
x5 = runif(20000)

z1 = x1
z2 = x1 + x2
z3 = x1 + x2 + x3
z4 = x1 + x2 + x3 + x4
z5 = x1 + x2 + x3 + x4 + x5

plot(density(x1),xlim=c(0,5),ylim=c(0,2),font=2,font.lab=2,main="",xlab="Sn")
text(0.2,1.1,"Sn = X1")
Sys.sleep(1)

lines(density(z2),col="green",lwd=2)
text(1.2,1.1,"Sn = X1+X2",col="green")
Sys.sleep(1)

lines(density(z3),col="pink",lwd=2)
text(2.2,0.8,"Sn = X1+X2+X3",col="pink")
Sys.sleep(1)

lines(density(z4),col="grey",lwd=2)
text(3.1,0.6,"Sn = X1+X2+X3+X4",col="grey")
Sys.sleep(1)

lines(density(z5),col="yellow",lwd=2)
text(3.6,0.3,"Sn = X1+X2+X3+X4+X5",col="yellow")
Sys.sleep(1)

# Poisson Dist #
lambda = 1
x1 = rpois(10000,lambda)
x2 = rpois(10000,lambda)
x3 = rpois(10000,lambda)
x4 = rpois(10000,lambda)
x5 = rpois(10000,lambda)
x6 = rpois(10000,lambda)
x7 = rpois(10000,lambda)
x8 = rpois(10000,lambda)
x9 = rpois(10000,lambda)
x10 = rpois(10000,lambda)

z1 = x1
z2 = x1 + x2
z3 = x1 + x2 + x3
z4 = x1 + x2 + x3 + x4
z5 = x1 + x2 + x3 + x4 + x5
z6 = x1 + x2 + x3 + x4 + x5 + x6
z7 = x1 + x2 + x3 + x4 + x5 + x6 + x7
z8 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
z9 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
z10 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10

Sys.sleep(1) 
hist(x1,xlim=c(0,20),ylim=c(0,5000),col="red",breaks=50,main="",font=2,font.lab=2,xlab="Sn")
Sys.sleep(1)

hist(z2,add=T,col="green",breaks=50)
Sys.sleep(1)

hist(z3,add=T,col="pink",breaks=50)
Sys.sleep(1)

hist(z4,add=T,col="grey",breaks=50)
Sys.sleep(1)

hist(z5,add=T,col="yellow",breaks=50)
Sys.sleep(1)

hist(z6,add=T,col="orange",breaks=50)
Sys.sleep(1)

hist(z7,add=T,col="blue",breaks=50)
Sys.sleep(1)

hist(z8,add=T,col="violet",breaks=50)
Sys.sleep(1)

hist(z9,add=T,col="magenta",breaks=50)
Sys.sleep(1)

hist(z10,add=T,col="black",breaks=50)
text(15,1100,"Sn = X1 + X2 + ... + X10")

Normal Distribution

We now know that AVG follows a normal distribution.

What is the probability that the average March monthly temperature AVG will be less than or equal to 40?

.

.

.

I know you are reaching for your standard normal tables to compute

 P(AVG \le 40) = P(\frac{AVG-\mu}{\sigma} \le \frac{40-42.197}{3.113}) = P(Z \le -0.706) = 0.24

While you can do this on paper, in R, you can just use the pnorm command like this.

# Probability Computations #
pnorm(40,mean(avg),sd(avg))

If you want to compute the probability that AVG will exceed 50 deg F, you can do it like this.

1 - pnorm(50,mean(avg),sd(avg))

Now, think about the qnorm command.
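
For instance, qnorm is the inverse of pnorm: give it a probability, and it returns the quantile.

# the temperature with a 24% chance of not being exceeded -- roughly 40 deg F
qnorm(0.24,mean(avg),sd(avg))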

Lognormal Distribution

If AVG follows a normal distribution, then, X = e^{AVG} is a lognormal distribution because log of X is normal.

We know that exponentiating a normal distribution will result in a lognormal distribution.

The average monthly March temperature, AVG, if exponentiated, will transform into a lognormal distribution.

Instead of merely taking the numbers and showing the distribution of the exponentials, I think it will be fun to see where in the real world we do exponential of temperature.

Have you ever wondered if air temperature influences the amount of water vapor it can hold?

The water holding capacity (vapor) increases as the temperature increases. The higher the temperature, the stronger will be the bond between water and air molecules.

We can measure the vapor pressure using this simple equation.

VP = 0.61078e^{\frac{17.27T}{T+243.05}}

VP is the vapor pressure in kilopascals, T is the temperature in deg C. This is the famous Clausius-Clapeyron approximation used in meteorology and climatology to compute the vapor pressure.

You can check the equation and see that the water vapor pressure changes approximately exponentially with temperature. The water-holding capacity of the atmosphere increases by about 7% for every 1 °C rise in temperature.
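
Here is a quick back-of-the-envelope check of that rule using the equation above; the 5 °C starting point is arbitrary.

# ratio of vapor pressures one degree apart, near 5 deg C
vp_at = function(T) 0.61078*exp((17.27*T)/(T+243.05))
vp_at(6)/vp_at(5) # about 1.07, i.e., roughly a 7% increase per deg C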

Let’s compute the vapor pressure for these 67 values of temperature and check out the distribution of vapor pressure, which is roughly exponential of a normal distribution.

## Vapor Pressure ##
Tc = (avg-32)*(5/9) 
vp = 0.61078*exp((17.27*Tc)/(Tc+243.05))

# Plotting and Comparing #
par(mfrow=c(2,1))
hist(Tc,prob=T,main="Average Monthly Temperature",lwd=2,font=2,font.lab=2,xlab="Temperature (deg C)")
lines(locfit(~Tc),col="grey")

hist(vp,prob=T,main="Vapor Pressure ~ exp(T)",lwd=2,font=2,font.lab=2,xlab="Vapor Pressure (kpa)")
lines(locfit(~vp),col="grey")

You should see this plot in your plot domain.

The top panel is the temperature data expressed in deg C.

The bottom panel is the vapor pressure. See how the data becomes skewed. We transformed the data from normal to a lognormal distribution.

You must have noticed that I am using a line par(mfrow). This command is used to make multiple plots as panels. par(mfrow = c(2,1)) will split the plot space into two rows and one column so that we can make two plots, one below the other.

Think about the square of the temperature. X = AVG^{2}. What distribution will it follow? AVG square – Chi-square?

A standard rule of thumb for distributions in R is to use the letters p and q in front of the name of the distribution.

Exponential distribution
pexp to compute the probability.
qexp to compute the quantile.

Gamma distribution
pgamma to compute the probability.
qgamma to compute the quantile.

Normal distribution
pnorm to compute the probability.
qnorm to compute the quantile.

Lognormal distribution
plnorm to compute the probability.
qlnorm to compute the quantile.

Chi-square distribution
pchisq to compute the probability.
qchisq to compute the quantile.

You should have guessed the command if you want to draw random variables from a distribution.
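
Yes, it is the r prefix. For instance, using the avg values from earlier:

# the r prefix draws random values; the d prefix gives the density
rnorm(5,mean(avg),sd(avg)) # five random draws from the fitted normal
rexp(5,rate=0.5) # five random draws from an exponential
dnorm(40,mean(avg),sd(avg)) # density of the fitted normal at 40 deg F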

And so, as the beautiful warm day ends, we end our discussion of continuous distributions for now.

As the clocks rolleth, the days breaketh soon for you to relish the spring warmeth. Relish the warmeth, for the “extremum” will cometh.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
