Lesson 70 – The quest for truth: Learning estimators in R, part II

T-2 days

Wilfred Jr. was typing away on his Macintosh. Keneth was writing something in his notebook.

“Do you have an estimate yet?” asked Wilfred. “In a minute.”

Wilfred Jr. and Keneth were busy preparing for the meeting in a couple of days, systematically analyzing the data they had collected over the past few weeks. Their workplace was well lit by a strong beam of sunlight through the window. They both looked out; it was deceptively pleasant, one of those newsworthy summer days. The local weatherman had already given it a rank of 2 in his “new normals” show.

The wall behind the big-screen computer was a corkboard covered with sticky notes and pins.

“We have an answer for the first two questions,” said Ken. “Last week, we checked this and found that the true proportion of dummies is somewhere around 2%, i.e., 0.02024.”

“There you go. Look at this.” Wilfred hit Command + Enter on his keyboard. “I like this version better,” he said.

Keneth looked at the code and wondered if Willy would explain to him what was going on.

“So, we are showing how the estimate for the proportion of dummies \hat{p}=\frac{r}{n} changes as we add more data from the stores. The more data we have (i.e., the more stores we visit), the closer the estimate gets to the truth. After a while, the change in the value becomes small, and the estimate stabilizes and converges to the truth. This is the consistency criterion of the estimator,” said Wilfred as he pointed to the blue line on the graph.

“What are the red dots and the boxplot that shows up at the end?” asked Keneth.

“Good questions. Each red dot that appears vertically at the end is the estimated proportion for a given store. So, if a store has 2% dummies, the estimate for that store is 0.02, and a red dot appears at 0.02. Like this, we have 1000 different sample estimates for the proportion of dummies. The boxplot at the end shows this distribution, the distribution of the sample proportions. Did you notice anything?” asked Wilfred.

Ken looked at the plot again, this time with more attention.

“The blue line meets the big red circle. I can see that the big red circle is the mean of the red dots, E[\hat{p}]. The boxplot also looks asymmetric. The expected value of the sample proportions and the long-run estimate are the same.”

E[\hat{p}]=p=0.02024

Wilfred nodded in acceptance. “We discussed this last week. Since the estimate \frac{r}{n} is an unbiased estimate of the true proportion, the mean of the sampling distribution will be the true proportion. The long-run converged estimate is also the same value.”
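(A one-line reminder of why this holds, under the assumption that the number of dummies r in a pack of n candies behaves like a Binomial(n, p) random variable, an assumption stated here only for this reminder:)

E[\hat{p}]=E[\frac{r}{n}]=\frac{1}{n}E[r]=\frac{1}{n}np=p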

“Can you share the code with me? I want to take some time and learn how to make these visuals.” Keneth was imagining how cool he would look in front of his friends when he presented his summer project.

“Sure. Here it is. I wrote comments for each line. It should be self-explanatory.”

# Graphic analysis for proportion #

# For each store, compute the sample proportion #
# The expected value of these estimates is the true parameter # 
# With each store visited, we can recompute the sample proportion #
# This will converge and stabilize at the true parameter value #

# Reading the candy data file #
candy_data = read.table("candy_samples.txt",header=F)

nstores = ncol(candy_data) # number of stores 
ncandies = nrow(candy_data) # number of candies in a pack

candies = NULL # an empty vector that will be filled with actual candies data from each store
dummies = NULL # an empty vector that will be filled with dummy candies data from each store

sample_proportion = NULL # as the sample increases, we re-compute the sample proportion of dummy candies

sample_p = matrix(NA,nrow=nstores,ncol=1) # saving the estimate from each store. The expected value of this vector = truth

plot(0,0,col="white",xlim=c(0,1300),ylim=c(0,0.08),xlab="Stores",ylab="Proportion of Dummies",font=2,font.lab=2)

for (i in 1:nstores)
{
# sample pack of candies and the index of dummies in the pack
candy_pack = candy_data[,i]
dummy_index = which(candy_pack < 1.5)

# separating candies from the dummies in each pack 
sample_dummies = candy_pack[dummy_index]

if(length(dummy_index)>0){sample_candies = candy_pack[-dummy_index]}else{sample_candies = candy_pack}

# recording the weights on candies and dummies -- this vector will grow as we visit more stores, i.e., collect more samples 
dummies = c(dummies,sample_dummies)
candies = c(candies,sample_candies)

# computing the proportion of dummies -- this vector will grow .. each incremental value is a better estimate with a greater sample size. 
p = length(dummies)/(ncandies*i)
sample_proportion = c(sample_proportion,p)

# storing the sample proportion for the store in a matrix 
sample_p[i,1] = length(dummy_index)/ncandies

# plot showing the trace of the sample proportion -- converges to truth
points(i,sample_proportion[i],cex=0.075,col="blue")
points(1000,sample_p[i],cex=0.1,col="red")
Sys.sleep(0.05)
#print(i)
}

Sys.sleep(1)
points(1000,mean(sample_p),pch=19,col="red",cex=2)
Sys.sleep(1)
boxplot(sample_p,add=T,at=1050,boxwex=25,range=0)
Sys.sleep(1)
text(1200,mean(sample_p),expression(E(hat(p))==0.02),col="red",font=2)
What are the population characteristics?

Their attention now shifted to the final question. When the day started, they had discussed some ideas and decided they would present the analysis that investigates the true mean and the true variance of the population.

If the weight of all the Quality Candies is the population, then, it can be represented using the mean (\mu) and the variance (\sigma^{2}) of the population.

Willy and Ken’s task is to find these values. Estimate the parameters.

Here is what they did.

They visited the first store and found that the estimated proportion of dummy candies is 0.02. As Ken separated the candies from the dummies, he also put each candy on a weighing scale and measured its weight.

The average weight of the 98 candies from the first store is 3.00748 grams.

Ken estimated this using the formula \hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}. This is the maximum likelihood estimator for the true mean.

They record this as the estimated sample mean for the first store.

They went to the second store. Here they did not find any dummies. The estimated sample mean weight of the 100 candies from this store is 3.0073 grams. This is another estimate of the true mean.

Overall, they now have 198 candies (a bigger sample). Ken also computed the sample mean for these 198 candies. This comes out as 3.007389 grams.
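This bigger-sample estimate is simply the sample-size-weighted average of the two store means (computed here from the rounded values above):

\bar{x} = \frac{98 \times 3.00748 + 100 \times 3.0073}{198} = 3.007389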

Like this, as they visited more and more stores, they re-computed the average weight of the candies using a larger sample. They also computed the individual estimate of the sample mean for that store.

Like in the case of the proportion of dummies, they want to check the convergence of the estimated mean.

“I think you can use the same code as before, but change the lines where you computed the proportions to compute the sample means,” said Ken, who was knee-deep in the lines of the code.

“You are correct! All I did was that and here we go. I changed the color of the converging line.”

“I will interpret this now,” said Ken. “The sample mean \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is an unbiased estimate of the population mean \mu. I know it from Lesson 67. The big red circle is the expected value of the sample means, the center of the distribution. So this is E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}], which is equal to the true mean \mu. The orange line, the estimate with increasing sample size, also converges to this number.”

E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}] = \mu = 3

“Excellent. What else do you observe? Anything from the boxplot?”

“Hmm, it does look symmetric. Is the distribution of the sample means normally distributed?” asked Ken.

“I will let you mull over that. You do know about the Central Limit Theorem, right?” Wilfred finished writing some comments in his code.

# Graphic analysis for mean #

# For each store, compute the sample mean #
# The expected value of these estimates is the true parameter # 
# With each store visited, we can recompute the sample mean #
# This will converge and stabilize at the true parameter value #

nstores = ncol(candy_data) # number of stores 
ncandies = nrow(candy_data) # number of candies in a pack

candies = NULL # an empty vector that will be filled with actual candies data from each store
dummies = NULL # an empty vector that will be filled with dummy candies data from each store

sample_meanweight = NULL # as the sample increases, we re-compute the sample mean of the candies

sample_mean = matrix(NA,nrow=nstores,ncol=1) # save the value of sample mean for each store

plot(0,0,xlim=c(0,1200),ylim=c(2.945,3.055),xlab="Stores",ylab="Average Candy Weight",font=2,font.lab=2)

for (i in 1:nstores)
{
# sample pack of candies and the index of dummies in the pack
candy_pack = candy_data[,i]
dummy_index = which(candy_pack < 1.5)

# separating candies from the dummies in each pack 
sample_dummies = candy_pack[dummy_index]

if(length(dummy_index)>0){sample_candies = candy_pack[-dummy_index]}else{sample_candies = candy_pack}

# recording the weights on candies and dummies -- this vector will grow as we visit more stores, i.e., collect more samples 
dummies = c(dummies,sample_dummies)
candies = c(candies,sample_candies)

# computing the mean weight of the candies -- this vector will grow .. each incremental value is a better estimate with greater sample size. 
xbar = mean(candies)
sample_meanweight = c(sample_meanweight,xbar)

# storing the sample mean for the store in a matrix 
sample_mean[i,1] = mean(sample_candies)

# plot showing the trace of the sample mean -- converges to truth
points(i,sample_meanweight[i],cex=0.15,col="orange")
points(1000,sample_mean[i],cex=0.15,col="red")
Sys.sleep(0.05)
print(i)
}

Sys.sleep(1)
points(1000,mean(sample_mean),pch=19,col="red",cex=2)
Sys.sleep(1)
boxplot(sample_mean,add=T,range=0,boxwex=25,at=1050)
Sys.sleep(1)
text(1150,mean(sample_mean),expression(E(hat(mu))==3),col="red",font=2)
“What about the variance?” asked Wilfred.

“Same procedure. In fact, when I computed the sample mean for each store and for the overall incremental sample, I also computed the corresponding sample variances using the maximum likelihood estimator \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}.”

“Give me a few minutes, I will code this up and show you.” Ken looked excited. He finally got to write his own code and look at the outputs. He typed up a few lines, mostly editing the previous versions.
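(A minimal sketch of what those edits might look like, assuming candy_data, nstores, and the dummy threshold from the earlier code; sample_var, sample_varweight, and mle_var are placeholder names, and the per-store and incremental plotting lines would follow the same pattern as before. The full code shared at the end of the lesson is the complete version.)

# Graphic analysis for variance -- a sketch #

candies = NULL # growing vector of candy weights
sample_varweight = NULL # variance of the growing sample (MLE, divide by n)
sample_var = matrix(NA,nrow=nstores,ncol=1) # variance estimate for each store (MLE, divide by n)

mle_var = function(x){mean((x - mean(x))^2)} # the 1/n estimator, with xbar in place of mu

for (i in 1:nstores)
{
# candies from this pack, with the dummies removed
candy_pack = candy_data[,i]
dummy_index = which(candy_pack < 1.5)
if(length(dummy_index)>0){sample_candies = candy_pack[-dummy_index]}else{sample_candies = candy_pack}

candies = c(candies,sample_candies)

# incremental estimate from the growing sample, and one estimate per store
sample_varweight = c(sample_varweight,mle_var(candies))
sample_var[i,1] = mle_var(sample_candies)
}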

Here is what they found.

Ken looked puzzled. He expected to see the same pattern; however, much to his surprise, he found that the expected value of the sample variance E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}] and the long-run converged estimate for the variance are not the same.

E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}] \neq \sigma^{2}

“Is there a bias?” he asked.

Wilfred was expecting this. He had done his homework from Lesson 67!

“Yes, there is.”

“The maximum likelihood estimator \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}} is a biased estimator.”

“Let me show you why.”

“I will rewrite the estimator as \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}} by using \bar{x} in place of \mu.”

\hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}

=\frac{1}{n}\sum(x_{i}^{2} + \bar{x}^{2} -2\bar{x}x_{i})

=\frac{1}{n}\sum x_{i}^{2} + \frac{1}{n} \sum \bar{x}^{2} - \frac{2}{n}\sum x_{i}\bar{x}

=\frac{1}{n}\sum x_{i}^{2} + \frac{1}{n} n \bar{x}^{2} - 2 \bar{x} \bar{x}

=\frac{1}{n}\sum x_{i}^{2} + \bar{x}^{2} - 2 \bar{x}^{2}

=\frac{1}{n}\sum x_{i}^{2} - \bar{x}^{2}

\hat{\sigma^{2}}=\frac{1}{n}(\sum x_{i}^{2} - n\bar{x}^{2})

“Now, let’s apply the expectation operator on this formula.”

E[\hat{\sigma^{2}}]=E[\frac{1}{n}(\sum x_{i}^{2} - n\bar{x}^{2})]

=\frac{1}{n}(\sum E[x_{i}^{2}] - n E[\bar{x}^{2}])

=\frac{1}{n}(\sum (V[x_{i}]+(E[x_{i}])^{2}) - n (V[\bar{x}]+(E[\bar{x}])^{2}))

“I am sure you know why I wrote that step. It comes from the property of the variance operator: V[x]=E[x^{2}]-(E[x])^{2}, so E[x^{2}]=V[x]+(E[x])^{2}.”

“Now, since x_{1}, x_{2}, ..., x_{n} are independent and identically distributed random samples from the population with true parameters \mu and \sigma^{2}, they will have the same expected value and variance as the population. E[x_{i}]=\mu and V[x_{i}]=\sigma^{2}. I will use this simple idea to reduce our equation.”

E[\hat{\sigma^{2}}] = \frac{1}{n}(\sum (\sigma^{2}+\mu^{2}) - n (V[\bar{x}]+(E[\bar{x}])^{2}))

“If you go back and read Lesson 68 (Bias joins Variance), you will see that V[\bar{x}]=\frac{\sigma^{2}}{n}. You already know that E[\bar{x}]=\mu. Let’s substitute those here.”

E[\hat{\sigma^{2}}] = \frac{1}{n}(\sum (\sigma^{2}+\mu^{2}) - n (\frac{\sigma^{2}}{n}+\mu^{2}))

E[\hat{\sigma^{2}}] = \frac{1}{n}(n\sigma^{2} + n\mu^{2} - \sigma^{2} -n\mu^{2})

E[\hat{\sigma^{2}}] = \frac{n-1}{n}\sigma^{2}

“So, \hat{\sigma^{2}} is biased by E[\hat{\sigma^{2}}]-\sigma^{2}=\frac{n-1}{n}\sigma^{2}-\sigma^{2}=-\frac{\sigma^{2}}{n}.”

“That is why you are seeing that the expected value of the sample variances is not the same as the value where the overall incremental estimate converges (the population variance). The estimator with n in the denominator underestimates the true variance \sigma^{2} because of the negative bias.”

If you want an unbiased estimator, you should use \frac{1}{n-1} in the denominator instead of n.

The unbiased estimator for variance is \hat{\sigma^{2}}=\frac{1}{n-1}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}
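A quick side check, not part of Wilfred's code: a small simulation (with an assumed pack size of 100 candies, 10,000 simulated packs, and a normal population using the true values of 3 grams and 0.2 grams) shows the effect of the two denominators.

# simulating the bias of the 1/n variance estimator -- illustrative assumptions #
set.seed(1)
n = 100 # candies per pack (assumed)
nsim = 10000 # number of simulated packs (assumed)

var_mle = rep(NA,nsim) # the 1/n estimator
var_unbiased = rep(NA,nsim) # the 1/(n-1) estimator (what var() computes)

for (k in 1:nsim)
{
x = rnorm(n,mean=3,sd=0.2)
var_unbiased[k] = var(x)
var_mle[k] = ((n-1)/n)*var(x)
}

mean(var_mle) # close to ((n-1)/n)*0.04 -- biased low
mean(var_unbiased) # close to 0.04 = 0.2^2 -- matches the true variance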

Wilfred added a few lines in the code and showed this plot to Ken.

“Look at this. I have added another boxplot to your figure. This boxplot is the distribution of the bias-adjusted (unbiased) variance estimates. The green dot is now the expected value of this distribution, which matches the true variance.”

Ken looked at it with a happy face, but something seemed to bother him.

“What is it?” asked Jr.

“Is the distribution of variance asymmetric?” he asked.

“Yes. Is that all?”

“Well, if the variance is bias-adjusted to match the truth, does it also mean the standard deviation, which is \sqrt{\hat{\sigma^{2}}}, will get adjusted?”

“That is a good question.”

No, although \hat{\sigma^{2}}=\frac{1}{n-1}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}} is an unbiased estimator for \sigma^{2}, \hat{\sigma}=\sqrt{\frac{1}{n-1}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}} is not an unbiased estimator for \sigma. The square root is a nonlinear (concave) function, and the expectation does not pass through it: by Jensen's inequality, E[\hat{\sigma}]=E[\sqrt{\hat{\sigma^{2}}}] < \sqrt{E[\hat{\sigma^{2}}]}=\sigma, so \hat{\sigma} slightly underestimates \sigma.

“Here, I re-ran the code with some changes to show this. The adjusted estimate still has some bias.”
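Along the same lines, a short simulation (again with the assumed values of n = 100 and 10,000 repetitions) makes the remaining bias visible: even the square root of the bias-adjusted variance sits a little below the true standard deviation of 0.2 grams.

# the square root of the unbiased variance estimate is still biased low -- illustrative #
set.seed(2)
n = 100
nsim = 10000
sd_hat = rep(NA,nsim)

for (k in 1:nsim)
{
x = rnorm(n,mean=3,sd=0.2)
sd_hat[k] = sd(x) # sqrt of the 1/(n-1) variance estimate
}

mean(sd_hat) # slightly below the true sigma of 0.2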

“Very interesting. I guess we are ready for the meeting. The true mean weight of the candies is 3 grams, and the true standard deviation of the weight of the candies is 0.2 grams. Can we have the code in the archives?” Ken started packing his bag as he said this. He looked exhausted too.

“Here is the full code.”

T-0 days

Mr. Alfred Quality and Wilfred Quality Sr. were disappointed to learn that there was a systematic packing error, although it was only 2%. They deployed their engineering team to fix this immediately.

They were nevertheless very impressed with the quality work of Wilfred Quality Jr. He owned the problem and came up with the answers. All along, he had collected a large sample that could be used for future tests, trained an enthusiastic high-school senior, and showed that he was ready to be the next Mr. Quality for Quality Candies.

As he was leaving the boardroom, Mr. Alfred Quality asked him:
“Junior, great work, but how confident are you that the true value lies within two standard deviations?”

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

