Lesson 69 – The quest for truth: Learning estimators in R, part I

For the fifteenth time, Mr. Alfred Quality looks at the two candies in front of him and shakes his head. He is visibly disgruntled. He has spent 50 years selling quality candies in North Dakota. The kids love them. Quality Candies are local and classic North Dakotan. Now, this new “Classic Series Pineapple Candy” seems to have deviated from his expectations.

There is a knock on the door. “That must be Wilfred,” thinks Mr. Quality as he says in an elevated voice, “Come on in.”

Wilfred Quality, son of Alfred Quality, gently opens the door and walks in. He is well dressed and looks around 50 years old. He notices his father’s displeasure as he steps closer.

“Son, I have noticed a lapse in quality in this Classic Series Candy. There is a candy with no candy. What would our customers think? We need immediate action on this. I want to know what proportion of our candies in the stores are dummies. We have to fix this before the customers accuse us of cheating them with empty candies. Is this a packing machine error, or is it a trick to adjust for inflation?”

“I will look into it. It must be a machine error. There would be no such inflation adjustment tricks, I assure you,” said Wilfred, the current CEO of Quality Candies. He has been running the business very efficiently since his father took up golf seriously.

“I will ask Junior to collect the data and give us the answers,” he said.

∞∞∞

Wilfred Quality Jr. is an enthusiastic lad, always on a quest for the truth. This summer, he is interning at one of his father’s offices in Hillsboro. He is unhappy about the quality issue and is determined to find the answers. In his mind, he is also thinking this would make a neat little data analysis project.

His father and grandfather gave him three questions to investigate.

1. What proportion of their candies are wrongly packed – dummies?

2. Is it a one-off error, or is it systematic?

3. What is the true nature of the candies, i.e., what are the population characteristics?

He had 30 days, enough monetary resources (it’s his company, after all), and RStudio at his disposal.

∞∞∞

For the next 25 days, he spent his time visiting around 40 local stores a day. From each store, he purchased a pack (100 candies) of “Classic Series Pineapple Candy.”

He also hired Keneth, the kid next door, to help him with some experiments. Keneth’s job was to take the candies out of the candy pack and separate the candies from the no-candies, like this.

He would also weigh them and enter the readings into a data table. Some of those candies became his prize after he was done with the experiment.

Wilfred Jr. then analyzes them in RStudio.

All in all, they visited 1000 stores and prepared this data file. The data file is a matrix with 100 rows and 1000 columns. Each column is the data of the weight of the candies from each store.

They take the data from the first store and make a simple dot chart.
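A dot chart for a single store can be made with `stripchart`. The snippet below is a sketch: since the candy data file is not included here, it simulates one store’s sample of 100 weights (candies assumed to weigh about 2 grams, plus two near-zero dummies); the simulated numbers are assumptions, not the actual data.

```r
# Sketch: dot chart of candy weights from one store.
# Weights are simulated here (assumed ~2 g candies plus 2 near-zero dummies)
# because the actual data file is not part of this post.
set.seed(1)
store_sample = c(rnorm(98, mean = 2, sd = 0.1), 0.05, 0.07)

stripchart(store_sample, font = 2, cex = 1.25, pch = 21, cex.axis = 1.5,
           xlab = "Candy Weight (grams)")
```

With the real data, `store_sample` would simply be the first column of the 100 x 1000 matrix.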

“See, the two dummies I found,” said Ken after looking at the plot on Wilfred’s screen.

“So that is like 2 in 100, a 2% error. Let’s take the second store and see what comes up,” said Wilfred.

“Hmm. There are no dummies in this sample. That’s 2 in 200 then – 1% error.”

“Should we plot a few stores and see if there are variations?” said Ken.

“Yes, we can do that. Let me take the first 12 stores and make the plots.” Wilfred types the following lines in RStudio and gets the plot.

# increasing number of stores #
par(mfrow = c(4, 3))

for (i in 1:12)
{
  store_sample = candy_samples[, i]  # weights of the 100 candies from store i
  stripchart(store_sample, font = 2, cex = 1.25, pch = 21, cex.axis = 1.5)
}

They see some variation. Some candy samples have dummies; some do not. “It looks like the dummies are not a one-off incident. It is a pattern. Some proportion of the candies are dummies,” thought Wilfred.

They decided to dig deeper.

They also decided to use 1.5 grams as the cutoff weight to screen for dummies.

If a candy in a sample weighs less than 1.5 grams, they will call it a dummy and compute the proportion of candies in the sample that are dummies.
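In R, this screening rule is a one-liner with `which`. The sketch below applies it to a simulated store sample (the weights are assumed values; with the real data, the sample would be a column of the `candy_samples` matrix):

```r
# Screen one store's sample for dummies using the 1.5-gram cutoff.
# Weights are simulated (assumed values), standing in for one column
# of the candy_samples matrix.
set.seed(2)
store_sample = c(rnorm(98, mean = 2, sd = 0.1), 0.05, 0.07)  # 2 dummies

dummies = which(store_sample < 1.5)              # indices of the dummy candies
p_store = length(dummies) / length(store_sample) # proportion of dummies
```

For this simulated pack, `p_store` works out to 2/100 = 0.02.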

Wilfred Jr. knows that the maximum likelihood estimator of p, the true proportion of dummies, is \hat{p}=\frac{r}{n}, where r is the number of dummies in a sequence of n independent and identically distributed random samples. He is using this knowledge to estimate the proportion of dummies.

The more samples he collects, the better his estimate will be.

Each store he visits gives him a new sample, and the total number of candies keeps increasing. The more samples he gets, the closer he gets to the true population.

As the samples increase, the estimate gets closer to the truth.

“Let’s compute two estimates: \hat{p_{1}}, which keeps changing as we add more samples, getting better as n increases; and \hat{p_{2}}, the estimate for each store. Since we are visiting 1000 stores, \hat{p_{2}} is a distribution of size 1000. It represents the overall distribution of the proportion of dummies across all our stores.”

Wilfred seems to be speaking to himself. Ken was only interested in counting the candies.

Seeing him indifferent, Wilfred showed him the following video on his screen.

Ken’s face lit up. “It is wobbly at the beginning, but seems to stabilize at 0.02,” he said.

“Yes, as the sample size increases, the estimated proportion of dummies is better represented, so the estimate gets closer to the true value. It looks like our true proportion of dummies is somewhere around 2% – 0.02024 to be precise.”

This is the consistency criterion of the estimates.

\displaystyle{\lim_{n\to\infty} P(|T_{n}(\theta)-\theta|>\epsilon)} = 0

Estimators are functions of the sample size. As n approaches infinity, the sample estimate approaches the true parameter.
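You can watch this definition in action with a quick simulation. Assuming a true dummy proportion of 0.02 (the value the story converges to; an assumption for this sketch), draw Bernoulli samples and track \hat{p} = r/n as n grows:

```r
# Consistency sketch: simulate dummy (1) / candy (0) draws with an
# assumed true proportion of 0.02, then watch r/n stabilize as n grows.
set.seed(3)
true_p = 0.02
draws = rbinom(100000, size = 1, prob = true_p)

n = c(100, 1000, 10000, 100000)
phat = sapply(n, function(k) sum(draws[1:k]) / k)
round(phat, 4)  # the estimates settle near 0.02 as n increases
```

The early estimates are wobbly, just like in the video, and the later ones hug the assumed truth.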

Fisher put it this way.

Wilfred also showed another graphic, the distribution of \hat{p_{2}}.

This is the frequency distribution of the estimated proportion from each sample. For example, \hat{p_{2}} for the first store is 0.02, for the second store it is 0, and so on.

Wilfred pointed out that \frac{r}{n} is an unbiased estimator of the true proportion of dummies, i.e., E[\hat{p}] = p. The distribution of the estimate \hat{p_{2}} will be centered on the true proportion p.

He is showing the expected value of the distribution, 0.02024, as the red triangle in the plot.

E[\hat{p_{2}}] = 0.02024, the same value we got from the large sample (population).

While there is variability from store to store, the average proportion of dummies is 0.02024.
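Unbiasedness, too, can be checked with a small simulation. Assuming the same true proportion of 0.02, generate 1000 stores of 100 candies each and average the per-store estimates:

```r
# Unbiasedness sketch: per-store estimates vary, but their average sits
# near the true proportion. The true p = 0.02 is an assumed value.
set.seed(4)
true_p = 0.02
phat2 = replicate(1000, mean(rbinom(100, size = 1, prob = true_p)))

mean(phat2)  # close to 0.02, the assumed truth
hist(phat2)  # the distribution is centered on true_p
```

Individual stores can report 0, 0.01, or 0.05, yet the expected value lands on the truth, which is exactly what the red triangle in the plot marks.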

Keneth is now curious about how to do these calculations and make such visuals.

Wilfred showed him his code.

# Proportions #

nstores = ncol(candy_samples)  # number of stores
ncandies = nrow(candy_samples) # number of candies per store

increase_sample = NULL # grows as more and more samples are added --> approaches the population
n = NULL               # growing sample size

phat1_bad = NULL # proportion changes over samples and stabilizes: consistency (ergodicity)
phat2_bad = matrix(NA, nrow = nstores, ncol = 1) # proportion for each store; the expected value of this vector should be the population proportion E[phat]

for (i in 1:nstores)
{
  store_sample = candy_samples[, i]
  increase_sample = c(increase_sample, store_sample)
  n = c(n, length(increase_sample))

  # estimate from all candies collected so far
  p1 = length(which(increase_sample < 1.5)) / n[i]
  phat1_bad = c(phat1_bad, p1)

  # estimate from this store alone
  p2 = length(which(store_sample < 1.5)) / ncandies
  phat2_bad[i, 1] = p2
}

# plot showing the consistency property #
sam = seq(from = 1, to = 1000, by = 20)
for (i in 1:length(sam))
{
  plot(0, 0, col = "white", xlim = c(0, 100000), ylim = c(0.005, 0.03),
       font = 2, cex.lab = 1.25, font.lab = 2,
       xlab = "Sample Size", ylab = "Proportion")
  lines(n[1:sam[i]], phat1_bad[1:sam[i]])
  txt = paste("n = ", n[sam[i]])
  text(10000, 0.03, txt, col = "red")
  Sys.sleep(0.5)
}

# Histogram of phat2_bad #
hist(phat2_bad, font = 2, main = "", xlab = "Proportion", font.lab = 2, col = "grey")
points(mean(phat2_bad), 0.0001, pch = 17, cex = 2.5, col = "red")

While Ken and Wilfred start exploring the candy characteristics – the true nature of the population – you can set up today’s experiment yourself and understand the properties of consistency and bias more closely using the data. See you in one week. Happy coding.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

