Lesson 69 – The quest for truth: Learning estimators in R, part I

For the fifteenth time, Mr. Alfred Quality looks at the two candies in front of him and shakes his head. He is visibly disgruntled. He has spent 50 years selling quality candies in North Dakota. The kids love them. Quality Candies are local and classic North Dakotan. Now, this new “Classic Series Pineapple Candy” seems to have deviated from his expectations.

There is a knock on the door. “That must be Wilfred,” thinks Mr. Quality as he calls out in an elevated voice, “Come on in.”

Wilfred Quality, son of Alfred Quality, gently opens the door and walks in. He is well dressed and looks around 50 years old. He notices his father’s displeasure as he steps closer.

“Son, I have noticed a lapse in quality in this Classic Series Candy. There is a candy wrapper with no candy in it. What would our customers think? We need immediate action on this. I want to know what proportion of our candies in the stores are dummies. We have to fix this before the customers accuse us of cheating by filling the packs with empty candies. Is this a packing machine error, or is it a trick to adjust for inflation?”

“I will look into it. It must be a machine error. There would be no such inflation adjustment tricks, I assure you,” said Wilfred, the current CEO of Quality Candies. He has been running the business very efficiently since his father took up golf seriously.

“I will ask Junior to collect the data and give us the answers,” he said.

∞∞∞

Wilfred Quality Jr. is an enthusiastic lad, always on a quest for the truth. This summer, he is interning at one of his father’s offices in Hillsboro. He is unhappy about the quality issue and is determined to find the answers. In his mind, he is also thinking this would make a neat little data analysis project.

His father and grandfather gave him three questions to investigate.

What proportion of their candies is wrongly packed – dummies?

Is it a one-off error or is it systematic?

What is the true nature of our candies, i.e., what are the population characteristics?

He had 30 days, enough monetary resources (it’s his company after all), and RStudio at his disposal.

∞∞∞

For the next 25 days, he spent his time visiting around 40 local stores a day. From each store, he purchased a pack (100 candies) of “Classic Series Pineapple Candy.”

He also hired Keneth, the kid next door, to help him with some experiments. Keneth’s job was to take the candies out of each pack and separate the real candies from the dummies, like this.

He would also weigh them and enter the readings into a data table. Some of those candies become his prizes after he is done with the experiment.

Wilfred Jr. then analyzes them in RStudio.

All in all, they visited 1000 stores and prepared this data file. The data file is a matrix with 100 rows and 1000 columns. Each column is the data of the weight of the candies from each store.

They take the data from the first store and make a simple dot chart.

“See, the two dummies I found,” said Ken after looking at the plot on Wilfred’s screen.

“So that is like 2 in 100, a 2% error. Let’s take the second store and see what comes up,” said Wilfred.

“Hmm. There are no dummies in this sample. That’s 2 in 200 then – 1% error.”

“Should we plot a few stores and see if there are variations?” said Ken.

“Yes, we can do that. Let me take the first 12 stores and make the plots.” Wilfred types the following lines in RStudio and gets the plot.

# dot charts for the first 12 stores #
par(mfrow = c(4, 3))   # 4 x 3 grid of panels
for (i in 1:12)
{
  store_sample = candy_samples[, i]   # weights of the candies from store i
  stripchart(store_sample, font = 2, cex = 1.25, pch = 21, cex.axis = 1.5)
}

They see some variation. Some candy samples have dummies, some do not. “It looks like the dummies are not a one-off incident. It is a pattern. There is some proportion that are dummies,” thought Wilfred.

They decided to dig deeper.

They also decided to use around 1.5 grams as the cutoff weight to screen for dummies.

If a candy weighs less than 1.5 grams, they will call it a dummy and compute the proportion of candies that are dummies.

Wilfred Jr. knows that the maximum likelihood estimator of p, the true proportion of dummies, is \hat{p}=\frac{r}{n},

where r is the number of dummies in a sequence of n independent and identically distributed random samples. He is using this knowledge to estimate the proportion of dummies.
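As a quick illustration in R, this is how that estimate would be computed for a single store, assuming the candy_samples matrix described above (one column of 100 candy weights, in grams, per store) is already loaded:

# estimate of the dummy proportion for one store #
store_sample = candy_samples[, 1]        # weights from the first store
r = length(which(store_sample < 1.5))    # number of dummies (below the 1.5 g cutoff)
n = length(store_sample)                 # sample size, 100 candies
phat = r / n                             # maximum likelihood estimate of the dummy proportion
phat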

The more the samples, the better his estimate will be.

Each store he visits gives him a new sample. The total number of candies keeps increasing. The more samples he gets, the closer he gets to the true population.

As the samples increase, the estimate gets closer to the truth.

“Let’s compute two estimates: \hat{p_{1}}, which keeps changing as we add more samples, getting better as n increases; and \hat{p_{2}}, the estimate for each store. Since we are visiting 1000 stores, \hat{p_{2}} is a distribution of size 1000. It represents the overall distribution of the proportion of dummies across all our stores.”

Wilfred seems to be speaking to himself. Ken is only interested in counting the candies.

Seeing him indifferent, Wilfred showed him the following video on his screen.

Ken’s face lit up. “It is wobbly at the beginning, but seems to stabilize at 0.02,” he said.

“Yes, as the sample size increases, the estimated proportion of dummies is better represented. So, the estimate gets closer to the true value. It looks like our true proportion of dummies is somewhere around 2%. 0.02024 to be precise.”

This is the consistency criterion of the estimates.

\displaystyle{\lim_{n\to\infty} P(|T_{n}(\theta)-\theta|>\epsilon)} = 0

Estimators are a function of sample size. As n approaches infinity, the sample estimate approaches the true parameter.

Fisher put it this way.
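If you do not have the candy data at hand, you can still watch this stabilizing behavior with simulated candies. A minimal sketch, assuming a true dummy proportion of 0.02 purely for illustration:

# consistency with simulated candies: the running estimate settles on the truth #
set.seed(1)
p_true = 0.02                          # assumed true proportion of dummies
x = rbinom(100000, 1, p_true)          # 100,000 simulated candies: 1 = dummy, 0 = real candy
phat_running = cumsum(x) / (1:100000)  # running estimate r/n as the sample grows
plot(1:100000, phat_running, type = "l", log = "x", xlab = "Sample Size", ylab = "Estimated Proportion")
abline(h = p_true, col = "red", lty = 2)  # the truth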

Wilfred also showed another graphic, the distribution of \hat{p_{2}}.

This is the frequency distribution of the estimated proportion from each sample. For example, \hat{p_{2}} for the first store is 0.02, for the second store is 0 and so on.

Wilfred pointed out that the estimate \frac{r}{n} is an unbiased estimate of the true proportion of dummies. The distribution of the estimate \hat{p_{2}} will be centered on the true proportion p.

He is showing the expected value of the distribution, 0.02024, as the red triangle in the plot.

E[\hat{p_{2}}] = 0.02024, the same value we got from the large sample (population).

While there is variability from store to store, the average proportion of dummies is 0.02024.

Keneth is now curious about how to do these calculations and make such visuals.

Wilfred showed him his code.

# Proportions #

nstores = ncol(candy_samples)  # number of stores
ncandies = nrow(candy_samples) # number of candies per store

increase_sample = NULL # empty vector, filled as more and more samples are added --> approaches the population
n = NULL               # empty vector, filled with the growing sample size

phat1_bad = NULL # proportion changes over samples... and stabilizes: Consistency or Ergodicity
phat2_bad = matrix(NA, nrow = nstores, ncol = 1) # proportion for each store; the expected value of this vector should be the population proportion E[phat]

for (i in 1:nstores)
{
  store_sample = candy_samples[, i]
  increase_sample = c(increase_sample, store_sample) # pool all the candies seen so far
  n = c(n, length(increase_sample))                  # cumulative sample size

  p1 = length(which(increase_sample < 1.5)) / n[i]   # dummies so far / candies so far
  phat1_bad = c(phat1_bad, p1)

  p2 = length(which(store_sample < 1.5)) / ncandies  # dummies in this store / 100
  phat2_bad[i, 1] = p2
}

# plot showing the consistency property #
sam = seq(from = 1, to = 1000, by = 20)
for (i in 1:length(sam))
{
  plot(0, 0, col = "white", xlim = c(0, 100000), ylim = c(0.005, 0.03), font = 2, cex.lab = 1.25, font.lab = 2, xlab = "Sample Size", ylab = "Proportion")
  lines(n[1:sam[i]], phat1_bad[1:sam[i]])
  txt = paste("n = ", n[sam[i]])
  text(10000, 0.03, txt, col = "red")
  Sys.sleep(0.5)
}

# Histogram of phat2_bad #
hist(phat2_bad, font = 2, main = "", xlab = "Proportion", font.lab = 2, col = "grey")
points(mean(phat2_bad), 0.0001, pch = 17, cex = 2.5, col = "red")

While Ken and Wilfred start exploring the candy characteristics – the true nature of the population – you can set up today’s experiment yourself and understand the properties of consistency and bias more closely using the data. See you in one week. Happy coding.


Lesson 68 – Bias joins Variance

In last week’s “guess the age” contest, Madam Chairman Yellen was the clear winner. The expected value of the estimated age was 69.69 years, a nominal bias of 1.31 years from her true age.

However, the great Thomas Sowell is the clear winner of our hearts. May he always remain a generation young, like his ideas.

Look at the table once again.

Can you notice the variation in the estimates of the groups?

Yes, Chairman Yellen’s estimated age ranges from 62 years to 78 years, while Dr. Sowell’s estimated age ranges from 55 years to 70 years.

Based on the data from the 16 groups, I computed the variance (and the standard deviation) of the estimated ages.

For Chairman Yellen, the variance of the estimated age is 22.10 and the standard deviation of the estimated age is \sqrt{22.10}=4.7 years.

For Dr. Sowell, the variance is 23.98 and the standard deviation is \sqrt{23.98}=4.9 years.

The variance of the estimator is an important property. It tells us about the spread of the estimate, i.e., how widely the estimate is distributed.

For any estimator \hat{\theta}, we compute the expected value E[\hat{\theta}] as a measure of the central tendency and how far it could be from the truth – Bias, and V[\hat{\theta}] as a measure of the spread of the distribution.

An estimator should be understood as a probability distribution.

The standard deviation of the estimator is called the standard error of an estimator. In our age guessing experiment, the standard error of Dr. Sowell’s estimated age is 4.9 years.

The standard error of \hat{\theta} is \sqrt{V[\hat{\theta}]}.


In lesson 67, while learning the concept of bias, we deduced that the sample mean \hat{\mu} = \bar{x} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is an unbiased estimate of the population mean \mu.

In other words, we derived that E[\hat{\mu}] = \mu .

Now, let’s derive the variance of the sample mean V[\hat{\mu}] or V[\bar{x}].

For this, let us consider that we have a random sample x_{1}, x_{2}, ..., x_{n} of size n from a population with a mean \mu and variance \sigma^{2}.

Since x_{1}, x_{2}, ..., x_{n} are independent and identically distributed random samples from the population with true parameters \mu and \sigma^{2}, they will have the same expected value and variance as the population. E[x_{i}]=\mu and V[x_{i}]=\sigma^{2}.

\hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is the sample mean.

V[\hat{\mu}] = V[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]

V[\hat{\mu}] = V[\frac{1}{n}(x_{1}+x_{2}+...+x_{n})]

V[\hat{\mu}] = \frac{1}{n^{2}}V[(x_{1}+x_{2}+...+x_{n})]

Where did that come from? Was it lesson 27? V[aX]=a^{2}V[X]

V[\hat{\mu}] = \frac{1}{n^{2}}(V[x_{1}]+V[x_{2}]+...+V[x_{n}])

Because the samples are independent, the covariance terms do not exist in the summation.

 V[\hat{\mu}] = \frac{1}{n^{2}}(\sigma^{2}+\sigma^{2}+...+\sigma^{2})=\frac{1}{n^{2}}*n \sigma^{2}=\frac{\sigma^{2}}{n}

The variance of the sample mean V[\hat{\mu}] = V[\bar{x}] =\frac{\sigma^{2}}{n}.

The standard error of the sample mean is \frac{\sigma}{\sqrt{n}}. It is a measure of the precision of the estimator.
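Here is a quick simulation sketch to check this result; the population values μ = 10 and σ = 2 and the sample size n = 25 are assumptions chosen only for illustration:

# standard error of the sample mean: sd of many sample means vs sigma/sqrt(n) #
set.seed(2)
mu = 10; sigma = 2; n = 25
xbar = replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))  # 10,000 sample means
sd(xbar)          # empirical standard error
sigma / sqrt(n)   # theoretical standard error = 0.4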


Joe has a question

Do you remember last week’s dilemma?

Joe wants to choose a path; an estimator for the true population mean.

We tried to help him by computing the bias of each of the three estimators. The idea was to choose the one that has no bias. But it turned out that all three estimators are unbiased.

How then, can we select the best estimator?

Look at the estimators again.

\hat{\mu_{1}} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

\hat{\mu_{2}} = x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}

\hat{\mu_{3}} = median(x_{1},x_{2},...,x_{n})

I told you that estimators should be understood as probability distributions and that we can compute the expected value E[\hat{\theta}] for the bias, and V[\hat{\theta}] to measure of the spread of the distribution.

Let’s compute the variance of these three estimators.

V[\hat{\mu_{1}}] = V[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]

\hat{\mu_{1}} is the sample mean. We just derived its variance as V[\hat{\mu_{1}}] = \frac{\sigma^{2}}{n}.

So let’s work on the second estimator.

V[\hat{\mu_{2}}] = V[x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}]

V[\hat{\mu_{2}}] = V[ x_{[1]} + \frac{1}{2}x_{[n]} - \frac{1}{2}x_{[1]}] = V[\frac{1}{2}x_{[1]} + \frac{1}{2}x_{[n]}]

V[\hat{\mu_{2}}] = \frac{1}{4}\sigma^{2} + \frac{1}{4}\sigma^{2}

V[\hat{\mu_{2}}] = \frac{1}{2}\sigma^{2}

The variance of the third estimator \hat{\mu_{3}} will be \sigma^{2} if the sample size is odd-numbered, or \frac{1}{2}\sigma^{2} if the sample size is even-numbered.

So, the three estimators \hat{\mu_{1}}, \hat{\mu_{2}}, \hat{\mu_{3}} are all unbiased. Their variances are as follows:

V[\hat{\mu_{1}}]=\frac{\sigma^{2}}{n}

V[\hat{\mu_{2}}]=\frac{\sigma^{2}}{2}

V[\hat{\mu_{3}}]=\sigma^{2} or \frac{\sigma^{2}}{2}

The three estimator probability distributions are centered on the truth (\mu), but their spreads are different. If the sample size n is greater than 2 (n>2), estimator \hat{\mu_{1}} has the lowest variance.

They will look like this visually.

Among all the estimators that are unbiased, we choose the one which has minimum variance. \hat{\mu_{1}} is the minimum variance unbiased estimator. It is more likely to produce an estimate close to \mu, the truth.

Among all estimators which are unbiased, choose the one which has minimum variance. This chosen \hat{\theta} is called the minimum variance unbiased estimator (MVUE) of \theta, and it is most likely among all unbiased estimators to produce an estimate close to the truth. This principle is called the principle of minimum variance unbiased estimation.
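You can see these spreads with a small simulation. The sketch below assumes a normal population with μ = 50 and σ = 10 and a sample size of 30, chosen only for illustration; the exact variances of the second and third estimators depend on the population, but the sample mean comes out with the smallest spread:

# sampling distributions of the three estimators of the true mean #
set.seed(3)
mu = 50; sigma = 10; n = 30; ntrials = 10000
mu1 = mu2 = mu3 = numeric(ntrials)
for (i in 1:ntrials)
{
  x = rnorm(n, mean = mu, sd = sigma)
  mu1[i] = mean(x)                        # sample mean
  mu2[i] = min(x) + (max(x) - min(x))/2   # smallest value plus half the range
  mu3[i] = median(x)                      # sample median
}
c(mean(mu1), mean(mu2), mean(mu3))   # all three are centered near the truth, 50
c(var(mu1), var(mu2), var(mu3))      # the sample mean has the smallest variance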


You must be wondering what if an estimator has low variance, but is biased. Like this.

Isn’t \hat{\theta_{1}} more likely to produce an estimate close to the truth?

Perhaps.

So there should be a way to combine bias and variance into one measure to help us with the selection.

The mean squared error is that measure. Let’s see how to derive this measure.

We have an estimator \hat{\theta} for the true parameter \theta.

The error of this estimator from the truth is \hat{\theta}-\theta.

The squared error is (\hat{\theta}-\theta)^{2}.

The mean squared error (MSE) is the expected value of the squared error.

E[(\hat{\theta}-\theta)^{2}]

MSE = E[(\hat{\theta}-\theta)^{2}]

MSE = V[\hat{\theta}-\theta] + (E[\hat{\theta}-\theta])^{2}

Can you tell how we wrote the above expression?

MSE = V[\hat{\theta}] + V[\theta] + (E[\hat{\theta}] - E[\theta])^{2}

MSE = V[\hat{\theta}] + (E[\hat{\theta}] - \theta)^{2}

This is because E[\theta]=\theta and V[\theta]=0.

V[\hat{\theta}] is the variance of the estimator.

E[\hat{\theta}] - \theta is the bias of the estimator.

So, MSE = Variance + Bias^{2}

For unbiased estimators, the mean squared error is equal to the variance of the estimator.

We can use this measure to compare two estimators. If MSE(\hat{\theta_{1}})<MSE(\hat{\theta_{2}}) we can say that \hat{\theta_{1}} is a better estimator, a more efficient estimator in producing the truth.

The ratio \frac{MSE(\hat{\theta_{1}})}{MSE(\hat{\theta_{2}})} is called relative efficiency of \hat{\theta_{2}} to \hat{\theta_{1}}.
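A simulation sketch of these two ideas, comparing the sample mean with a deliberately biased but lower-variance alternative, 0.9 times the sample mean; the shrinkage factor and the population values μ = 5, σ = 10, n = 10 are assumptions chosen only for illustration:

# MSE = Variance + Bias^2, and relative efficiency, by simulation #
set.seed(4)
mu = 5; sigma = 10; n = 10; ntrials = 100000
theta1 = replicate(ntrials, mean(rnorm(n, mean = mu, sd = sigma)))  # unbiased sample mean
theta2 = 0.9 * theta1                                               # biased, lower-variance alternative
mse1 = var(theta1) + (mean(theta1) - mu)^2   # about sigma^2/n = 10; bias is nearly zero
mse2 = var(theta2) + (mean(theta2) - mu)^2   # variance about 8.1 plus squared bias about 0.25
c(mse1, mse2)    # here the reduction in variance outweighs the squared bias
mse1 / mse2      # relative efficiency of theta2 to theta1, greater than 1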

If you can create an estimator that has lower variance but some bias, you can go for it, as long as the reduction in variance outweighs the added squared bias.

In estimating true parameters, there is always a tradeoff between bias and variance.

To close off for the week, let me ask you three questions.

If \frac{MSE(\hat{\theta_{1}})}{MSE(\hat{\theta_{2}})} is the relative efficiency, what is consistency?

Is the estimator for variance \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}} unbiased? Did you do your homework?

When was the last time you were on RStudio?


Lesson 67 – Bias

“Nooooo,” “No way…” “Ooooh,” “Yessss,” “I told ya,” “I knew it.”

These are some of the funny reactions from our high-school students when the true age is finally revealed; because the bias in their estimation is also revealed.

Every year, I start off the data analysis boot camp with this game from the bag of tricks for teaching statistics by Andrew Gelman.

The students sit in groups of four and look at rolling pictures to guess their ages. They have their own formulas (estimators) for the age!

What you saw in the opening image are the guessed ages of the former Chairman of the Federal Reserve, Janet Yellen, and the great economist Thomas Sowell.

Do you see anything peculiar?

Yes. The estimated age of Chairman Yellen is somewhat close to her true age (as per Wikipedia) and the estimated age of Dr. Sowell is not even close (again, as per Wikipedia).

Let’s compute the average of all the guesses, the expected value of the estimated age.

For Chairman Yellen it is 69.69 years, a difference of 1.31 years from her true age.

Pretty close to the truth. The students’ estimate has very little bias.

For Dr. Sowell, it is 61.88 years, a difference of 25.13 years from his true age.

Far from the truth. The students’ estimate is biased.

Let’s spend a little more time on the estimated ages.

There are 16 groups that guessed the ages based on a picture of the person. Let us use a notation \hat{\theta} for this estimate. So, after heavy thinking and vehement deliberation within the group, the estimate from group A is \hat{\theta}=73 for Chairman Yellen.

Group B is also filled with thinkers and negotiators, so their estimate for Chairman Yellen’s age is \hat{\theta}=70.

Likewise, the 16 groups provided a distribution of estimates. The expected value E[\hat{\theta}] of this distribution is 69.69 years.

The true age \theta is 71 years.

The bias is E[\hat{\theta}]-\theta=-1.31

Look at this visual. I am taking the 16 estimates and plotting them as a histogram (or a probability density). The true age is also shown in the histogram. The estimates are distributed about the true age. The mean of this distribution is very close to 71.

Dr. Sowell, on the other hand, somehow seemed younger in the minds of the high-school students. The expected value of the estimate E[\hat{\theta}] is 61.88 years. The true age is 87 years, a bias (E[\hat{\theta}]-\theta) of around 25 years. The bias could be because of the picture I am using, his pleasant smile, or his aura!

This idea of measuring the closeness of the estimate to the true parameter is called the bias of the estimate.

An estimator (\hat{\theta}) (i.e., a formulation to compute the true population parameter) is unbiased if on average it produces values that are close to the truth. E[\hat{\theta}]=\theta. In other words, if we have a distribution of the estimates, and if their mean is close (or centered) to the truth, the estimator is unbiased.
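If you want to repeat the bias calculation yourself, here is a minimal sketch. The guesses vector below is hypothetical (only the 73 and 70 from groups A and B appear in the story) and stands in for the 16 group estimates:

# bias = E[theta_hat] - theta, from a vector of estimates #
guesses = c(73, 70, 68, 72, 65, 74)   # hypothetical guesses; replace with the 16 group estimates
true_age = 71
mean(guesses) - true_age              # the bias of the estimated age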

♦Θ♦


In lesson 63, we learned that the maximum likelihood estimator of p, the true probability of success for a Binomial distribution, is \hat{p}=\frac{r}{n}, where r is the number of successes in a sequence of n independent and identically distributed Bernoulli trials. Do you think \hat{p} is unbiased?

Let’s check it out.

The bias of \hat{p} is defined by E[\hat{p}]-p.

E[\hat{p}]=E[\frac{r}{n}]=\frac{1}{n}E[r]=\frac{1}{n}*np=p

The key point in the above derivation is that the expected value of the number of successes (r) for a binomial distribution is np.

The estimate \hat{p}=\frac{r}{n} is an unbiased estimate of the true probability of success.

The distribution of the estimate \hat{p} will be centered on the true probability p.
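A short simulation sketch of this unbiasedness; the true p of 0.2 and the sample size of 50 are assumptions for illustration:

# E[r/n] = p: the average of many estimates sits on the true probability #
set.seed(5)
p = 0.2; n = 50
phat = rbinom(10000, n, p) / n   # 10,000 estimates, each from a sample of size n
mean(phat)                       # close to 0.2
hist(phat, main = "", xlab = "Estimated p", col = "grey")
abline(v = p, col = "red", lwd = 2)   # the truth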

§⇔§

The maximum likelihood estimators for the Normal distribution are \hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} and \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}. Do you think they are unbiased?

\hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is the sample mean. Let’s compute the bias of this estimate.

E[\hat{\mu}] = E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]

E[\hat{\mu}] = E[\frac{1}{n}(x_{1}+x_{2}+...+x_{n})]

E[\hat{\mu}] = \frac{1}{n}E[(x_{1}+x_{2}+...+x_{n})]

E[\hat{\mu}] = \frac{1}{n}(E[x_{1}]+E[x_{2}]+...+E[x_{n}])

Since x_{1}, x_{2}, ..., x_{n} are random samples from the population with true parameters \mu and \sigma, E[x_{i}]=\mu and V[x_{i}]=\sigma^{2}. In other words, since x_{i}s are independent and identically distributed, they will have the same expected value and the variance as the population.

So, E[\hat{\mu}] = \frac{1}{n}(\mu+\mu+...+\mu)=\frac{1}{n}*n \mu=\mu

The sample mean \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is an unbiased estimate of the population mean \mu.
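A quick simulation sketch of this result, with an assumed population of μ = 50 and σ = 10:

# E[sample mean] = mu: the average of many sample means sits on the true mean #
set.seed(6)
mu = 50; sigma = 10; n = 30
xbar = replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))  # 10,000 sample means
mean(xbar)   # close to 50: the sample mean is unbiased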

What about the variance estimator \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}?

You know the chore now. Compute the bias of this estimator and find out for yourself. Passive minds have to wait until next week.

♦Θ♦


Joe is a smart chap. The other day, he asked me if he could use any other estimator to get the population mean (truth). I probed him further, and he came up with three estimators to estimate the true mean \mu.

\hat{\mu_{1}} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

\hat{\mu_{2}} = x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}

\hat{\mu_{3}} = median(x_{1},x_{2},...,x_{n})

The first estimator \hat{\mu_{1}} is known to all of us. It is the sample mean.

The second estimator \hat{\mu_{2}} is computing the central tendency as the sum of the smallest value x_{[1]} and half the range of the data \frac{x_{[n]}-x_{[1]}}{2}. It is a reasonable way to tell where the center will be.

The third estimator \hat{\mu_{3}} is the median of the sample.

Can you help Joe choose the correct path?

Your immediate reaction would be, “compute the bias for each estimator and select the one that has the smallest bias.”

Fair enough. Let’s try that then.

We already showed that the first estimator \hat{\mu_{1}} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is unbiased.

Employing the bias computation for the second estimator, we can see that

E[\hat{\mu_{2}}] = E[x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}] = E[x_{[1]}] + \frac{1}{2}(E[x_{[n]}]-E[x_{[1]}])

E[\hat{\mu_{2}}] = \mu + \frac{1}{2}(\mu-\mu)=\mu

The second estimator is also unbiased.

Now, depending on whether there is an even number of samples or an odd number of samples, you can derive that the third estimator is also unbiased.

How then, can Joe select the best estimator?

To be continued…


Lesson 66 – Spare a moment

Hello! Do you have a moment?

What is the first thing that comes to your mind when you hear the word “moment?”

For some of you, it is “let me capture and share the moment.”

For some of you, it is the coffee cup cliches — “live in the moment,” “enjoy every moment,” and “be happy for this moment.”

Some of you may actually be believing in the “power of the present moment” or “don’t wait for the perfect moment, take the moment and make it perfect” stuff!

How many of you remembered torque from your high school physics lesson?

The torque of a force, or moment of a force about the axis of rotation?

\Gamma = \textbf{r} * \textbf{F}

Moment of a force is the product of the force and its distance from an axis. It is a measure of the tendency of the force to cause a body to rotate.

I am sure you are thinking that the moment you spared for me is not worth it. You don’t want to recollect those moments from the horrid high school days.

Bear with me for a few more moments. You will see why we are talking moments in data analysis.

You know by now that we can use a sample to estimate the true value of the parameter of the population. The formulation used to get that estimate is called an estimator, and there are methods of estimation of these parameters. In lesson 63, lesson 64 and lesson 65, we dabbled with the method of Maximum Likelihood Estimation (MLE).

Another method of estimating population parameters from the sample is the Method of Moments.

Let’s see how the idea of the moment from physics is related to the Method of Moments.

Assume that we have a sample of five data points x_{1}, x_{2}, x_{3}, x_{4}, x_{5} from a population with a probability density function f_{X}(x). Let’s place them on a number line like this.

Imagine that each number is represented by an equal-weighted ball. Since each data point is independent and identically distributed, i.e., equally likely, it is reasonable to assume that the probability of their individual occurrences (\frac{1}{n}) is the mass of the data point.

For any data point, the torque, or the moment about the axis of rotation is \Gamma = \textbf{r} * \textbf{F}.

For example, a data point at a distance of x_{i} from the axis has a moment of x_{i}*\frac{1}{n}. I am dropping the standard gravity term, and you will see why shortly.

Since there is more than one force acting on the sample number line, we can compute the total torque as \Gamma = {\displaystyle \sum_{i=1}^{n}r_{i} * F_{i}}.

In other words, the total moment of the sample data about the axis of rotation is {\displaystyle \sum_{i=1}^{n}x_{i}*\frac{1}{n}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}.

Did you notice that this is the equation for the sample mean? \bar{x}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}. The centroid of the data points.

The population has a probability density function f_{X}(x).

We saw in Lesson 64 on how to divide the space into equal intervals of length dx. The probability that a given sample data point x_{i} falls in the range dx is f_{X}(x_{i})dx for an infinitesimally small range dx. Like in the sample case, we can assume this is the mass of any subdivision. Hence, the moment of this infinitesimal body about the rotation axis is x_{i}*f_{X}(x_{i})dx.

For the entire population, the total moment is {\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx}.

I know. You will tell me that this is the expected value E[X], or the centroid of the probability density function.

E[X] = {\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx}

If the five data points x_{1}, x_{2}, x_{3}, x_{4}, x_{5} are originating from a population with a probability density function f_{X}(x), it is reasonable to equate the total moment of the sample data to the total moment of the population data. The centroid of the sample is equal to the centroid of the probability density functions.

{\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx} =\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}


E[X] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}


Let me pester you again!

What is the moment of inertia?

No worries. Let’s activate your deep memory neurons. Imagine that the sample data points are rotating on the plane about the axis with a velocity v = r\omega, where \omega is the constant angular velocity of the plane. In our example of five data points, the velocity of any given point will be v_{i} = x_{i}\omega.

If the body is rotating, it will be subject to a tangential force of m\frac{dv}{dt}. For our sample point which is at a radial distance of x_{i} from the axis, the tangential force will be \frac{1}{n}\frac{dv}{dt} = \frac{1}{n}x_{i}\frac{d \omega}{dt}.

To simplify life, let’s call \frac{d \omega}{dt}=\alpha, the angular acceleration of the point.

The tangential force is \frac{1}{n}x_{i}\alpha, and the torque of the tangential force applied on one point is x_{i}*\frac{1}{n}x_{i}\alpha = \frac{1}{n}x_{i}^{2}\alpha.

The total torque of all the tangential forces = {\displaystyle \sum_{i=1}^{n}\frac{1}{n}x_{i}^{2}\alpha}.

The term {\displaystyle \sum_{i=1}^{n}\frac{1}{n}x_{i}^{2}} is called the moment of inertia of the points.

Just looking at the equation, you can say that the moment of inertia of points far apart is larger than the moment of inertia of the points closer to the axis.

This moment of inertia is called the second moment and it gives us a sense of how spread out the data points are about the axis.

Sample moment of inertia = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}.

By extension to the population, we can write the population moment of inertia, or the population second moment as

E[X^{2}]={\displaystyle \int_{- \infty}^{\infty}x^{2}f_{X}(x)dx}.

Notice the generalization E[X^{2}] for the second moment, just like E[X]={\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx} is the first moment.

Population moments are the expected values of the powers of the random variable X.

I can sense that you are charged up by the word “spread.” Yes, the second moment will be used to get the second parameter, the variance of the distribution.

We can equate the sample moment of inertia to the population moment of inertia.


E[X^{2}] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}


This form of equating sample moments to the population moments is called the Method of Moments for parameter estimation.

It was developed by Karl Pearson in the 1890s. Pearson demonstrated this method by fitting different forms of frequency curves (probability density functions) to data by calculating as many moments from the sample data as there are parameters of the probability density function.

A generalized representation of the Method of Moments can be done like this.


E[X^{k}] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{k}}


To estimate the k parameters of the probability density function, we can equate the k population moments to the k sample moments.

If the probability density function has one parameter, the first moment equation is sufficient. If the function has two parameters, we have the first and the second moment equations — two unknowns, two equations to solve. If the function has three parameters, we go for three moment equations.

Time for some action

Let’s solve for the parameters of the normal distribution.

There is a sample of n data points x_{1}, x_{2}, x_{3}, ..., x_{n} that we think originated from a normal distribution (population) with a probability density function f_{X}(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-1}{2}(\frac{x-\mu}{\sigma})^{2}}.

x_{1}, x_{2}, x_{3}, ..., x_{n} \sim N(\mu, \sigma)

\mu and \sigma are the parameters of the normal probability density function. Let’s apply the Method of Moments to estimate these parameters from the sample. In other words, let’s formulate the equations for \mu and \sigma.

The first population moment is E[X^{1}]. The first sample moment is \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{1}}.

For the normal distribution, we know that the expected value E[X] is the mean \mu.

\mu = E[X] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

The second population moment is E[X^{2}]. The second sample moment is \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}.

E[X^{2}]. Hmm. You have to go down the memory lane, to Lesson 26 to realize that E[X^{2}]=V[X] + (E[X])^{2}.

.
.
.
Did you see that? Yes, so E[X^{2}]=\sigma^{2} + \mu^{2}. Equating this to the sample second moment, we get

\sigma^{2} + \mu^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}

\sigma^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}} - \mu^{2}

\sigma^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}} - (\bar{x})^{2}

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - n\bar{x}^{2}\Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - 2n\bar{x}^{2} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - 2\bar{x}n\bar{x} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - 2\bar{x}\sum x_{i} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - \sum 2\bar{x} x_{i} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - \sum 2\bar{x} x_{i} + \sum \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \sum (x_{i}^{2} - 2\bar{x} x_{i} + \bar{x}^{2})

\sigma^{2} = \frac{1}{n} \sum (x_{i} - \bar{x})^{2}
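Here is a sketch of these two moment equations in R, applied to a simulated sample; the population values μ = 5 and σ = 2 are assumptions for illustration:

# Method of Moments estimates for the normal distribution #
set.seed(7)
x = rnorm(500, mean = 5, sd = 2)    # sample assumed to come from a normal population
mu_hat = mean(x)                    # first moment equation
sigma2_hat = mean(x^2) - mu_hat^2   # second moment equation: (1/n) sum(x^2) - xbar^2
c(mu_hat, sigma2_hat)               # sigma2_hat equals (1/n) * sum((x - mean(x))^2)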


Let’s spend a few more moments and solve for the parameters of the Gamma distribution.

The probability density function of the Gamma distribution is

f(x) = \frac{\lambda e^{-\lambda x}(\lambda x)^{r-1}}{(r-1)!}

It has two parameters, r and \lambda.

\lambda is called the scale parameter. It controls the width of the distribution, and r is called the shape parameter because it controls the shape of the distribution.

We know that the expected value of the Gamma distribution E[X] = \frac{r}{\lambda}. If you forgot this, solve the integral \int xf(x)dx for the Gamma distribution. Equating this first population moment to the sample moment, we can see that

\frac{r}{\lambda} = \frac{1}{n}\sum x_{i} = \bar{x}

r = \lambda \bar{x}

Then, we know that V[X] for the Gamma distribution is \frac{r}{\lambda^{2}}.

Using this with the second population moment, E[X^{2}]=V[X] + (E[X])^{2}, we can see that E[X^{2}] = \frac{r(r+1)}{\lambda^{2}}.

We can now equate the second population moment to the second sample moment.

E[X^{2}] = \frac{1}{n}\sum x_{i}^{2}

\frac{r(r+1)}{\lambda^{2}} = \frac{1}{n}\sum x_{i}^{2}

\frac{\lambda \bar{x} (\lambda \bar{x}+1)}{\lambda^{2}} = \frac{1}{n}\sum x_{i}^{2}

\frac{\bar{x}(\lambda \bar{x}+1)}{\lambda} = \frac{1}{n}\sum x_{i}^{2}

 \bar{x} + \lambda \bar{x}^{2} = \frac{\lambda}{n}\sum x_{i}^{2}

\lambda (\frac{1}{n}\sum x_{i}^{2} - \bar{x}^{2}) = \bar{x}

\lambda = \frac{\bar{x}}{\frac{1}{n}\sum(x_{i}-\bar{x})^{2}}

So,

r = \frac{\bar{x}^{2}}{\frac{1}{n}\sum(x_{i}-\bar{x})^{2}}

Very tiring, I know. You will rarely find all the algebra in textbooks. So, the moments are worth it.
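To see the Gamma estimators in action, here is a sketch on a simulated sample. R’s rgamma uses shape and rate, which correspond to r and \lambda in the notation above; the values r = 3 and \lambda = 2 are assumptions for illustration:

# Method of Moments estimates for the Gamma distribution #
set.seed(8)
x = rgamma(1000, shape = 3, rate = 2)   # sample with assumed r = 3, lambda = 2
xbar = mean(x)
s2 = mean((x - xbar)^2)                 # (1/n) * sum((x_i - xbar)^2)
lambda_hat = xbar / s2                  # lambda = xbar / [(1/n) sum (x_i - xbar)^2]
r_hat = xbar^2 / s2                     # r = xbar^2 / [(1/n) sum (x_i - xbar)^2]
c(r_hat, lambda_hat)                    # should be near 3 and 2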

I asked you to spare a moment. Now you are going “this is quite a moment.” What do you think? Would you remember the Method of Moments next time you spare a moment?

