Lesson 75 – Fiducia on the variance

This morning, Officer Jones is deployed to NY State Highway 17A to monitor speeding vehicles. It's been a rather damp morning; his computer has recorded the speeds of only 20 vehicles so far.

He is nicely tucked away in a corner where jammers cannot find him. With no vehicles in sight, he starts playing with the data.

He computes the average vehicle speed.

\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}

Mean vehicle speed \bar{x} = 50.10 miles per hour.

But officer Jones is a variance freak. He computes the sample variance and the sample standard deviation.

s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}

The sample variance is 5.9 mph^{2}, and the sample standard deviation is 2.43 mph.
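For readers who want to verify the arithmetic, here is a quick sketch in Python. The 20 speeds below are hypothetical stand-ins (the lesson does not list the raw data), so the printed numbers will differ from 50.10 and 5.9; only the formulas match the lesson.

```python
import numpy as np

# Hypothetical speeds in mph; Officer Jones' actual 20 observations are not listed
speeds = np.array([48.2, 51.5, 49.0, 53.1, 50.4, 47.8, 52.0, 49.6, 50.9, 46.5,
                   51.2, 48.8, 54.0, 49.3, 50.0, 47.2, 52.7, 49.9, 51.8, 48.5])

n = len(speeds)
x_bar = speeds.sum() / n                      # sample mean
s2 = ((speeds - x_bar) ** 2).sum() / (n - 1)  # sample variance, divisor n - 1
s = s2 ** 0.5                                 # sample standard deviation

print(round(x_bar, 2), round(s2, 2), round(s, 2))
```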

“I wonder how large this deviation can get,” thought Officer Jones.

Can we answer his question?

The answer lies in knowing the confidence interval on the variance.

We now know how to construct the confidence interval on the true mean based on the sample.

The interval [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}] is the 100(1-\alpha)\% confidence interval of the population mean \mu.

The basis for this is the idea that we can find an interval that has a certain probability of containing the truth.

In other words, a 100(1-\alpha)\% confidence interval for a parameter \theta is an interval [l, u] that has the property P(l \le \theta \le u) = 1-\alpha.

We can apply the same logic to the variance. Officer Jones is wondering about the true standard deviation \sigma or the true variance \sigma^{2}. He wonders if he can derive the interval for the true variance given his limited data.

How can we derive the confidence interval on the variance?

Remember what we did for deriving the confidence interval on the mean. We investigated the limiting distribution of the sample mean \bar{x}, and used the probability statement P(l \le \theta \le u) = 1-\alpha on this distribution to derive the end-points, the upper and lower confidence limits.

Let’s do the same thing here.

What is the limiting distribution for the sample variance?

.

.

.

Those who followed lesson 73 carefully will jump up to say that it is related to the Chi-square distribution.

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

If you did not, no problem. We will do it again.

Let’s take the equation of the sample variance s^{2} and find a pattern in it.

s^{2} = \frac{1}{n-1} \sum(x_{i}-\bar{x})^{2}

Move the n-1 over to the left-hand side and do some algebra.

(n-1)s^{2} = \sum(x_{i}-\bar{x})^{2}

(n-1)s^{2} = \sum(x_{i} - \mu -\bar{x} + \mu)^{2}

(n-1)s^{2} = \sum((x_{i} - \mu) -(\bar{x} - \mu))^{2}

(n-1)s^{2} = \sum[(x_{i} - \mu)^{2} + (\bar{x} - \mu)^{2} -2(x_{i} - \mu)(\bar{x} - \mu)]

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + \sum (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)\sum(x_{i} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(\sum x_{i} - \sum \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(n\bar{x} - n \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)(\bar{x} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)^{2}

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2}
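The chain of algebra above can be checked numerically: the identity holds for any data and any value of \mu. A short sketch (arbitrary numbers, not the officer's data):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(50, 2.5, size=20)  # any sample works; the identity is algebraic
mu = 48.0                         # any mu works too

n = len(x)
x_bar = x.mean()
s2 = x.var(ddof=1)

lhs = (n - 1) * s2
rhs = ((x - mu) ** 2).sum() - n * (x_bar - mu) ** 2
assert abs(lhs - rhs) < 1e-9
```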

Let’s divide both sides of the equation by \sigma^{2}.

\frac{(n-1)s^{2}}{\sigma^{2}} = \frac{1}{\sigma^{2}}(\sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2})

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - \frac{n}{\sigma^{2}} (\bar{x} - \mu)^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - (\frac{\bar{x} - \mu}{\sigma/\sqrt{n}})^{2}

The right-hand side now is the sum of squared standard normal distributions.

\frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2}, the sum of squares of (n-1) standard normal random variables.

As we learned in lesson 53, if there are n standard normal random variables, Z_{1}, Z_{2}, ..., Z_{n}, their sum of squares is a Chi-square distribution with n degrees of freedom.

Its probability density function is f(\chi)=\frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n}{2}-1)!} for \chi > 0 and 0 otherwise.

Since we have \frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n-1}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n-1}{2}-1)!}

Depending on the degrees of freedom, the distribution of \frac{(n-1)s^{2}}{\sigma^{2}} looks like this.

Smaller sample sizes imply lower degrees of freedom, and the distribution will be highly skewed, i.e., asymmetric.

Larger sample sizes, or higher degrees of freedom, pull the distribution toward symmetry.
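A small Monte Carlo experiment illustrates this. If \frac{(n-1)s^{2}}{\sigma^{2}} really follows a Chi-square distribution with n-1 degrees of freedom, its long-run mean should be n-1 and its variance 2(n-1). The numbers below are simulated, not the officer's data.

```python
import numpy as np

rng = np.random.default_rng(42)
n, sigma, reps = 20, 2.5, 100_000

# Draw many samples of size n, compute the statistic (n-1)s^2/sigma^2 for each
samples = rng.normal(50, sigma, size=(reps, n))
s2 = samples.var(axis=1, ddof=1)
stat = (n - 1) * s2 / sigma ** 2

# A Chi-square with k degrees of freedom has mean k and variance 2k
print(stat.mean(), stat.var())  # close to 19 and 38
```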


We can now apply the probability equation and define a 100(1-\alpha)\% confidence interval for the true variance \sigma^{2}.

P(\chi_{l,n-1} \le \frac{(n-1)s^{2}}{\sigma^{2}} \le \chi_{u,n-1}) = 1-\alpha

\alpha is between 0 and 1. You know that a 95% confidence interval means \alpha = 0.05, and a 99% confidence interval means \alpha = 0.01.

\chi_{l,n-1} and \chi_{u,n-1} are the lower and upper critical values from the Chi-square distribution with n-1 degrees of freedom.

The main difference here is that the Chi-square distribution is not symmetric. We should calculate both the lower limit and the upper limit that correspond to a certain level of probability.

Take, for example, a 95% confidence interval. We need \chi_{l,n-1} and \chi_{u,n-1}, the quantiles that yield a 2.5% probability in the right tail and 2.5% probability in the left tail.

P(\chi \le \chi_{l,n-1}) = 0.025

and

P(\chi > \chi_{u,n-1}) = 0.025

Going back to the probability equation
P(\chi_{l,n-1} \le \frac{(n-1)s^{2}}{\sigma^{2}} \le \chi_{u,n-1}) = 1-\alpha

With some rearrangement within the inequality, we get

P(\frac{(n-1)s^{2}}{\chi_{u,n-1}} \le \sigma^{2} \le \frac{(n-1)s^{2}}{\chi_{l,n-1}}) = 1-\alpha

The interval [\frac{(n-1)s^{2}}{\chi_{u,n-1}}, \frac{(n-1)s^{2}}{\chi_{l,n-1}}] is called the 100(1-\alpha)\% confidence interval of the population variance \sigma^{2}.

We can get the square roots of the confidence limits to get the confidence interval on the true standard deviation.

The interval [\sqrt{\frac{(n-1)s^{2}}{\chi_{u,n-1}}}, \sqrt{\frac{(n-1)s^{2}}{\chi_{l,n-1}}}] is called the 100(1-\alpha)\% confidence interval of the population standard deviation \sigma.
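These two intervals can be wrapped into a small helper. The function name and the use of scipy for the Chi-square quantiles are my choices for illustration, not part of the lesson.

```python
from scipy import stats

def variance_interval(s2, n, alpha=0.05):
    """100(1-alpha)% confidence interval for the population variance sigma^2."""
    chi_l = stats.chi2.ppf(alpha / 2, df=n - 1)      # lower critical value
    chi_u = stats.chi2.ppf(1 - alpha / 2, df=n - 1)  # upper critical value
    return (n - 1) * s2 / chi_u, (n - 1) * s2 / chi_l

# Officer Jones' numbers: n = 20, s^2 = 5.9 mph^2
lo, hi = variance_interval(5.9, 20)
sd_lo, sd_hi = lo ** 0.5, hi ** 0.5  # square roots give the interval for sigma
print(round(lo, 2), round(hi, 2), round(sd_lo, 2), round(sd_hi, 2))
```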

Let’s now go back to the data and construct the confidence interval on the variance and the standard deviation.

Officer Jones has a sample of 20 vehicles. n=20. The sample variance s^{2} is 5.9 mph^{2} and the sample standard deviation s is 2.43 mph.

Since n = 20, \frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with 19 degrees of freedom. The lower and upper critical values for the 95% confidence interval, \chi_{l,19} and \chi_{u,19}, are 8.90 and 32.85.

You must be wondering about the tediousness of computing these quantiles each time there is a sample.

That problem is solved for us. Like the t-table, the upper tail probabilities of the Chi-square distribution for various degrees of freedom are tabulated in the Chi-square table.

You can get them from any statistics textbook, or a simple internet search for “Chi-square table” will do. Here is an example.

The first column is the degrees of freedom. The subsequent columns are the computed values for P(\chi > \chi_{u}), the right tail probability.

For our example, say we are interested in the 95% confidence interval. We look under df = 19 and identify \chi^{2}_{0.975} = 8.90 as the lower limit that yields a probability of 2.5% in the left tail, and \chi^{2}_{0.025} = 32.85 as the upper limit that yields a probability of 2.5% in the right tail.
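If a printed table is not handy, any statistical library can reproduce these entries. A sketch with scipy (my choice of tool; small differences from the table values are rounding):

```python
from scipy import stats

# df = 19; the subscript on chi^2 is the upper-tail probability
chi_lower = stats.chi2.ppf(0.025, df=19)  # chi^2_{0.975}: 2.5% in the left tail
chi_upper = stats.chi2.ppf(0.975, df=19)  # chi^2_{0.025}: 2.5% in the right tail
print(round(chi_lower, 3), round(chi_upper, 3))  # about 8.907 and 32.852

# The upper-tail probability round-trips through the survival function
assert abs(stats.chi2.sf(chi_upper, df=19) - 0.025) < 1e-10
```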

Substituting these back into the confidence interval equation, we get

P(\frac{19*5.90}{32.85} \le \sigma^{2} \le \frac{19*5.90}{8.90}) = 0.95

The 95% confidence interval for the true variance is [3.41, 12.59].

If we take the square root of these limits, we get the 95% confidence interval for the true standard deviation.

P(1.85 \le \sigma \le 3.54) = 0.95

The 95% confidence interval for the true standard deviation is [1.85, 3.54].

So, to finally answer Officer Jones' question: at the 95% confidence level, the deviation in vehicle speed can be as large as 3.54 mph and as low as 1.85 mph. The interval itself is random. As you know by now, the real interpretation is:

If we have a lot of different samples, and if we compute the 95% confidence intervals for these samples using the sample variance s^{2} and the critical limits from the Chi-square distribution, in the long run, 95% of these intervals will contain the true value of \sigma^{2}.
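This long-run interpretation can be demonstrated by simulation. The sketch below assumes a hypothetical true variance (\sigma^{2} = 6.25, not the officer's unknown truth) and repeats the interval construction many times; roughly 95% of the intervals should cover it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sigma2, reps = 20, 6.25, 20_000  # hypothetical true variance

chi_l = stats.chi2.ppf(0.025, df=n - 1)
chi_u = stats.chi2.ppf(0.975, df=n - 1)

covered = 0
for _ in range(reps):
    x = rng.normal(50, sigma2 ** 0.5, size=n)
    s2 = x.var(ddof=1)
    covered += (n - 1) * s2 / chi_u <= sigma2 <= (n - 1) * s2 / chi_l

print(covered / reps)  # close to 0.95
```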

A few vehicles must have sneaked by in the short run.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 74 – Deriving confidence from t


What we know so far

The 100(1-\alpha)\% confidence interval for the true mean \mu is [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}]

If the sample size n is very large, we can substitute the sample standard deviation s in place of the unknown \sigma.

However, for small sample sizes, the sample standard deviation s is itself subject to error. In other words, it may be far from the true value of \sigma. Hence, we cannot assume that \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} will tend to a normal distribution, which was the basis for deriving the confidence intervals in the first place.

Last week’s mathematical excursion took us back in time, introduced us to “Student” and the roots of his famous contribution, the Student’s t-distribution.

“Student” derived the frequency distribution of \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} to be a t-distribution with (n-1) degrees of freedom. We paddled earnestly through a stream of functions to arrive at this.

f(t) = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}}

The probability of t within any limits is fully known if we know n, the sample size of the experiment. The function is symmetric, with an expected value E[T] = 0 and variance V[T] = \frac{n-1}{n-3}.

While the t-distribution resembles the standard normal distribution Z, it has heavier tails, i.e., it has more probability in the tails than the normal distribution. As the sample size increases (n \to \infty) the t-distribution approaches Z.
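Both facts can be checked numerically. The sketch below rewrites the factorials as gamma functions (x! = \Gamma(x+1)) and compares the lesson's f(t) against scipy's Student's t density with n-1 degrees of freedom; it also shows the heavier tails shrinking toward the normal as n grows.

```python
import math
from scipy import stats

def f_t(t, n):
    """The lesson's t density, with factorials written as gamma functions."""
    c = math.gamma(n / 2) / (math.sqrt(math.pi * (n - 1)) * math.gamma((n - 1) / 2))
    return c * (1 + t ** 2 / (n - 1)) ** (-n / 2)

# Matches scipy's Student's t with df = n - 1
for t in (-2.0, 0.0, 0.5, 3.0):
    assert abs(f_t(t, 10) - stats.t.pdf(t, df=9)) < 1e-12

assert abs(stats.t.var(df=9) - 9 / 7) < 1e-12  # V[T] = (n-1)/(n-3) with n = 10

# Heavier tails than Z, approaching Z as n grows
tail_t5 = stats.t.sf(2, df=4)     # small sample: more probability beyond 2
tail_t100 = stats.t.sf(2, df=99)  # large sample: nearly normal
tail_z = stats.norm.sf(2)
assert tail_t5 > tail_t100 > tail_z
```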


Can we derive the confidence interval from the t-distribution?

Let’s follow the same logic used to derive the confidence interval from the normal distribution. The only difference now will be that

\frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \sim t

Suppose we are interested in deriving the 95% confidence interval for the true mean \mu, we can use the probability rule

P(-t_{0.025,(n-1)} \le \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \le t_{0.025,(n-1)}) = 0.95

It is equivalent to saying that there is a 95% probability that the variable \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} is between -t_{0.025,(n-1)} and t_{0.025,(n-1)}, the value that leaves a probability of 2.5% in the upper tail of t.

t_{0.025,(n-1)} is like Z_{0.025}. While Z_{0.025} = 1.96 always, t_{0.025,(n-1)} will depend on the sample size n.

Notice I am using (n-1), the degrees of freedom in the subscript for t to denote the fact that the value will be different for a different sample size.

To generalize, we can define a 100(1-\alpha)\% confidence interval for the true mean \mu using the probability equation

P(-t_{\frac{\alpha}{2},(n-1)} \le \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \le t_{\frac{\alpha}{2},(n-1)}) = 1-\alpha

\alpha is between 0 and 1. For a 95% confidence interval, \alpha = 0.05, and for a 99% confidence interval, \alpha = 0.01. Like how the Z-critical value is denoted using Z_{\frac{\alpha}{2}}, the t-critical value can be denoted using t_{\frac{\alpha}{2},(n-1)}.

We can modify the inequality in the probability equation to arrive at the confidence interval.

P(-t_{\frac{\alpha}{2},(n-1)} \le \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \le t_{\frac{\alpha}{2},(n-1)}) = 1-\alpha

Multiplying throughout by \frac{s}{\sqrt{n}}, we get

P(-t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}} \le \bar{x}-\mu \le t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}) = 1-\alpha

Subtracting \bar{x} and multiplying by -1 throughout, we get

P(\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}) = 1-\alpha

This interval [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}] is called the 100(1-\alpha)\% confidence interval of the population mean.

As we discussed before in lesson 72, the interval itself is random since it is derived from \bar{x} and s. A different sample will have a different \bar{x} and s, and hence a different interval or range.


Solving Jenny’s problem

Let us develop the 95% confidence intervals of the mean water quality at the Rockaway beach. Jenny was using the data from the ROCKAWAY BEACH 95TH – 116TH. She identified 48 data points (n = 48) from 2005 to 2018 that exceeded the detection limit.

The sample mean \bar{x} is 7.9125 counts per 100 ml. The sample standard deviation s is 6.96 counts per 100 ml.

The 95% confidence interval of the true mean water quality (\mu) is

\bar{x} - t_{0.025,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.025,(n-1)} \frac{s}{\sqrt{n}}

How can we get the value for t_{0.025,(n-1)}, the critical value that leaves a probability of 2.5% in the upper tail of the t-distribution with (n-1) degrees of freedom?

.

.

.

You must have started integrating the function f(t) = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}} into its cumulative distribution function F(t).

Save the effort. These are calculated already and are available in a table, popularly known as the t-table. You can find it in any statistics textbook, or simply type “t-table” into any search engine. There may be slight differences in how the table is presented. Here is an example.

This table shows the right-sided t-distribution critical value t_{\frac{\alpha}{2},(n-1)}. Since the t-distribution is symmetric, the left tail critical values are -t_{\frac{\alpha}{2},(n-1)}. You must have noticed that the last row indicates the confidence level 100(1-\alpha)\%.

Take, for instance, a 95% confidence interval: \frac{\alpha}{2}=0.025. The upper tail probability p is 0.025. From the table, look in the sixth column under 0.025 and slide down to df = n-1. For instance, with a sample size of 10, the degrees of freedom are df = 9 and the t-critical value (t_{0.025,9}) will be 2.262; like this:

Since our sample size is 48, the degrees of freedom df = 47.

In the sixth column under upper tail probability 0.025, we should slide down to df = 47. Since there is no value for df = 47, we should interpolate from the values 2.021 (df = 40) and 2.009 (df = 50).

The t-critical value for a 95% confidence interval and df = 47 is 2.011. I got it from R. We will see how in an R lesson later.

This table also provides Z_{\frac{\alpha}{2}} at the end. See z^{*}=1.96 for a 95% confidence interval.

Let’s compute the 95% confidence interval for the mean water quality.

\bar{x} - 2.011\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + 2.011\frac{s}{\sqrt{n}}

7.9125 - 2.011\frac{6.96}{\sqrt{48}} \le \mu \le 7.9125 + 2.011\frac{6.96}{\sqrt{48}}

5.892 \le \mu \le 9.933
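The arithmetic above can be reproduced in a few lines; here scipy stands in for R to supply the t-critical value, and the helper name is mine:

```python
from scipy import stats

def t_interval(x_bar, s, n, alpha=0.05):
    """100(1-alpha)% t-based confidence interval for the population mean."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    half_width = t_crit * s / n ** 0.5
    return x_bar - half_width, x_bar + half_width

# Jenny's sample: n = 48, x_bar = 7.9125, s = 6.96 (counts per 100 ml)
lo, hi = t_interval(7.9125, 6.96, 48)
print(round(lo, 3), round(hi, 3))  # about 5.892 and 9.933
```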

Like in the case of the confidence interval derived from the normal distribution, if we have a lot of different samples, and if we compute the 95% confidence intervals for these samples using the sample mean (\bar{x}), sample standard deviation (s) and the t-critical from the t-distribution, in the long run, 95% of these intervals will contain the true value of \mu.

Here is how eight different 95% confidence intervals look relative to the truth. These eight intervals are constructed from samples at eight different locations. In the long run, 5% of such intervals will not contain the true mean.

I am also showing the confidence intervals derived from the normal distribution under the known-variance assumption (\sigma=9.95). They are shown in green.

Can you spot anything?

How do they compare?

What can we learn about the width of the intervals derived from the normal distribution (Z) and the t-distribution?

Is there anything that is related to the sample size?

Think about it until we come back with a lesson in R for confidence intervals.

There are other applications of the t-distribution that we will learn in due course of time.

Remember that you will cross a “t” whenever there is error.

I will end with these notes from Fisher in his paper “Student” written in 1939 in remembrance of W.S. Gosset.

“Five years, however, passed, without the writers in Biometrika, the journal in which he had published, showing any sign of appreciating the significance of his work. This weighty apathy must greatly have chilled his enthusiasm.”

“The fruition of his work was, therefore, greatly postponed by the lack of appreciation of others. It would not be too much to ascribe this to the increasing dissociation of theoretical statistics from the practical problems of scientific research.”

It is now 110 years since he published his famous work using a pseudo name “Student.” Suffice it to say that “Student” and his work will still be appreciated 110 years from now, and people will derive confidence from the “t.”


Lesson 73 – Learning from “Student”

On the fifteenth day of July 2018, Jenny and Joe discussed the confidence interval for the mean of the population.

The 100(1-\alpha)% confidence interval for the true mean \mu is [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}].

\bar{x} is the mean of a random sample of size n. The assumption is that the sample is drawn from a population with a true mean \mu and true standard deviation \sigma.

The end-points [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}] are called the lower and the upper confidence limits.

While developing the confidence interval of the mean water quality at the beach with a sample of 48, Jenny pointed out that the value for the true standard deviation, \sigma is not known.

Joe collected a much larger sample (all data points available for the Rockaway beach) and computed an estimate for the true standard deviation. Based on the principle of consistency, he suggested that this estimate is close to the truth, so can be used as \sigma.

Jenny rightly pointed out that we will not always have such a large sample. Most often, the data is limited.

Joe’s suggestion was to use sample standard deviation, s, i.e., the estimated standard deviation from the limited sample in place of \sigma. However, Jenny was concerned that this will introduce more error into the estimation of the intervals.

March 1908

This was a concern for W. S. Gosset (aka “Student”) in 1908. He points out that one way of dealing with this difficulty is to run the experiment many times, i.e., collect a large enough sample so that the standard deviation can be computed once and for all and used for subsequent similar experiments.

He further points out that there are numerous experiments that cannot easily be repeated often. In the situation where a large sample cannot be obtained, one has to rely on the results from the small sample.

The standard deviation of the population (\sigma) is not known a priori.

The confidence interval Joe and Jenny derived is based on the conjecture that the distribution of the sample mean (\bar{x}) tends to a normal distribution. \bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}}), and if the sample size is large enough, it would be reasonable to substitute sample standard deviation s in place of the population standard deviation \sigma. But it is not clear how “large” the sample size should be.

The sample standard deviation can be computed as s=\sqrt{\frac{1}{n-1}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}}. While s is a perfectly good estimate of \sigma, it is not equal to \sigma.

“Student” pointed out that when we substitute s for \sigma, we cannot just assume that \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} will tend to a normal distribution just like Z = \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0,1).

He derived the frequency distribution of \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} in his landmark paper “The Probable Error of a Mean.”

This distribution came to be known as the Student’s t-distribution.

In his paper, he did not use the notation t though. He referred to it as quantity z, obtained by dividing the distance between the mean of a sample and the mean of the population by the standard deviation of the sample.

He derived the distribution of the sample variance and the sample standard deviation, showed that there is no dependence between s and \bar{x} and used this property to derive the joint distribution of \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}.

Today, we will go through the process of deriving the probability distributions for the sample variance s^{2}, sample standard deviation s and the quantity t.

t = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}

It is important that we know these distributions. They will be a recurring phenomenon from now on and we will be using them in many applications.

I am presenting these derivations using standard techniques. There may be simpler ways to derive them, but I found this step-by-step thought and derivation process enriching.

During this phase, I will refer back to “Student’s” paper several times. I will also use the explanations given by R. A. Fisher in his papers on “Student.”

You can follow along with these steps using a pen and paper, or you can just focus on the thought process and skip the derivations; either way, it is fun to learn from “Student.” Trust me.


t = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}

To know the distribution of t, we should know the distributions of \bar{x} and s.

We already know that \bar{x} tends to a normal distribution; \bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}}), but what is the limiting distribution of s, i.e., what is the probability distribution function f(s) of the sample standard deviation?

To know this, we should first know the limiting distribution of the sample variance s^{2}, from which we can derive the distribution of s.

What is the frequency distribution of the sample variance?

Expressing variance as the sum of squares of normal random variables.

Let’s take the equation of the sample variance s^{2} and see if there is a pattern in it.

s^{2} = \frac{1}{n-1} \sum(x_{i}-\bar{x})^{2}

Move the n-1 over to the left-hand side and do some algebra.

(n-1)s^{2} = \sum(x_{i}-\bar{x})^{2}

(n-1)s^{2} = \sum(x_{i} - \mu -\bar{x} + \mu)^{2}

(n-1)s^{2} = \sum((x_{i} - \mu) -(\bar{x} - \mu))^{2}

(n-1)s^{2} = \sum[(x_{i} - \mu)^{2} + (\bar{x} - \mu)^{2} -2(x_{i} - \mu)(\bar{x} - \mu)]

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + \sum (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)\sum(x_{i} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(\sum x_{i} - \sum \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(n\bar{x} - n \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)(\bar{x} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)^{2}

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2}

Let’s divide both sides of the equation by \sigma^{2}.

\frac{(n-1)s^{2}}{\sigma^{2}} = \frac{1}{\sigma^{2}}(\sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2})

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - \frac{n}{\sigma^{2}} (\bar{x} - \mu)^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - (\frac{\bar{x} - \mu}{\sigma/\sqrt{n}})^{2}

The right-hand side now looks like the sum of squared standard normal distributions.

\frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2}

Sum of squares of (n – 1) standard normal random variables.

Does that ring a bell? Sum of squares is the language of the Chi-square distribution. We learned this in lesson 53.

If there are n standard normal random variables, Z_{1}, Z_{2}, ..., Z_{n}, their sum of squares is a Chi-square distribution with n degrees of freedom. Its probability density function is f(\chi)=\frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n}{2}-1)!} for \chi > 0 and 0 otherwise.

Since we have \frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

\frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{n-1} with a probability distribution function

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n-1}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n-1}{2}-1)!}

The frequency distribution of the sample variance.

It turns out that, with some modification, this equation is the frequency distribution f(s^{2}) of the sample variance.

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n-1}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n-1}{2}-1)!}

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}(\frac{(n-1)s^{2}}{2\sigma^{2}})^{\frac{n-3}{2}}}{(\frac{n-3}{2})!} e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(\frac{n-1}{\sigma^{2}})^{\frac{n-1}{2}}\frac{1}{\frac{n-1}{\sigma^{2}}}\frac{1}{2^{\frac{n-3}{2}}}(s^{2})^{\frac{n-3}{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{1}{\frac{n-1}{\sigma^{2}}}\frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{1}{\frac{n-1}{\sigma^{2}}}[\frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}]

The above equation can be viewed as f(aX) = \frac{1}{a}f(X) where X = s^{2} and a = \frac{n-1}{\sigma^{2}}.

These few steps will clarify this further.

Let Y = aX

P(Y \le y) = P(aX \le y)

P(Y \le y) = P(X \le \frac{1}{a}y)

F_{Y}(y) = F_{X}(\frac{1}{a}y)

\frac{d}{dy}(F(y)) = \frac{d}{dy}F_{X}(\frac{1}{a}y)

Applying the fundamental theorem of calculus and chain rule together, we get,

f(y) = \frac{1}{a}f(\frac{1}{a}y)

f(aX) = \frac{1}{a}f(X)

Hence, the frequency distribution of s^{2} is

f(s^{2}) = \frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}
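As a check on the algebra, the density above can be compared with a scaled Chi-square: since \frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{n-1}, the variable s^{2} is a Chi-square scaled by \frac{\sigma^{2}}{n-1}. The sketch below writes the factorial as a gamma function (x! = \Gamma(x+1)) and uses scipy's scale parameter; the numbers chosen are arbitrary.

```python
import math
from scipy import stats

def f_s2(v, n, sigma):
    """The lesson's density of the sample variance, evaluated at v = s^2."""
    s = math.sqrt(v)
    const = (0.5 * (n - 1) ** ((n - 1) / 2)
             / (math.gamma((n - 1) / 2) * 2 ** ((n - 3) / 2)))
    return const * s ** (n - 3) / sigma ** (n - 1) * math.exp(-0.5 * (n - 1) * v / sigma ** 2)

# s^2 ~ chi-square(n-1) scaled by sigma^2/(n-1); both densities must agree
n, sigma = 20, 2.5
for v in (2.0, 5.9, 12.0):
    assert abs(f_s2(v, n, sigma)
               - stats.chi2.pdf(v, df=n - 1, scale=sigma ** 2 / (n - 1))) < 1e-12
```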

From f(s^{2}), we can derive the probability distribution of s, i.e., f(s).

WHAT IS THE FREQUENCY DISTRIBUTION OF THE SAMPLE STANDARD DEVIATION?

Here, I refer you to this elegant approach by “Student.”

“The distribution of s may be found from this (s^{2}), since the frequency of s is equal to that of s^{2} and all that we must do is to compress the base line suitably.”

Let f_{1} = f(s^{2}) be the distribution of s^{2}.

Let f_{2} = f(s) be the distribution of s.

Since the frequency of s^{2} is equal to that of s, we can assume,

f_{1}ds^{2} = f_{2}ds

In other words, the probability of finding s^{2} in an infinitesimal interval ds^{2} is equal to the probability of finding s in an infinitesimal interval ds.

We can reduce this using substitution as

2sf_{1}ds = f_{2}ds

or

f_{2} = 2sf_{1}

The frequency distribution of s is equal to 2s multiplied by the frequency distribution of s^{2}.

Hence, the frequency distribution of s is

f(s) = 2s\frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(s) = \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-2}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

WHAT IS THE FREQUENCY DISTRIBUTION OF t?

We are now ready to derive the frequency distribution of t.

Some of the following explanation can be found in R. A. Fisher’s 1939 paper. I broke it down step by step for our classroom.

t = \frac{\bar{x}-\mu}{s/\sqrt{n}}

\bar{x} - \mu = \frac{st}{\sqrt{n}}

For a given value of s, d\bar{x} = \frac{s}{\sqrt{n}}dt

The frequency distribution of \bar{x} is

\frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{n}{2\sigma^{2}}(\bar{x}-\mu)^{2}}

Remember, \bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}}).

Substituting \bar{x} - \mu = \frac{st}{\sqrt{n}}

The distribution becomes

\frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{n}{2\sigma^{2}}(st/\sqrt{n})^{2}}

or

\frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}

For a given value of s, the probability that \bar{x} will be in d\bar{x} is

df = \frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}d\bar{x}

Substituting d\bar{x} = \frac{s}{\sqrt{n}}dt, we can get

df = \frac{\sqrt{n}}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}} \frac{s}{\sqrt{n}}dt

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}} s dt

Fisher points out that for all values of s, we can substitute the frequency distribution of s and integrate it over the interval 0 and \infty. This can be done because \bar{x} and s are independent.

So the joint distribution becomes

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma} s dt e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}} \int_{0}^{\infty}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-2}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

In the next few steps, I will rearrange some terms to convert the integral into a recognizable form.

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma} dt \int_{0}^{\infty}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}s \frac{s^{n-2}}{\sigma^{n-1}}e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}} \frac{1}{\sigma^{n-1}}dt \int_{0}^{\infty}s^{n-1} e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}\frac{(n-1)^{\frac{n}{2}}}{(n-1)^{1/2}}\frac{1}{2^{\frac{n-3}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{2\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{1}{2^{\frac{n-3}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{1}{2^{\frac{n-2}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-\frac{t^{2}}{2\sigma^{2}}s^{2}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}ds

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-\frac{n -1 + t^{2}}{2\sigma^{2}}s^{2}}ds

Hang in there. We will need some more concentration!

Let k = \frac{n - 1 + t^{2}}{2\sigma^{2}}

The equation becomes

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-ks^{2}}ds

Some more substitutions.

Let p = ks^{2}

Then dp = 2ksds

The equation can be reduced as

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \int_{0}^{\infty}s^{n-1} e^{-p}\frac{1}{2ks}dp

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \int_{0}^{\infty}\frac{1}{2k}s^{n-2} e^{-p}dp

Since s = \frac{p^{1/2}}{k^{1/2}}

The equation becomes,

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \frac{1}{2k^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

Replacing k = \frac{n - 1 + t^{2}}{2\sigma^{2}}

The equation becomes

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \frac{1}{2(\frac{n - 1 + t^{2}}{2\sigma^{2}})^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

Some more reduction …

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \frac{(2\sigma^{2})^{n/2}}{2(n - 1 + t^{2})^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

df = \frac{1}{\sqrt{\pi(n-1)}}\frac{1}{\sigma^{n}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}}\frac{2}{2^{\frac{n}{2}}} dt \frac{(2^{n/2}\sigma^{n})}{2(n - 1 + t^{2})^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

df = \frac{1}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n}{2}} dt \frac{1}{(n - 1 + t^{2})^{n/2}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

df = \frac{1}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} dt \frac{1}{(\frac{n-1+t^{2}}{n-1})^{\frac{n}{2}}}\int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

df = \frac{1}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} dt \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}} \int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp

The integral you see on the right is a Gamma function that is equal to (\frac{n-2}{2})! for positive integers.
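That is, \int_{0}^{\infty}p^{\frac{n-2}{2}} e^{-p}dp = \Gamma(\frac{n}{2}) = (\frac{n-2}{2})!. Here is a quick numerical check of this (a Python sketch for verification only; the choice n = 20 is arbitrary):

```python
import math

def gamma_integral(a, upper=60.0, steps=300000):
    # midpoint-rule integration of p^a * e^(-p) from 0 to "infinity"
    h = upper / steps
    return sum(((i + 0.5) * h) ** a * math.exp(-(i + 0.5) * h) * h for i in range(steps))

n = 20
# integral with a = (n-2)/2 against Gamma(n/2) = ((n-2)/2)! = 9!
print(gamma_integral((n - 2) / 2), math.gamma(n / 2))
```

The two printed values agree to several decimal places.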

There we go, the t-distribution emerges

df = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}} dt

The probability distribution of t is

f(t) = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}}


This is the t-distribution with (n-1) degrees of freedom. As you can see, the function contains only n as a parameter.

The probability of t within any limits is fully known if we know n, the sample size of the experiment.
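We can also verify numerically that the derived density matches the usual Gamma-function form of the t-distribution with (n-1) degrees of freedom, and that it integrates to 1. A quick sketch in Python (the sample size n = 20 mirrors officer Jones' data but is otherwise arbitrary):

```python
import math

def f_t(t, n):
    # density derived above: ((n-2)/2)! = Gamma(n/2), ((n-3)/2)! = Gamma((n-1)/2)
    c = math.gamma(n / 2) / (math.sqrt(math.pi * (n - 1)) * math.gamma((n - 1) / 2))
    return c * (1 + t * t / (n - 1)) ** (-n / 2)

def t_pdf(t, df):
    # textbook t density with df degrees of freedom
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

n = 20
# the two forms agree when df = n - 1
print(f_t(1.3, n), t_pdf(1.3, n - 1))

# and the density integrates to approximately 1
area = sum(f_t(-10 + i * 0.001, n) * 0.001 for i in range(20000))
print(round(area, 3))
```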

“Student” also derived the moments of this new distribution as

E[T] = 0

V[T] = \frac{n-1}{n-3}

The function is symmetric and resembles the standard normal distribution Z.

The t-distribution has heavier tails, i.e. it has more probability in the tails than the normal distribution. As the sample size increases (i.e., as n \to \infty) the t-distribution approaches Z.

You can look at the equation and check out how it converges to Z in the limit when n \to \infty.
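Here is a sketch of that limit. As n \to \infty,

(1+\frac{t^{2}}{n-1})^{-\frac{n}{2}} \to e^{-\frac{t^{2}}{2}}

since (1+\frac{x}{m})^{m} \to e^{x}. By Stirling's approximation, the constant term

\frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}(\frac{n-3}{2})!} \to \frac{1}{\sqrt{2\pi}}

Putting the two together, f(t) \to \frac{1}{\sqrt{2\pi}}e^{-\frac{t^{2}}{2}}, the standard normal density.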

These points are illustrated in this animation.

I am showing the standard Z in red. The grey curves are the t-distributions for different values of n. For small sample sizes, the t-distribution has fatter tails; as the sample size increases, there is little difference between T and Z.

😫😫😫

I am pretty sure you do not have the energy to go any further this week. I don’t either.

So here is where we stand.

f(s^{2}) = \frac{1}{2}\frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-3}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

f(s) = \frac{1}{(\frac{n-3}{2})!}(n-1)^{\frac{n-1}{2}}\frac{1}{2^{\frac{n-3}{2}}}\frac{s^{n-2}}{\sigma^{n-1}}e^{-\frac{1}{2}*\frac{(n-1)s^{2}}{\sigma^{2}}}

and finally

The probability distribution of t is

f(t) = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}}

See you next week.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 72 – Jenny’s confidence, on the average

Meet Jenny. Jenny is bright and intelligent and is known as “the problem solver” among her friends. She usually goes unnoticed in the crowd due to her calm and composed nature, but by god, she is assertive when it is most needed. Her analytical neurons are razor sharp and you cannot mumbo-jumbo with her. She loves science fiction, history, and occasional Woody Allen. Oh, and she likes swimming, surfing, summer, and beaches.

Her second summer Rockaway beach trip is coming up and she is excited. With all the surfing kit packed, she pulled up the NYC Beach Water Quality website to monitor the status. “The Enterococci Bacteria Count is within limit!” she thought.

She got busy with other things, but the thought of beach samples bothered her. “Where do they take these samples from? They show Rockaway beach, but does the sample represent the whole beach? How many samples do they take? If they use one sample, how do we know what the truth is? I have been to this place several times, I wonder what the true water quality of the beach is?” These questions kept her awake. I told you she is a problem solver.

The next morning she met Joe to discuss the problem. Friends of this classroom know who Joe is. He is now the resident expert on statistical topics.


So, you want to know the true water quality from the sample data. Why don’t you develop the confidence interval of the mean water quality? This will at least give you a range, an interval where the true water quality will be.

 

How can we do that using the sample? Or, maybe I should ask, can you explain what is an interval and how to derive it.

 

Okay. Let’s get some simple/sample data first. NYC Open Data should definitely have data on beaches. Here, they have a link for DOHMH Beach Water Quality Data. The description says that this is the data of water quality sample results collected by the Department of Health and Mental Hygiene from all New York City Beaches. Amazing! Which Rockaway beach are you going to?

 

I usually go to the one on the 95th Street and stay north.

 

We can take the data from ROCKAWAY BEACH 95TH – 116TH then. Let me show you how to write a code in R to extract this subset of the data.

 

 

Hey, I can do that with ease.

 

I forgot to tell you that Jenny is a member of the local girls who code club. She pulls up her Macbook and types a few lines. Meanwhile, Joe does some coding too on the same data.

There are samples that had a result below the detection limit. If I remove those from the ROCKAWAY BEACH 95TH – 116TH data, we are left with 48 data points collected at various times from 2005 to 2018. Look.

 

Wonderful. I am guessing that the blue color triangle in the histogram is the sample mean (\bar{x}).

 

Yes. This is an estimate (good guess) for the average Enterococci Bacteria count for this part of the beach based on 48 samples over several years.

We can describe the uncertainty in this estimate using the confidence intervals. The key link is to realize that this estimate, i.e., the sample mean (\bar{x}), is a random variable. For instance, if we were to take another data sample from different times of the year or over different parts of the beach, we would get a slightly different sample mean. The value of the estimate changes with the change of sample, so we can think of the estimate as a range of values or a probability distribution.

 

Yes, that makes perfect sense, but what probability distribution does it follow?

 

The sample mean is given by the equation \bar{x} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}.
x_{i}s are independent and identically distributed random samples. Look at the equation carefully, it is the summation of random variables. Convolution. We learned in Lesson 48 that for a large enough sample size, this summation will converge to the normal distribution — Central Limit Theorem. As the samples grow (n becomes large), convolution or function multiplications yield a smooth center heavy and thin-tailed bell function — the normal density.
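This convergence is easy to see in a simulation. Below is a minimal sketch in Python, using an arbitrary skewed population (an exponential with mean 8, a hypothetical stand-in for the bacteria counts): means of larger samples cluster tightly and symmetrically around the true mean.

```python
import random
import statistics

random.seed(1)

# hypothetical skewed population: exponential with true mean 8
def sample_mean(n):
    return statistics.fmean(random.expovariate(1 / 8) for _ in range(n))

for n in (5, 50, 500):
    means = [sample_mean(n) for _ in range(2000)]
    # the sample means center on 8, and their spread shrinks like sigma/sqrt(n)
    print(n, round(statistics.fmean(means), 2), round(statistics.stdev(means), 2))
```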

 

Ah, I see. So it is reasonable to assume a normal distribution for the sample mean \bar{x}.

 

 

Yes, \bar{x} follows a normal distribution with an expected value E[\bar{x}] and variance V[\bar{x}].

 

E[\bar{x}]=\mu. The sample mean is an unbiased estimate of the true mean, so the expected value of the sample mean is equal to the truth. This is in Lesson 67.

The variance of the sample mean V[\bar{x}] =\frac{\sigma^{2}}{n}. We derived this in Lesson 68. Variance tells us how widely the estimate is distributed around the center of the distribution.

The standard deviation of the sample mean, or the standard error of the estimate is \frac{\sigma}{\sqrt{n}}.

 

So, \bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}})

 

Jenny looks at the equation for a bit. She takes a piece of paper, draws on it and instantly types a few lines of code.

 

Here, I am showing this visually.

 

\bar{x} is a normal distribution. It is centered on \mu with a standard deviation of \frac{\sigma}{\sqrt{n}}. One standard deviation range is \mu \pm \frac{\sigma}{\sqrt{n}} , two standard deviations range is \mu \pm 2\frac{\sigma}{\sqrt{n}} and three standard deviations range is \mu \pm 3\frac{\sigma}{\sqrt{n}} .

 

Yup. The standard normal way of saying this is Z = \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0,1).

 

 

The Z-score!

 

 

Yes, let me ask you something. What is the area under the normal density curve between -1.96 and 1.96?

Looking up from the standard normal tables, we get P(Z \le -1.96) = 0.025, which means the area on the right side of the tail P(Z \ge 1.96) = 0.025. The area between -1.96 and 1.96 is 0.95. 95%.
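Instead of the standard normal table, these areas can be checked with Python's standard library (statistics.NormalDist, available from Python 3.8):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1
print(round(Z.cdf(-1.96), 3))                # P(Z <= -1.96) ≈ 0.025
print(round(Z.cdf(1.96) - Z.cdf(-1.96), 3))  # area between -1.96 and 1.96 ≈ 0.95
```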

 

There is a 95% probability that the standard normal variable \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} is between -1.96 and 1.96.

P(-1.96 \le \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} \le 1.96) = 0.95

I will modify the inequality in this equation.

Multiplying throughout by \frac{\sigma}{\sqrt{n}}, we get P(-1.96 \frac{\sigma}{\sqrt{n}} \le \bar{x}-\mu \le 1.96 \frac{\sigma}{\sqrt{n}}) = 0.95

Subtracting \bar{x} and multiplying by -1 throughout, we get

P(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}) = 0.95

We derived that the probability of the true population mean \mu lying between two end-points \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} and \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}} is 0.95.

This interval [\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}] is called the 95% confidence interval of the population mean. The interval itself is random since it is derived from \bar{x}. As we discussed before, a different sample will have a different \bar{x} and hence a different interval or range.

There is a 95% probability that this random interval [\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}] contains the true value of \mu.

 

Neat. So to generalize this to any confidence interval, we can replace 1.96 with a Z-critical value.

 

Yes. For instance, if we want a 99% confidence interval, we should find the Z-critical value that gives 99% area under the normal density curve.

 

 

That would be 2.58. So the 99% confidence interval for true mean \mu is \bar{x} - 2.58\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 2.58 \frac{\sigma}{\sqrt{n}}.

Usually, this % confidence interval is described using a confidence coefficient 1-\alpha where \alpha is between 0 and 1. For a 95% confidence interval, \alpha = 0.05, and for a 99% confidence interval, \alpha = 0.01. The Z-critical value is denoted using Z_{\frac{\alpha}{2}}.

 

Yes yes. For 95% confidence interval, Z_{\frac{\alpha}{2}} = Z_{0.025} = 1.96.

In summary, we can define a 100(1-\alpha)% confidence interval for the true mean \mu as

[\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}]

 

Correct. \bar{x} is the mean of a random sample of size n. The assumption is that the sample is drawn from a population with a true mean \mu and true standard deviation \sigma. The end-points [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}] are called the lower and upper confidence limits.
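This summary translates directly into a small function. Here is a sketch in Python (the z-critical value comes from the standard library's NormalDist; the example numbers below are illustrative):

```python
from statistics import NormalDist
from math import sqrt

def mean_confidence_interval(xbar, sigma, n, alpha=0.05):
    # 100(1-alpha)% confidence interval for the true mean,
    # assuming the true standard deviation sigma is known
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{alpha/2}
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# example with xbar = 7.9125, sigma = 9.95, n = 48
lo, hi = mean_confidence_interval(xbar=7.9125, sigma=9.95, n=48)
print(round(lo, 3), round(hi, 3))
```

A 99% interval (alpha = 0.01) would be wider, since the z-critical value grows from 1.96 to 2.58.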


 

Let us develop the confidence intervals of the mean water quality. The sample I have has n = 48 data points. The sample mean is \bar{x}=7.9125 counts per 100 ml.

.
.
.
Wait, we don’t know the value of \sigma, the true standard deviation. There is still an unknown in the equation.

 

Assume it is 9.95 counts per 100 ml.

 

 

Where did that come from?

 

 

Take my word for now and get the confidence interval.

 

Jenny reluctantly does some calculations on a piece of paper.

The 95% confidence interval for the mean water quality is

\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}.

7.9125 - 1.96\frac{9.95}{\sqrt{48}} \le \mu \le 7.9125 + 1.96\frac{9.95}{\sqrt{48}}.

5.098 \le \mu \le 10.727

 

There is a 95% probability that the true mean water quality will be between 5.098 and 10.727. Now tell me where the 9.95 came from.

 

Well, strictly speaking, your statement should have been, “there is a 95% probability of selecting a sample for which the confidence interval will contain the true value of \mu.”

 

 

 

Let me explain. 5.098 \le \mu \le 10.727 can be true or can be false depending on the sample we obtained. \bar{x} is a random variable, a probability distribution whose mean is \mu. Our sample mean represents a random draw from this distribution. So, if we got a sample whose mean is close to the truth, the interval [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}] will contain the truth. If we got a sample whose mean is somewhat far away from the truth, its confidence interval may not contain the truth. This graphic should make it more clear. The green interval does not contain the true value. The purple interval does. Both these sample means (\bar{x}) are equally likely; it depends on what sample data we get.

Hmm. So, we should think of this phenomenon as a long-run relative frequency outcome. If we have a lot of different samples, and compute the 95% confidence intervals for these samples, in the long-run, 95% of these intervals will contain the true value of \mu.

 

Exactly. The 95% confidence level is what would happen if a large number of random intervals were constructed; not for any particular interval.

So while you were coding to get the subset of data from the ROCKAWAY BEACH 95TH – 116TH, I downloaded all the data that had ROCKAWAY BEACH in the file. There are eight locations along the beach where these samples are taken. Assuming that these are eight different samples, I developed the 95% confidence intervals for each of these samples. Here is how they look relative to the truth. One of them does not contain the true \mu. Like this, in the long-run, 5% of the samples will not contain the true mean for 95% confidence intervals.
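This long-run frequency interpretation can be checked with a quick simulation (a Python sketch; the population parameters here are arbitrary placeholders): generate many samples, build a 95% interval from each, and count how often the interval captures the true mean.

```python
import random
from math import sqrt

random.seed(7)
mu, sigma, n = 10.0, 4.0, 48   # arbitrary "true" values for the simulation
trials = 5000

covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    lo = xbar - 1.96 * sigma / sqrt(n)
    hi = xbar + 1.96 * sigma / sqrt(n)
    if lo <= mu <= hi:
        covered += 1

print(covered / trials)  # close to 0.95 in the long run
```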

 

How did you get the true value for \mu?

 

 

Based on the principle of consistency.

\displaystyle{\lim_{n\to\infty} P(|T_{n}(\theta)-\theta|>\epsilon)} \to 0.

As n approaches infinity, the sample estimate approaches the true parameter. I took data from the eight beach locations and computed the overall mean and overall standard deviation. While they are not exactly the true values, based on the idea of consistency, we can assume that they are.

\mu = 8.97 and \sigma = 9.95 counts per 100 ml.

 

There you go. I see why you insisted on using \sigma = 9.95.

 

 

You are welcome!

 

 

But we will not always have such a large sample. Most often, the data is limited. How can we know what the true standard deviation is?

 

In that case, you can use the sample standard deviation. For your sample data, it was 6.96 counts per 100 ml, I think.

 

 

Hmm, but that is very different from the true value. Aren’t we inducing more error into the estimation of the intervals?

 

 

Perhaps. Hey, do you want to take a break? Can I buy you a drink? Is Guinness okay?


Lesson 71 – How confident are you?

Switch on the television or open your favorite news website; every few weeks you will find people’s opinion about something. Here’s one that came out a few days back. Our best days are ahead – so they say.

If you are a regular reader here, you would have already asked about the survey method and sample size.

Results for this Gallup poll are based on telephone interviews conducted June 18-24, 2018, with a random sample of 1,505 adults, aged 18 and older, living in all 50 U.S. states and the District of Columbia. For results based on the total sample of national adults, the margin of sampling error is ±3 percentage points at the 95% confidence level.

The margin of sampling error is ±3% points at the 95% confidence level.

“Margin of sampling error,” “95% confidence level” – Have you ever asked yourself what these are?

In another news, the state of NJ passed a new “multi-millionaires tax,” Airbnb tax and Uber/Lyft tax among others. I hope the residents of NJ were asked about some of these increases, and this is what they wanted.

If they were to be asked, how do we know how many people to survey?

If a subset of people were to represent the population of NJ, how do we know that the resulting opinion is what the entire population actually wants?

Jill is having fun this summer. He is one of the volunteers at the BIG CITY FISHING SUNDAYS on Pier 84 . He is busy collecting water samples from the Hudson River. This week he computed the mean dissolved oxygen level as 4.9 mg/L. If the acceptable water quality level is a mean dissolved oxygen of 5 mg/L, how reliable is the estimate that Jill got? Can it be deemed acceptable?

In another news, Wilfred Quality Jr. is still thinking about his grandfather’s question: “how confident are you that the true value lies within two standard deviations?”

Jill’s boss is a variance control freak. Can we give him an upper bound on how much the dissolved oxygen in the Hudson River can vary? He usually likes to be 99% confident.

In another news, St Patrick’s Day should remind you of all those pints of Guinness, and “Student.”  No, I am not under the influence.


All these questions can be answered with a little knowledge about how to describe uncertainty in estimates.

To take you back to Lesson 62, the objective of inference is to use a ‘sample’ to compute a number which represents a good guess for the true value of the parameter. The true population parameter is unknown, and what we estimate from the sample will enable us to obtain the closest answer in some sense.

The cliched way of saying this is that the mean (\bar{x}) and variance (s^{2}) of the sample data are good guesses (estimates or estimators) of the mean (\mu) and variance (\sigma^{2}) of the population.

A different sample will yield a different estimate for the parameter. So we have to think of the estimate as a range of values or a probability distribution instead of a single value.

For any estimator \hat{\theta}, we can compute the expected value E[\hat{\theta}] as a measure of the central tendency, and V[\hat{\theta}] as a measure of the spread of the distribution giving us a range of values. Like this.

This range is called an interval. Naturally, the interval will be wider for data that has more variability. We can be confident that the truth may be in this interval if we have good representative samples.

How confident we are is a probability statement. For example, if there is a 95% probability that the interval contains the true value, it is equivalent to saying that this interval is the 95% confidence interval.

The interval will have a stated probability of containing the truth, the degree of plausibility specified by the confidence level → 95% interval or 99% interval.

Over the next several lessons, we will dive into the core concepts of confidence intervals, how to construct them for different population parameters and use them for designing experiments.

Meanwhile, do you know about Guinness and Student? Guinness, I am sure, but are you confident that you know Student?


Lesson 70 – The quest for truth: Learning estimators in R, part II

T-2 days

Wilfred Jr. was typing away on his Macintosh. Keneth is writing something in his notebook.

“Do you have an estimate yet?” asked Wilfred. “In a minute.”

Wilfred Jr. and Keneth were busy preparing for the meeting in a couple of days. They are systematically analyzing the data they collected over the past few weeks. Their workplace was well lit with a strong beam of sunlight through the window. They both looked out, it was deceivingly pleasant. It is one of those newsworthy summer days. The local weatherman has already given it a rank 2 in his “new normals” show.

The wall behind the big screen computer is a full corkboard that has sticky notes and pins.

“We have an answer for the first two questions,” said Ken. “Last week, we checked this and found that the true proportion of dummies is somewhere around 2% – 0.02024.”

“There you go. Look at this.” Wilfred hit Command + Enter on his keyboard. “I like this version better,” he said.

Keneth looked at the code and wondered if Willy would explain to him what was going on.

“So, we are showing how the estimate for the proportion of dummies  \hat{p}=\frac{r}{n} changes as we add more data from the stores. The more the data (i.e., the stores we visit), the closer the estimate goes to the truth. After a while, the change in the value becomes small and the estimate stabilizes and converges to the truth. It is the consistency criterion of the estimates,” said Wilfred as he points out to the blue line on the graph.

“What are the red dots and the boxplot that shows up at the end?” asked Keneth.

“Good question. Each red dot that appears vertically at the end is the estimated proportion for a given store. So, if a store has 2% dummies, the estimate for that store is 0.02, and a red dot appears at 0.02. Like this, we have 1000 different sample estimates for the proportion of dummies. The boxplot at the end is showing this distribution, the distribution of the sample proportions. Did you notice anything?” asked Wilfred.

Ken looked at the plot again, this time with more attention.

“The blue line meets the big red circle. I can see that the big red circle is the mean of the red dots, E[\hat{p}]. The boxplot also looks asymmetric. The expected value of the sample proportions and the long-run estimate are the same.”

E[\hat{p}]=p=0.02024

Wilfred nods in acceptance. “We discussed this last week. Since the estimate \frac{r}{n} is an unbiased estimate of the true proportion, the mean of the sample distribution will be the true proportion. The long-run converged estimate is also the same value.”
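This unbiasedness is easy to verify by simulation (a Python sketch; the pack size and dummy probability mirror the story but are assumptions for illustration):

```python
import random

random.seed(11)
p_true, ncandies, nstores = 0.02024, 100, 1000  # assumed values for the sketch

# one estimate p-hat = r/n per store
estimates = [sum(random.random() < p_true for _ in range(ncandies)) / ncandies
             for _ in range(nstores)]

# the mean of the store-level estimates is close to p_true
print(sum(estimates) / nstores)
```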

“Can you share the code with me? I want to take some time and learn how to make these visuals.” Keneth is imagining how cool he would look in front of his friends when he presents his summer project.

“Sure. Here it is. I wrote comments for each line. It should be self-explanatory.”

# Graphic analysis for proportion #

# For each store, compute the sample proportion #
# The expected value of these estimates is the true parameter # 
# With each store visited, we can recompute the sample proportion #
# This will converge and stabilize at the true parameter value #

# Reading the candy data file #
candy_data = read.table("candy_samples.txt",header=F)

nstores = ncol(candy_data) # number of stores 
ncandies = nrow(candy_data) # number of candies in a pack

candies = NULL # an empty vector that will be filled with actual candies data from each store
dummies = NULL # an empty vector that will be filled with dummy candies data from each store

sample_proportion = NULL # as the sample increases, we re-compute the sample proportion of dummy candies

sample_p = matrix(NA,nrow=nstores,ncol=1) # saving the estimate from each store. The expected value of this vector = truth

plot(0,0,col="white",xlim=c(0,1300),ylim=c(0,0.08),xlab="Stores",ylab="Proportion of Dummies",font=2,font.lab=2)

for (i in 1:nstores)
{
# sample pack of candies and the index of dummies in the pack
candy_pack = candy_data[,i]
dummy_index = which(candy_pack < 1.5)

# separating candies from the dummies in each pack 
sample_dummies = candy_pack[dummy_index]

if(length(dummy_index)>0){sample_candies = candy_pack[-dummy_index]}else{sample_candies = candy_pack}

# recording the weights on candies and dummies -- this vector will grow as we visit more stores, i.e., collect more samples 
dummies = c(dummies,sample_dummies)
candies = c(candies,sample_candies)

# computing the proportion of dummies -- this vector will grow .. each incremental value is a better estimate with greater sample size.
p = length(dummies)/(ncandies*i)
sample_proportion = c(sample_proportion,p)

# storing the sample weight for the store in a matrix 
sample_p[i,1] = length(dummy_index)/ncandies

# plot showing the trace of the sample proportion -- converges to truth
points(i,sample_proportion[i],cex=0.075,col="blue")
points(1000,sample_p[i],cex=0.1,col="red")
Sys.sleep(0.05)
#print(i)
}

Sys.sleep(1)
points(1000,mean(sample_p),pch=19,col="red",cex=2)
Sys.sleep(1)
boxplot(sample_p,add=T,at=1050,boxwex=25,range=0)
Sys.sleep(1)
text(1200,mean(sample_p),expression(E(hat(p))==0.02),col="red",font=2)

What are the population characteristics?

Their attention now shifted to the final question. When the day started, they discussed some ideas and decided they will present the analysis that investigated for the true mean and the true variance of the population.

If the weight of all the Quality Candies is the population, then, it can be represented using the mean (\mu) and the variance (\sigma^{2}) of the population.

Willy and Ken’s task is to find these values. Estimate the parameters.

Here is what they did.

They visited the first shop and found that the estimated proportion of dummy candies is 0.02. As Ken separated the candies from the dummies, he also put them on a weighing scale and measured the weight of each candy.

The average weight of the 98 candies from the first store is 3.00748 grams.

Ken estimated this using the formula \hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}. This is the maximum likelihood estimator for the true mean.

They record this as the estimated sample mean for the first store.

They went to the second store. Here they did not find any dummies. The estimated sample mean weight of the 100 candies from this store is 3.0073 grams. This is another estimate of the true mean.

Overall, they now have 198 candies (a bigger sample). Ken also computed the sample mean for these 198 candies. This comes out as 3.007389 grams.

Like this, as they visited more and more stores, they re-computed the average weight of the candies using a larger sample. They also computed the individual estimate of the sample mean for that store.

Like in the case of the proportion of dummies, they want to check the convergence of the estimated mean.

“I think you can use the same code as before, but change the lines where you computed the proportions to computing the sample means,” said Ken as he is seen knee-deep into the lines of the code.

“You are correct! All I did was that and here we go. I changed the color of the converging line.”

“I will interpret this now,” said Ken. “The sample mean  \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is an unbiased estimate of the population mean  \mu . I know it from Lesson 67. The big red circle is the expected value of the sample means, the center of the distribution. So this is E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]. This is equal to the true mean \mu. The orange line, the estimate with increasing sample size also converges to this number.”

E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}] = \mu = 3

“Excellent. What else do you observe? Anything from the boxplot?”

“Hmm, it does look symmetric. Is the distribution of the sample means normally distributed?” asked Ken.

“I will let you mull over that. You do know about the Central Limit Theorem, right?” Wilfred finished writing some comments in his code.

# Graphic analysis for mean #

# For each store, compute the sample mean #
# The expected value of these estimates is the true parameter # 
# With each store visited, we can recompute the sample mean #
# This will converge and stabilize at the true parameter value #

nstores = ncol(candy_data) # number of stores 
ncandies = nrow(candy_data) # number of candies in a pack

candies = NULL # an empty vector that will be filled with actual candies data from each store
dummies = NULL # an empty vector that will be filled with dummy candies data from each store

sample_meanweight = NULL # as the sample increases, we re-compute the sample mean of the candies

sample_mean = matrix(NA,nrow=nstores,ncol=1) # save the value of sample mean for each store

plot(0,0,xlim=c(0,1200),ylim=c(2.945,3.055),xlab="Stores",ylab="Average Candy Weight",font=2,font.lab=2)

for (i in 1:nstores)
{
# sample pack of candies and the index of dummies in the pack
candy_pack = candy_data[,i]
dummy_index = which(candy_pack < 1.5)

# separating candies from the dummies in each pack 
sample_dummies = candy_pack[dummy_index]

if(length(dummy_index)>0){sample_candies = candy_pack[-dummy_index]}else{sample_candies = candy_pack}

# recording the weights on candies and dummies -- this vector will grow as we visit more stores, i.e., collect more samples 
dummies = c(dummies,sample_dummies)
candies = c(candies,sample_candies)

# computing the mean weight of the candies -- this vector will grow .. each incremental value is a better estimate with greater sample size. 
xbar = mean(candies)
sample_meanweight = c(sample_meanweight,xbar)

# storing the sample weight for the store in a matrix 
sample_mean[i,1] = mean(sample_candies)

# plot showing the trace of the sample mean -- converges to truth
points(i,sample_meanweight[i],cex=0.15,col="orange")
points(1000,sample_mean[i],cex=0.15,col="red")
Sys.sleep(0.05)
print(i)
}

Sys.sleep(1)
points(1000,mean(sample_mean),pch=19,col="red",cex=2)
Sys.sleep(1)
boxplot(sample_mean,add=T,range=0,boxwex=25,at=1050)
Sys.sleep(1)
text(1150,mean(sample_mean),expression(E(hat(mu))==3),col="red",font=2)

“What about Variance?” asked Wilfred.

“Same procedure. In fact, when I computed the sample mean for each store and the overall incremental sample, I also computed the sample variances using its maximum likelihood estimator \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}. I used this formula for each store and the overall incremental sample.”

“Give me a few minutes, I will code this up and show.” Ken looked excited. He finally got to write his code and look at the outputs. He typed up a few lines, mostly editing the previous versions.

Here is what they found.

Ken looked puzzled. He expected to see the same pattern; however, much to his surprise, he found that the expected value of the sample variance E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}] and the long-run converged estimate for the variance are not the same.

E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}] \neq \sigma^{2}

“Is there a bias?” he asked.

Wilfred was expecting this. He had done his homework from Lesson 67!

“Yes, there is.”

“The maximum likelihood estimator \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}} is a biased estimator.”

“Let me show you why.”

“I will rewrite the estimator as \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}} by using \bar{x} in place of \mu.”

\hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}

=\frac{1}{n}\sum(x_{i}^{2} + \bar{x}^{2} -2\bar{x}x_{i})

=\frac{1}{n}\sum x_{i}^{2} + \frac{1}{n} \sum \bar{x}^{2} - \frac{2}{n}\sum x_{i}\bar{x}

=\frac{1}{n}\sum x_{i}^{2} + \frac{1}{n} n \bar{x}^{2} - 2 \bar{x} \bar{x}

=\frac{1}{n}\sum x_{i}^{2} + \bar{x}^{2} - 2 \bar{x}^{2}

=\frac{1}{n}\sum x_{i}^{2} - \bar{x}^{2}

\hat{\sigma^{2}}=\frac{1}{n}(\sum x_{i}^{2} - n\bar{x}^{2})
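This shortcut form is an exact algebraic identity, and it is easy to sanity-check in R on any made-up sample (the numbers below are arbitrary, not the candy data):

```r
# (1/n) * sum((x - xbar)^2) equals (1/n) * (sum(x^2) - n * xbar^2)
x = c(2.9, 3.1, 3.0, 2.8, 3.2)  # arbitrary illustrative sample
n = length(x)
xbar = mean(x)

lhs = sum((x - xbar)^2) / n
rhs = (sum(x^2) - n * xbar^2) / n

all.equal(lhs, rhs)  # TRUE up to floating point
```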

“Now, let’s apply the expectation operator on this formula.”

E[\hat{\sigma^{2}}]=E[\frac{1}{n}(\sum x_{i}^{2} - n\bar{x}^{2})]

=\frac{1}{n}(\sum E[x_{i}^{2}] - n E[\bar{x}^{2}])

=\frac{1}{n}(\sum (V[x_{i}]+(E[x_{i}])^{2}) - n (V[\bar{x}]+(E[\bar{x}])^{2}))

“I am sure you know why I wrote that step. It comes from the property of the variance operator: V[X]=E[X^{2}]-(E[X])^{2}, so E[X^{2}]=V[X]+(E[X])^{2}.”

“Now, since x_{1}, x_{2}, ..., x_{n} are independent and identically distributed random samples from the population with true parameters \mu and \sigma^{2}, they will have the same expected value and variance as the population. E[x_{i}]=\mu and V[x_{i}]=\sigma^{2}. I will use this simple idea to reduce our equation.”

E[\hat{\sigma^{2}}] = \frac{1}{n}(\sum (\sigma^{2}+\mu^{2}) - n (V[\bar{x}]+(E[\bar{x}])^{2}))

“If you go back and read Lesson 68 (Bias joins Variance), you will see that V[\bar{x}]=\frac{\sigma^{2}}{n}. You already know that E[\bar{x}]=\mu. Let’s substitute those here.”

E[\hat{\sigma^{2}}] = \frac{1}{n}(\sum (\sigma^{2}+\mu^{2}) - n (\frac{\sigma^{2}}{n}+\mu^{2}))

E[\hat{\sigma^{2}}] = \frac{1}{n}(n\sigma^{2} + n\mu^{2} - \sigma^{2} -n\mu^{2})

E[\hat{\sigma^{2}}] = \frac{n-1}{n}\sigma^{2}

“So, \hat{\sigma^{2}} is biased by -\frac{\sigma^{2}}{n}.”

“That is why you are seeing that the expected value of the sample variances is not the same as where the overall incremental estimate (population variance) converges. The estimator with n in the denominator underestimates the true variance \sigma^{2} because of the negative bias.”

If you want an unbiased estimator, you should use \frac{1}{n-1} in the denominator instead of n.

The unbiased estimator for variance is \hat{\sigma^{2}}=\frac{1}{n-1}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}
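The bias result E[\hat{\sigma^{2}}]=\frac{n-1}{n}\sigma^{2} can be verified with a quick standalone simulation. This sketch assumes a normal population with \mu=3 and \sigma^{2}=4 (values chosen for illustration, not taken from the candy data):

```r
# Compare the n-denominator (MLE) and (n-1)-denominator variance estimators.
set.seed(1)
n = 10
sigma2 = 4
nsim = 50000

mle_var = numeric(nsim)       # divides by n: biased
unbiased_var = numeric(nsim)  # divides by n-1: unbiased

for (k in 1:nsim)
{
  x = rnorm(n, mean = 3, sd = sqrt(sigma2))
  ss = sum((x - mean(x))^2)
  mle_var[k] = ss / n
  unbiased_var[k] = ss / (n - 1)  # same as var(x)
}

mean(mle_var)       # close to ((n-1)/n)*sigma2 = 3.6 -- underestimates
mean(unbiased_var)  # close to sigma2 = 4
```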

Wilfred added a few lines in the code and showed this plot to Ken.

“Look at this. I have added another boxplot to your figure. This boxplot is the distribution of the bias-adjusted (unbiased) variance estimates. The green dot is now the expected value of this distribution, which matches with the true variance.”

Ken looks at it with a happy face, but something seems to bother him.

“What is it?” asked Jr.

“Is the distribution of variance asymmetric?” he asked.

“Yes. Is that all?”

“Well, if the variance is bias-adjusted to match the truth, does it also mean the standard deviation, which is \sqrt{\hat{\sigma^{2}}}, will get adjusted?”

“That is a good question.”

No, although \hat{\sigma^{2}}=\frac{1}{n-1}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}} is an unbiased estimator for \sigma^{2}, \hat{\sigma}=\sqrt{\frac{1}{n-1}{\displaystyle \sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}} is not an unbiased estimator for \sigma.

“Here, I re-ran the code with some changes to show this. The adjusted estimate still has some bias.”
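The residual bias in the standard deviation is easy to see in a standalone simulation as well (this is a separate sketch, not the story's code). It assumes a normal population with \sigma=0.2, a value chosen only to echo the candy example:

```r
# sd() uses the n-1 denominator, yet E[sd(x)] still falls short of sigma.
set.seed(2)
n = 10
sigma = 0.2
nsim = 50000

s = replicate(nsim, sd(rnorm(n, mean = 3, sd = sigma)))

mean(s)          # noticeably below sigma = 0.2
sigma - mean(s)  # the remaining bias in the standard deviation
```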

“Very interesting. I guess we are ready for the meeting. The true mean weight of the candies is 3 grams, and the true standard deviation of the weight of the candies is 0.2 grams. Can we have the code in the archives?” Ken starts packing his bag as he says this. He looks exhausted too.

“Here is the full code.”

T-0 days

Mr. Alfred Quality and Wilfred Quality Sr. were disappointed to know that there is a systematic packing error, although it is only 2%. They deployed their engineering team to fix this immediately.

They were nevertheless very impressed with the quality work of Wilfred Quality Jr. He owned the problem and came up with the answers. All along, he had collected a large sample that can be used for future tests, trained an enthusiastic high-school senior and showed that he is ready to be the next Mr. Quality for Quality Candies.

As he was leaving the boardroom, Mr. Alfred Quality asked him:
“Junior, great work, but how confident are you that the true value lies within two standard deviations?”

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 69 – The quest for truth: Learning estimators in R, part I

For the fifteenth time, Mr. Alfred Quality looks at the two candies in front of him and shakes his head. He is visibly disgruntled. He spent 50 years selling quality candies in North Dakota. The kids love it. Quality Candies are local, and classic North Dakotan. Now, this new “Classic Series Pineapple Candy” seems to have deviated from his expectations.

There is a knock on the door. “That must be Wilfred,” thought Mr. Quality, as he called out in an elevated voice, “Come on in.”

Wilfred Quality, son of Alfred Quality gently opens the doors and walks in. He is well dressed and looks around 50 years old. He could notice the displeasure of his father as he steps closer.

“Son, I have noticed a lapse in quality in this Classic Series Candy. There is a candy with no candy. What would our customers think? We need immediate action on this. I want to know what proportion of our candies in the stores are dummies. We have to fix this before the customers accuse us of cheating by filling in empty candies. Is this a packing machine error, or is it a trick to adjust for inflation?”

“I will look into it. It must be a machine error. There would be no such inflation adjustment tricks, I assure you,” said Wilfred, the current CEO of Quality Candies. He has been running the business very efficiently since his father started taking golf seriously.

“I will ask Junior to collect the data and give us the answers,” he said.

∞∞∞

Wilfred Quality Jr. is an enthusiastic lad. He is always in the quest for the truth. This summer, he is interning at one of his father’s offices in Hillsboro. He was unhappy at the quality issue and was determined to find the answers. In his mind, he is also thinking this would make a neat little data analysis project.

His father and grandfather gave him three questions to investigate.

What proportion of their candies is wrongly packed – dummies?

Is it a one-off error or is it systematic?

What is the true nature of our candies, i.e., what are the population characteristics?

He had 30 days, enough monetary resources (it’s his company after all), and RStudio at his disposal.

∞∞∞

For the next 25 days, he spent his time visiting around 40 local stores a day. From each store, he purchased a pack (100 candies) of “Classic Series Pineapple Candy.”

He also hired Keneth, the kid next door, to help him with some experiments. Keneth’s job was to take the candies out of the candy pack and separate the candies from the no candies, like this.

He should also weigh them and enter the readings into a data table. Some of those candies become his prizes after he is done with the experiment.

Wilfred Jr. then analyzes them in RStudio.

All in all, they visited 1000 stores and prepared this data file. The data file is a matrix with 100 rows and 1000 columns. Each column is the data of the weight of the candies from each store.

They take the data from the first store and make a simple dot chart.

“See, the two dummies I found,” said Ken after looking at the plot on Wilfred’s screen.

“So that is like 2 in 100, a 2% error. Let’s take the second store and see what comes up,” said Wilfred.

“Hmm. There are no dummies in this sample. That’s 2 in 200 then – 1% error.”

“Should we plot a few stores and see if there are variations?” said Ken.

“Yes, we can do that. Let me take the first 12 stores and make the plots.” Wilfred types the following lines in RStudio and gets the plot.

# increasing number of stores # 
par(mfrow=c(4,3))
for (i in 1:12)
{
store_sample = candy_samples[,i]
stripchart(store_sample,font=2,cex=1.25,pch=21,cex.axis=1.5)
}

They see some variation. Some candy samples have dummies, some do not. “It looks like the dummies are not a one-off incident. It is a pattern. There is some proportion that are dummies,” thought Wilfred.

They decided to dig deeper.

They also decided to use around 1.5 grams as a cutoff weight to screen for dummies.

If the weight is less than 1.5 grams in a sample, they will call it a dummy and compute the proportion of samples that are dummies.

Wilfred Jr. knows that the maximum likelihood estimator of p, the true proportion of dummies, is \hat{p}=\frac{r}{n}.

r is the number of dummies in a sequence of n independent and identically distributed random samples. He is using this knowledge to estimate the proportion of dummies.

The more the samples, the better his estimate will be.

Each store he visits gives him a new sample. The total number of candies keeps increasing. The more samples he gets, the closer he gets to the true population.

As the samples increase, the estimate gets closer to the truth.

“Let’s compute two estimates. \hat{p_{1}} as something that keeps changing as we add more samples; getting better as n increases. \hat{p_{2}} as the estimate for the store. Since we are visiting 1000 stores, \hat{p_{2}} is a distribution of size 1000. It represents the overall distribution of the proportion of dummies across all our stores.”

Wilfred seems to be speaking to himself. Ken was only interested in counting the candies.

Seeing him indifferent, Wilfred showed him the following video on his screen.

Ken’s face lit up. “It is wobbly at the beginning, but seems to stabilize at 0.02,” he said.

“Yes, as the sample size increases, the estimated proportion of dummies is better represented. So, the estimate gets closer to the true value. It looks like our true proportion of dummies is somewhere around 2% – 0.02024 to be precise.”

This is the consistency criterion of the estimates.

\displaystyle{\lim_{n\to\infty} P(|T_{n}(\theta)-\theta|>\epsilon)} \to 0

Estimators are a function of sample size. As n approaches infinity, the sample estimate approaches the true parameter.
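You can watch consistency happen in a self-contained simulation with a known truth; here p = 0.02 is an assumed value chosen only to mirror the story:

```r
# The running estimate r/n settles on the true proportion as n grows.
set.seed(3)
p_true = 0.02
x = rbinom(100000, size = 1, prob = p_true)  # 1 = dummy, 0 = proper candy
phat_running = cumsum(x) / seq_along(x)      # estimate after each candy

phat_running[c(100, 1000, 100000)]  # wobbly early, stable late
```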

Fisher put it this way.

Wilfred also showed another graphic, the distribution of \hat{p_{2}}.

This is the frequency distribution of the estimated proportion from each sample. For example, \hat{p_{2}} for the first store is 0.02, for the second store is 0 and so on.

Wilfred pointed out that the estimate \frac{r}{n} is an unbiased estimate of the true proportion of dummies. The distribution of the estimate \hat{p_{2}} will be centered on the true proportion p.

He is showing the expected value of the distribution, 0.02024, as the red triangle in the plot.

E[\hat{p_{2}}] = 0.02024, the same value we got from the large sample (population).

While there is variability from store to store, the average proportion of dummies is 0.02024.

Keneth is now curious about how to do these calculations and make such visuals.

Wilfred showed him his code.

# Proportions #

nstores = ncol(candy_samples) # number of stores 
ncandies = nrow(candy_samples) # number of candies per store

increase_sample = NULL # an empty value that will be filled as more and more samples arrive --> approaches the population. 
n = NULL # an empty value that will be filled by growing sample size.

phat1_bad = NULL # proportion changes over samples... and stabilizes: Consistency or Ergodicity
phat2_bad = matrix(NA,nrow=nstores,ncol=1) # proportion value for each store. The expected value of this vector should be the population proportion E[phat]

for (i in 1:nstores)
{
store_sample = candy_samples[,i]
increase_sample = c(increase_sample,store_sample)
n = c(n,length(increase_sample))

p1 = length(which(increase_sample<1.5))/n[i]
phat1_bad = c(phat1_bad,p1)


p2 = length(which(store_sample<1.5))/ncandies
phat2_bad[i,1] = p2
}

# plot showing the consistency property #
sam = seq(from=1,to=1000,by=20)
for(i in 1:length(sam))
{
plot(0,0,col="white",xlim=c(0,100000),ylim=c(0.005,0.03),font=2,cex.lab=1.25,font.lab=2,xlab="Sample Size",ylab="Proportion")
lines(n[1:sam[i]],phat1_bad[1:sam[i]])
txt = paste("n = ",n[sam[i]])
text(10000,0.03,txt,col="red")
Sys.sleep(0.5)
}

# Histogram of phat2_bad # 
hist(phat2_bad,font=2,main="",xlab="Proportion",font.lab=2,col="grey")
points(mean(phat2_bad),0.0001,pch=17,cex=2.5,col="red")

While Ken and Wilfred start exploring the candy characteristics – the true nature of the population – you can set up today’s experiment yourself and understand the properties of consistency and bias more closely using the data. See you in one week. Happy coding.


Lesson 68 – Bias joins Variance

In last week’s “guess the age” contest, Madam Chairman Yellen was the clear winner. The expected value of the estimated age was 69.69 years, a nominal bias of 1.31 years from her true age.

However, the great Thomas Sowell is the clear winner of our hearts. May he always remain a generation young, like his ideas.

Look at the table once again.

Can you notice the variation in the estimates of the groups?

Yes, Chairman Yellen’s estimated age ranges from 62 years to 78 years, while Dr. Sowell’s estimated age ranges from 55 years to 70 years.

Based on the data from the 16 groups, I computed the variance (and the standard deviation) of the estimated ages.

For Chairman Yellen, the variance of the estimated age is 22.10 and the standard deviation of the estimated age is \sqrt{22.10}=4.7 years.

For Dr. Sowell, the variance is 23.98 and the standard deviation is \sqrt{23.98}=4.9 years.

The variance of the estimator is an important property. It tells us about the spread of the estimate, i.e., how widely the estimate is distributed.

For any estimator \hat{\theta}, we compute the expected value E[\hat{\theta}] as a measure of the central tendency and how far it could be from the truth – Bias, and V[\hat{\theta}] as a measure of the spread of the distribution.

An estimator should be understood as a probability distribution.

The standard deviation of the estimator is called the standard error of an estimator. In our age guessing experiment, the standard error of Dr. Sowell’s estimated age is 4.9 years.

The standard error of \hat{\theta} is \sqrt{V[\hat{\theta}]}.


In lesson 67, while learning the concept of bias, we deduced that the sample mean \hat{\mu} = \bar{x} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is an unbiased estimate of the population mean \mu.

In other words, we derived that E[\hat{\mu}] = \mu .

Now, let’s derive the variance of the sample mean V[\hat{\mu}] or V[\bar{x}].

For this, let us consider that we have a random sample x_{1}, x_{2}, ..., x_{n} of size n from a population with a mean \mu and variance \sigma^{2}.

Since x_{1}, x_{2}, ..., x_{n} are independent and identically distributed random samples from the population with true parameters \mu and \sigma^{2}, they will have the same expected value and variance as the population. E[x_{i}]=\mu and V[x_{i}]=\sigma^{2}.

\hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is the sample mean.

V[\hat{\mu}] = V[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]

V[\hat{\mu}] = V[\frac{1}{n}(x_{1}+x_{2}+...+x_{n})]

V[\hat{\mu}] = \frac{1}{n^{2}}V[(x_{1}+x_{2}+...+x_{n})]

Where did that come from? Was it lesson 27? V[aX]=a^{2}V[X]

V[\hat{\mu}] = \frac{1}{n^{2}}(V[x_{1}]+V[x_{2}]+...+V[x_{n}])

Because the samples are independent, the covariance terms do not exist in the summation.

 V[\hat{\mu}] = \frac{1}{n^{2}}(\sigma^{2}+\sigma^{2}+...+\sigma^{2})=\frac{1}{n^{2}}*n \sigma^{2}=\frac{\sigma^{2}}{n}

The variance of the sample mean V[\hat{\mu}] = V[\bar{x}] =\frac{\sigma^{2}}{n}.

The standard error of the sample mean is \frac{\sigma}{\sqrt{n}}. It is a measure of the precision of the estimator.
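Here is a quick numerical confirmation of V[\bar{x}]=\frac{\sigma^{2}}{n}; the population values (\mu=0, \sigma^{2}=9, n=25) are arbitrary choices for the sketch:

```r
# The variance of many sample means should be close to sigma^2 / n.
set.seed(4)
n = 25
sigma2 = 9
nsim = 50000

xbar = replicate(nsim, mean(rnorm(n, mean = 0, sd = sqrt(sigma2))))

var(xbar)         # close to sigma2/n = 0.36
sqrt(sigma2 / n)  # the standard error, 0.6
```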


Joe has a question

Do you remember last week’s dilemma?

Joe wants to choose a path; an estimator for the true population mean.

We tried to help him by computing the bias of each of the three estimators. The idea was to choose the one that has no bias. But it turned out that all three estimators are unbiased.

How then, can we select the best estimator?

Look at the estimators again.

\hat{\mu_{1}} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

\hat{\mu_{2}} = x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}

\hat{\mu_{3}} = median(x_{1},x_{2},...,x_{n})

I told you that estimators should be understood as probability distributions and that we can compute the expected value E[\hat{\theta}] for the bias, and V[\hat{\theta}] as a measure of the spread of the distribution.

Let’s compute the variance of these three estimators.

V[\hat{\mu_{1}}] = V[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]

\hat{\mu_{1}} is the sample mean. We just derived its variance as V[\hat{\mu_{1}}] = \frac{\sigma^{2}}{n}.

So let’s work on the second estimator.

V[\hat{\mu_{2}}] = V[x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}]

V[\hat{\mu_{2}}] = V[ x_{[1]} + \frac{1}{2}x_{[n]} - \frac{1}{2}x_{[1]}] = V[\frac{1}{2}x_{[1]} + \frac{1}{2}x_{[n]}]

V[\hat{\mu_{2}}] = \frac{1}{4}\sigma^{2} + \frac{1}{4}\sigma^{2}

V[\hat{\mu_{2}}] = \frac{1}{2}\sigma^{2}

The variance of the third estimator \hat{\mu_{3}} will be \sigma^{2} if the sample size is odd-numbered, or \frac{1}{2}\sigma^{2} if the sample size is even-numbered.

So, the three estimators \hat{\mu_{1}}, \hat{\mu_{2}}, \hat{\mu_{3}} are all unbiased. Their variances are as follows:

V[\hat{\mu_{1}}]=\frac{\sigma^{2}}{n}

V[\hat{\mu_{2}}]=\frac{\sigma^{2}}{2}

V[\hat{\mu_{3}}]=\sigma^{2} or \frac{\sigma^{2}}{2}

The three estimator probability distributions are centered on the truth (\mu), but their spreads are different. If the sample size n is greater than 2 (n>2), estimator \hat{\mu_{1}} has the lowest variance.

They will look like this visually.

Among all the estimators that are unbiased, we choose the one which has minimum variance. \hat{\mu_{1}} is the minimum variance unbiased estimator. It is more likely to produce an estimate close to \mu, the truth.

Among all estimators which are unbiased, choose the one which has minimum variance. This chosen \hat{\theta} is called the minimum variance unbiased estimator (MVUE) of \theta, and it is most likely among all unbiased estimators to produce an estimate close to the truth. This principle is called the principle of minimum variance unbiased estimation.
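A simulation makes the comparison concrete. The sketch below assumes a normal population with \mu=5, \sigma=2 and n=20; all three estimators center on \mu, but the sample mean shows the smallest spread:

```r
# Empirical spread of the three unbiased estimators of mu.
set.seed(5)
n = 20
nsim = 20000
m1 = m2 = m3 = numeric(nsim)

for (k in 1:nsim)
{
  x = rnorm(n, mean = 5, sd = 2)
  m1[k] = mean(x)                         # sample mean
  m2[k] = min(x) + (max(x) - min(x)) / 2  # midrange
  m3[k] = median(x)                       # median
}

c(var(m1), var(m2), var(m3))  # var(m1) is the smallest of the three
```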


You must be wondering what if an estimator has low variance, but is biased. Like this.

Isn’t \hat{\theta_{1}} more likely to produce an estimate close to the truth?

Perhaps.

So there should be a way to combine bias and variance into one measure to help us with the selection.

The mean squared error is that measure. Let’s see how to derive this measure.

We have an estimator \hat{\theta} for the true parameter \theta.

The error of this estimator from the truth is \hat{\theta}-\theta.

The squared error is (\hat{\theta}-\theta)^{2}.

The mean squared error (MSE) is the expected value of the squared error.

E[(\hat{\theta}-\theta)^{2}]

MSE = E[(\hat{\theta}-\theta)^{2}]

MSE = V[\hat{\theta}-\theta] + (E[\hat{\theta}-\theta])^{2}

Can you tell how we wrote the above expression?

MSE = V[\hat{\theta}] + V[\theta] + (E[\hat{\theta}] - E[\theta])^{2}

MSE = V[\hat{\theta}] + (E[\hat{\theta}] - \theta)^{2}

This is because, E[\theta]=\theta and V[\theta]=0.

V[\hat{\theta}] is the variance of the estimator.

E[\hat{\theta}] - \theta is the bias of the estimator.

So, MSE = Variance + Bias^{2}

For unbiased estimators, the mean squared error is equal to the variance of the estimator.

We can use this measure to compare two estimators. If MSE(\hat{\theta_{1}})<MSE(\hat{\theta_{2}}) we can say that \hat{\theta_{1}} is a better estimator, a more efficient estimator in producing the truth.

The ratio \frac{MSE(\hat{\theta_{1}})}{MSE(\hat{\theta_{2}})} is called relative efficiency of \hat{\theta_{2}} to \hat{\theta_{1}}.

If you can create an estimator that has low variance but you have to compromise on the bias, you can go for it, as long as the reduction in variance is greater than the increase in the squared bias.

In estimating true parameters, there is always a tradeoff between bias and variance.
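The decomposition MSE = Variance + Bias^{2} can be checked numerically. This sketch uses the n-denominator variance estimator as \hat{\theta}, with an assumed normal population (\sigma^{2}=4, n=10):

```r
# Empirically verify MSE = Variance + Bias^2 for a biased estimator.
set.seed(6)
n = 10
sigma2 = 4
nsim = 50000

th = replicate(nsim,
{
  x = rnorm(n, mean = 0, sd = sqrt(sigma2))
  sum((x - mean(x))^2) / n  # the biased (n-denominator) variance estimator
})

mse = mean((th - sigma2)^2)  # direct mean squared error
v = var(th)                  # variance of the estimator
b = mean(th) - sigma2        # bias, close to -sigma2/n = -0.4

mse      # these two numbers agree...
v + b^2  # ...up to simulation noise
```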

To close off for the week, let me ask you three questions.

If \frac{MSE(\hat{\theta_{1}})}{MSE(\hat{\theta_{2}})} is the relative efficiency, what is consistency?

Is the estimator for variance \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}} unbiased? Did you do your homework?

When was the last time you were on RStudio?


Lesson 67 – Bias

“Nooooo,” “No way…” “Ooooh,” “Yessss,” “I told ya,” “I knew it.”

These are some of the funny reactions from our high-school students when the true age is finally revealed; because the bias in their estimation is also revealed.

Every year, I start off the data analysis boot camp with this game from the bag of tricks for teaching statistics by Andrew Gelman.

The students sit in groups of four and look at rolling pictures to guess their ages. They have their own formulas (estimators) for the age!

What you saw in the opening image are the guessed ages of the former Chairman of the Federal Reserve, Janet Yellen, and the great economist Thomas Sowell.

Do you see anything peculiar?

Yes. The estimated age of Chairman Yellen is somewhat close to her true age (as per Wikipedia) and the estimated age of Dr. Sowell is not even close (again, as per Wikipedia).

Let’s compute the average of all the guesses, the expected value of the estimated age.

For Chairman Yellen it is 69.69 years, a difference of 1.31 years from her true age.

Pretty close to the truth. The students’ estimate has very little bias.

For Dr. Sowell, it is 61.88 years, a difference of 25.13 years from his true age.

Far from the truth. The students’ estimate is biased.

Let’s spend a little more time on the estimated ages.

There are 16 groups that guessed the ages based on a picture of the person. Let us use a notation \hat{\theta} for this estimate. So, after heavy thinking and vehement deliberation within the group, the estimate from group A is \hat{\theta}=73 for Chairman Yellen.

Group B is also filled with thinkers and negotiators, so their estimate for Chairman Yellen’s age is \hat{\theta}=70.

Likewise, the 16 groups provided a distribution of estimates. The expected value E[\hat{\theta}] of this distribution is 69.69 years.

The true age \theta is 71 years.

The bias is E[\hat{\theta}]-\theta=-1.31

Look at this visual. I am taking the 16 estimates and plotting them as a histogram (or a probability density). The true age is also shown in the histogram. The estimates are distributed about the true age. The mean of this distribution is very close to 71.

Dr. Sowell, on the other hand, somehow seemed younger in the minds of the high-school students. The expected value of the estimate E[\hat{\theta}] is 61.88 years. The true age is 87 years, a bias (E[\hat{\theta}]-\theta) of around 25 years. The bias could be because of the picture I am using, his pleasant smile, or his aura!

This idea of measuring the closeness of the estimate to the true parameter is called the bias of the estimate.

An estimator (\hat{\theta}) (i.e., a formulation to compute the true population parameter) is unbiased if on average it produces values that are close to the truth. E[\hat{\theta}]=\theta. In other words, if we have a distribution of the estimates, and if their mean is close (or centered) to the truth, the estimator is unbiased.

♦Θ♦


In lesson 63, we learned that the maximum likelihood estimator of p, the true probability of success for a Binomial distribution is \hat{p}=\frac{r}{n}, where r is the number of successes in a sequence of n independent and identically distributed Bernoulli random numbers. Do you think \hat{p} is unbiased?

Let’s check it out.

The bias of \hat{p} is defined by E[\hat{p}]-p.

E[\hat{p}]=E[\frac{r}{n}]=\frac{1}{n}E[r]=\frac{1}{n}*np=p

The key point in the above derivation is that the expected value of the number of successes (r) for a binomial distribution is np.

The estimate \hat{p}=\frac{r}{n} is an unbiased estimate of the true probability of success.

The distribution of the estimate \hat{p} will be centered on the true probability p.
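A quick standalone check of E[\hat{p}]=p; the values p=0.3 and n=50 are assumed for the sketch:

```r
# The average of many r/n estimates sits on the true probability of success.
set.seed(7)
n = 50
p = 0.3
nsim = 50000

r = rbinom(nsim, size = n, prob = p)  # number of successes in each sample
phat = r / n

mean(phat)  # close to p = 0.3
```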

§⇔§

The maximum likelihood estimators for the Normal distribution are \hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} and \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}. Do you think they are unbiased?

\hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is the sample mean. Let’s compute the bias of this estimate.

E[\hat{\mu}] = E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]

E[\hat{\mu}] = E[\frac{1}{n}(x_{1}+x_{2}+...+x_{n})]

E[\hat{\mu}] = \frac{1}{n}E[(x_{1}+x_{2}+...+x_{n})]

E[\hat{\mu}] = \frac{1}{n}(E[x_{1}]+E[x_{2}]+...+E[x_{n}])

Since x_{1}, x_{2}, ..., x_{n} are random samples from the population with true parameters \mu and \sigma, E[x_{i}]=\mu and V[x_{i}]=\sigma^{2}. In other words, since x_{i}s are independent and identically distributed, they will have the same expected value and the variance as the population.

So, E[\hat{\mu}] = \frac{1}{n}(\mu+\mu+...+\mu)=\frac{1}{n}*n \mu=\mu

The sample mean \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is an unbiased estimate of the population mean \mu.

What about the variance estimator \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}?

You know the chore now. Compute the bias of this estimator and find out for yourself. Passive minds have to wait until next week.

♦Θ♦


Joe is a smart chap. The other day, he asked me if he could use any other estimator to get the population mean (truth). I probed him further, and he came up with three estimators to estimate the true mean \mu.

\hat{\mu_{1}} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

\hat{\mu_{2}} = x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}

\hat{\mu_{3}} = median(x_{1},x_{2},...,x_{n})

The first estimator \hat{\mu_{1}} is known to all of us. It is the sample mean.

The second estimator \hat{\mu_{2}} is computing the central tendency as the sum of the smallest value x_{[1]} and half the range of the data \frac{x_{[n]}-x_{[1]}}{2}. It is a reasonable way to tell where the center will be.

The third estimator \hat{\mu_{3}} is the median of the sample.

Can you help Joe choose the correct path?

Your immediate reaction would be, “compute the bias for each estimator and select the one that has the smallest bias.”

Fair enough. Let’s try that then.

We already showed that the first estimator \hat{\mu_{1}} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is unbiased.

Employing the bias computation for the second estimator, we can see that

E[\hat{\mu_{2}}] = E[x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}] = E[x_{[1]}] + \frac{1}{2}(E[x_{[n]}]-E[x_{[1]}])

E[\hat{\mu_{2}}] = \mu + \frac{1}{2}(\mu-\mu)=\mu

The second estimator is also unbiased.

Now, depending on whether there is an even or an odd number of samples, you can derive that the third estimator is also unbiased.

How then, can Joe select the best estimator?

To be continued…


Lesson 66 – Spare a moment

Hello! Do you have a moment?

What is the first thing that comes to your mind when you hear the word “moment?”

For some of you, it is “let me capture and share the moment.”

For some of you, it is the coffee cup cliches — “live in the moment,” “enjoy every moment,” and “be happy for this moment.”

Some of you may actually believe in the “power of the present moment” or the “don’t wait for the perfect moment, take the moment and make it perfect” stuff!

How many of you remember torque from your high-school physics lessons?

The torque of a force, or moment of a force about the axis of rotation?

\Gamma = \textbf{r} * \textbf{F}

Moment of a force is the product of the force and its distance from an axis. It is a measure of the tendency of the force to cause a body to rotate.

I am sure you are thinking that the moment you spared for me is not worth it. You don’t want to recollect those moments from the horrid high-school days.

Bear with me for a few more moments. You will see why we are talking moments in data analysis.

You know by now that we can use a sample to estimate the true value of the parameter of the population. The formulation used to get that estimate is called an estimator, and there are methods of estimation of these parameters. In lesson 63, lesson 64 and lesson 65, we dabbled with the method of Maximum Likelihood Estimation (MLE).

Another method of estimating population parameters from the sample is the Method of Moments.

Let’s see how the idea of the moment from physics is related to the Method of Moments.

Assume that we have a sample of five data points x_{1}, x_{2}, x_{3}, x_{4}, x_{5} from a population with a probability density function f_{X}(x). Let’s place them on a number line like this.

Imagine that each number is represented by an equal-weighted ball. Since each data point is independent and identically distributed, i.e., equally likely, it is reasonable to assume that the probability of their individual occurrences (\frac{1}{n}) is the mass of the data point.

For any data point, the torque, or the moment about the axis of rotation is \Gamma = \textbf{r} * \textbf{F}.

For example, a data point at a distance of x_{i} from the axis has a moment of x_{i}*\frac{1}{n}. I am dropping the standard gravity term, and you will see why shortly.

Since there is more than one force acting on the sample number line, we can compute the total torque as \Gamma = {\displaystyle \sum_{i=1}^{n}r_{i} * F_{i}}.

In other words, the total moment of the sample data about the axis of rotation is {\displaystyle \sum_{i=1}^{n}x_{i}*\frac{1}{n}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}.

Did you notice that this is the equation for the sample mean? \bar{x}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}. The centroid of the data points.

The population has a probability density function f_{X}(x).

We saw in Lesson 64 on how to divide the space into equal intervals of length dx. The probability that a given sample data point x_{i} falls in the range dx is f_{X}(x_{i})dx for an infinitesimally small range dx. Like in the sample case, we can assume this is the mass of any subdivision. Hence, the moment of this infinitesimal body about the rotation axis is x_{i}*f_{X}(x_{i})dx.

For the entire population, the total moment is {\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx}.

I know. You will tell me that this is the expected value E[X], or the centroid of the probability density function.

E[X] = {\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx}

If the five data points x_{1}, x_{2}, x_{3}, x_{4}, x_{5} are originating from a population with a probability density function f_{X}(x), it is reasonable to equate the total moment of the sample data to the total moment of the population data. The centroid of the sample is equal to the centroid of the probability density functions.

{\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx} =\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}


E[X] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}
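As a one-parameter illustration of this equation, consider an exponential population, where E[X]=\frac{1}{\lambda}. Equating E[X] to \bar{x} gives the moment estimator \hat{\lambda}=\frac{1}{\bar{x}}. A sketch with an assumed \lambda=0.5:

```r
# Method of moments for the exponential distribution: set 1/lambda = xbar.
set.seed(8)
lambda_true = 0.5
x = rexp(100000, rate = lambda_true)

lambda_hat = 1 / mean(x)  # moment estimator
lambda_hat                # close to 0.5
```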


Let me pester you again!

What is the moment of inertia?

No worries. Let’s activate your deep memory neurons. Imagine that the sample data points are rotating on the plane about the axis with a velocity v = r\omega, where \omega is the constant angular velocity of the plane. In our example of five data points, the velocity of any given point will be v_{i} = x_{i}\omega.

If the body is rotating, it will be subject to a tangential force of m\frac{dv}{dt}. For our sample point, which is at a radial distance of x_{i} from the axis, the tangential force will be \frac{1}{n}\frac{dv}{dt} = \frac{1}{n}x_{i}\frac{d \omega}{dt}.

To simplify life, let’s call \frac{d \omega}{dt}=\alpha, the angular acceleration of the point.

The tangential force is \frac{1}{n}x_{i}\alpha, and the torque of the tangential force applied on one point is x_{i}*\frac{1}{n}x_{i}\alpha = \frac{1}{n}x_{i}^{2}\alpha.

The total torque of all the tangential forces = {\displaystyle \sum_{i=1}^{n}\frac{1}{n}x_{i}^{2}\alpha}.

The term {\displaystyle \sum_{i=1}^{n}\frac{1}{n}x_{i}^{2}} is called the moment of inertia of the points.

Just looking at the equation, you can say that the moment of inertia of points far apart is larger than the moment of inertia of the points closer to the axis.

This moment of inertia is called the second moment, and it gives us a sense of how spread out the data points are about the axis.

Sample moment of inertia = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}.
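The spread claim above is easy to verify. Here is a small sketch with two hypothetical samples that share the same centroid: the one with points farther from the axis has the larger moment of inertia.

```python
# Two hypothetical samples with the same centroid (5.0): the more
# spread-out sample has the larger moment of inertia.
def moment_of_inertia(x):
    """(1/n) * sum(x_i^2): the sample second moment about the axis."""
    return sum(xi ** 2 for xi in x) / len(x)

tight = [4.5, 5.0, 5.5]   # clustered near the centroid
spread = [1.0, 5.0, 9.0]  # same centroid, points farther from the axis

print(moment_of_inertia(tight))   # smaller
print(moment_of_inertia(spread))  # larger
```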

By extension to the population, we can write the population moment of inertia, or the population second moment as

E[X^{2}]={\displaystyle \int_{- \infty}^{\infty}x^{2}f_{X}(x)dx}.

Notice the generalization E[X^{2}] for the second moment, just like E[X]={\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx} is the first moment.

Population moments are the expected values of the powers of the random variable X.

I can sense that you are charged up by the word “spread.” Yes, the second moment will be used to get the second parameter, the variance of the distribution.

We can equate the sample moment of inertia to the population moment of inertia.


E[X^{2}] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}


This form of equating sample moments to the population moments is called the Method of Moments for parameter estimation.

It was developed by Karl Pearson in the early 1900s. Pearson demonstrated this method by fitting different forms of frequency curves (probability density functions) to data by calculating as many moments from the sample data as there are parameters of the probability density function.

The Method of Moments can be represented in a generalized form like this.


E[X^{k}] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{k}}


To estimate the k parameters of the probability density function, we can equate the first k population moments to the first k sample moments.

If the probability density function has one parameter, the first moment equation is sufficient. If the function has two parameters, we have the first and the second moment equations — two unknowns, two equations to solve. If the function has three parameters, we go for three moment equations.
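The general recipe above reduces to one small helper: the k-th sample moment \frac{1}{n}\sum x_{i}^{k}. A minimal sketch, with hypothetical data values:

```python
# The k-th sample moment, (1/n) * sum(x_i ** k). One function covers
# every equation the Method of Moments needs.
def sample_moment(x, k):
    """Return the k-th sample moment of the data x."""
    return sum(xi ** k for xi in x) / len(x)

x = [2.0, 4.0, 5.0, 7.0, 12.0]  # assumed example data
print(sample_moment(x, 1))  # first moment: the sample mean
print(sample_moment(x, 2))  # second moment
```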

Time for some action

Let’s solve for the parameters of the normal distribution.

There is a sample of n data points x_{1}, x_{2}, x_{3}, ..., x_{n} that we think originated from a normal distribution (population) with a probability density function f_{X}(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-1}{2}(\frac{x-\mu}{\sigma})^{2}}.

x_{1}, x_{2}, x_{3}, ..., x_{n} \sim N(\mu, \sigma)

\mu and \sigma are the parameters of the normal probability density function. Let’s apply the Method of Moments to estimate these parameters from the sample. In other words, let’s formulate the equations for \mu and \sigma.

The first population moment is E[X^{1}]. The first sample moment is \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{1}}.

For the normal distribution, we know that the expected value E[X] is the mean \mu.

\mu = E[X] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

The second population moment is E[X^{2}]. The second sample moment is \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}.

E[X^{2}]. Hmm. You have to go down memory lane, to Lesson 26, to recall that E[X^{2}]=V[X] + (E[X])^{2}.

.
.
.
Did you see that? Yes, so E[X^{2}]=\sigma^{2} + \mu^{2}. Equating this to the sample second moment, we get

\sigma^{2} + \mu^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}

\sigma^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}} - \mu^{2}

\sigma^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}} - (\bar{x})^{2}

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - n\bar{x}^{2}\Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - 2n\bar{x}^{2} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - 2\bar{x}n\bar{x} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - 2\bar{x}\sum x_{i} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - \sum 2\bar{x} x_{i} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - \sum 2\bar{x} x_{i} + \sum \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \sum (x_{i}^{2} - 2\bar{x} x_{i} + \bar{x}^{2})

\sigma^{2} = \frac{1}{n} \sum (x_{i} - \bar{x})^{2}
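The two normal-distribution estimators derived above can be sketched in a few lines. The speed readings below are hypothetical values chosen so the mean matches officer Jones's 50.1 mph; \hat{\mu} is the first sample moment and \hat{\sigma}^{2} is the second sample moment minus the square of the first.

```python
# Method of Moments for the normal distribution, following the
# derivation above. The data values are hypothetical speed readings (mph).
x = [48.2, 50.1, 52.3, 49.5, 50.4]
n = len(x)

m1 = sum(x) / n                   # first sample moment
m2 = sum(xi ** 2 for xi in x) / n  # second sample moment

mu_hat = m1
sigma2_hat = m2 - m1 ** 2          # equals (1/n) * sum((x_i - x_bar)**2)

# sanity check against the centered form derived above
centered = sum((xi - mu_hat) ** 2 for xi in x) / n
assert abs(sigma2_hat - centered) < 1e-9

print(mu_hat, sigma2_hat)
```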


Let’s spend a few more moments and solve for the parameters of the Gamma distribution.

The probability density function of the Gamma distribution is

f(x) = \frac{\lambda e^{-\lambda x}(\lambda x)^{r-1}}{(r-1)!}

It has two parameters, r and \lambda.

\lambda is called the scale parameter (strictly, the rate, or inverse scale, in this parameterization). It controls the width of the distribution, and r is called the shape parameter because it controls the shape of the distribution.

We know that the expected value of the Gamma distribution E[X] = \frac{r}{\lambda}. If you forgot this, solve the integral \int xf(x)dx for the Gamma distribution. Equating this first population moment to the sample moment, we can see that

\frac{r}{\lambda} = \frac{1}{n}\sum x_{i} = \bar{x}

r = \lambda \bar{x}

Then, we know that V[X] for the Gamma distribution is \frac{r}{\lambda^{2}}.

Using this with the second population moment, E[X^{2}]=V[X] + (E[X])^{2}, we can see that E[X^{2}] = \frac{r(r+1)}{\lambda^{2}}.

We can now equate the second population moment to the second sample moment.

E[X^{2}] = \frac{1}{n}\sum x_{i}^{2}

\frac{r(r+1)}{\lambda^{2}} = \frac{1}{n}\sum x_{i}^{2}

\frac{\lambda \bar{x} (\lambda \bar{x}+1)}{\lambda^{2}} = \frac{1}{n}\sum x_{i}^{2}

\frac{\bar{x}(\lambda \bar{x}+1)}{\lambda} = \frac{1}{n}\sum x_{i}^{2}

 \bar{x} + \lambda \bar{x}^{2} = \frac{\lambda}{n}\sum x_{i}^{2}

\lambda (\frac{1}{n}\sum x_{i}^{2} - \bar{x}^{2}) = \bar{x}

Using the identity \frac{1}{n}\sum x_{i}^{2} - \bar{x}^{2} = \frac{1}{n}\sum(x_{i}-\bar{x})^{2} that we established in the normal case,

\lambda = \frac{\bar{x}}{\frac{1}{n}\sum(x_{i}-\bar{x})^{2}}

So,

r = \frac{\bar{x}^{2}}{\frac{1}{n}\sum(x_{i}-\bar{x})^{2}}
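The two closed forms above plug straight into code. A minimal sketch with hypothetical data: \hat{\lambda} = \bar{x}/v and \hat{r} = \bar{x}^{2}/v, where v = \frac{1}{n}\sum(x_{i}-\bar{x})^{2}.

```python
# Method of Moments for the Gamma distribution, using the closed forms
# derived above. The data values are made up for illustration.
x = [1.2, 3.4, 2.1, 4.8, 2.5, 3.0]
n = len(x)

x_bar = sum(x) / n
v = sum((xi - x_bar) ** 2 for xi in x) / n  # (1/n) * sum((x_i - x_bar)**2)

lambda_hat = x_bar / v
r_hat = x_bar ** 2 / v

# consistency checks: r_hat = lambda_hat * x_bar, and the fitted mean
# r/lambda reproduces the sample mean
assert abs(r_hat - lambda_hat * x_bar) < 1e-9
assert abs(r_hat / lambda_hat - x_bar) < 1e-9

print(lambda_hat, r_hat)
```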

Very tiring, I know. You will rarely find all this algebra in textbooks, so these moments are worth it.

I asked you to spare a moment. Now you are going “this is quite a moment.” What do you think? Would you remember the Method of Moments next time you spare a moment?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.