Lesson 75 – Fiducia on the variance

This morning, Officer Jones is deployed to NY State Highway 17A to monitor speeding vehicles. It's been a rather damp morning; his computer has recorded the speeds of only 20 vehicles so far.

He is nicely tucked away in a corner where jammers cannot find him. With no vehicles in sight, he starts playing with the data.

He computes the average vehicle speed.

\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}

Mean vehicle speed \bar{x} = 50.10 miles per hour.

But Officer Jones is a variance freak. He computes the sample variance and the sample standard deviation.

s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}

The sample variance is 5.9 mph^{2}, and the sample standard deviation is 2.43 mph.
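These summary statistics are easy to reproduce in code. Here is a minimal sketch using Python's `statistics` module; the speeds below are made up for illustration, since Officer Jones's raw readings are not listed.

```python
import math
import statistics

# Hypothetical speed readings (mph) -- illustrative only,
# not the actual observations from the radar unit.
speeds = [48.2, 51.7, 49.9, 53.1, 50.4, 47.6, 52.3, 50.8, 49.1, 51.2]

n = len(speeds)
xbar = statistics.mean(speeds)       # sample mean
s2 = statistics.variance(speeds)     # sample variance (n-1 denominator)
s = statistics.stdev(speeds)         # sample standard deviation

# Cross-check against the formula s^2 = (1/(n-1)) * sum((x_i - xbar)^2)
s2_manual = sum((x - xbar) ** 2 for x in speeds) / (n - 1)
assert math.isclose(s2, s2_manual)
assert math.isclose(s, math.sqrt(s2))
```

Note that `statistics.variance` uses the n-1 denominator from the formula above; `statistics.pvariance` is the version that divides by n.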

“I wonder how large this deviation can get,” thought Officer Jones.

Can we answer his question?

The answer lies in knowing the confidence interval on the variance.

We now know how to construct the confidence interval on the true mean based on the sample.

The interval [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}] is the 100(1-\alpha)\% confidence interval of the population mean \mu.

The basis for this is the idea that we can find an interval that has a certain probability of containing the truth.

In other words, a 100(1-\alpha)\% confidence interval for a parameter \theta is an interval [l, u] that has the property P(l \le \theta \le u) = 1-\alpha.

We can apply the same logic to the variance. Officer Jones is wondering about the true standard deviation \sigma, or the true variance \sigma^{2}. He is wondering whether he can derive an interval for the true variance given his limited data.

How can we derive the confidence interval on the variance?

Remember what we did to derive the confidence interval on the mean. We investigated the limiting distribution of the sample mean \bar{x} and applied the probability statement P(l \le \theta \le u) = 1-\alpha to this distribution to derive the end points, the lower and upper confidence limits.

Let’s do the same thing here.

What is the limiting distribution for the sample variance?

.

.

.

Those who followed lesson 73 carefully will jump up to say that it is related to the Chi-square distribution.

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

If you did not, no problem. We will derive it again.

Let’s take the equation of the sample variance s^{2} and find a pattern in it.

s^{2} = \frac{1}{n-1} \sum(x_{i}-\bar{x})^{2}

Move the n-1 over to the left-hand side and do some algebra.

(n-1)s^{2} = \sum(x_{i}-\bar{x})^{2}

(n-1)s^{2} = \sum(x_{i} - \mu -\bar{x} + \mu)^{2}

(n-1)s^{2} = \sum((x_{i} - \mu) -(\bar{x} - \mu))^{2}

(n-1)s^{2} = \sum[(x_{i} - \mu)^{2} + (\bar{x} - \mu)^{2} -2(x_{i} - \mu)(\bar{x} - \mu)]

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + \sum (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)\sum(x_{i} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(\sum x_{i} - \sum \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(n\bar{x} - n \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)(\bar{x} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)^{2}

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2}
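The identity in the last line is pure algebra: it holds for any constant \mu, not just the true mean. A quick numerical check can confirm this (the sample and the value of \mu below are arbitrary):

```python
import random

random.seed(1)

n = 20
x = [random.gauss(50.0, 2.4) for _ in range(n)]   # arbitrary sample
xbar = sum(x) / n

mu = 48.7   # any constant works; the identity is algebraic, not statistical

lhs = sum((xi - xbar) ** 2 for xi in x)                       # (n-1) * s^2
rhs = sum((xi - mu) ** 2 for xi in x) - n * (xbar - mu) ** 2

assert abs(lhs - rhs) < 1e-8
```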

Let’s divide both sides of the equation by \sigma^{2}.

\frac{(n-1)s^{2}}{\sigma^{2}} = \frac{1}{\sigma^{2}}(\sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2})

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - \frac{n}{\sigma^{2}} (\bar{x} - \mu)^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - (\frac{\bar{x} - \mu}{\sigma/\sqrt{n}})^{2}

Each term on the right-hand side is now a squared standard normal random variable.

\frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2}, where Z_{i} = \frac{x_{i} - \mu}{\sigma} and Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}; in effect, the sum of squares of (n - 1) standard normal random variables.

As we learned in lesson 53, if there are n standard normal random variables, Z_{1}, Z_{2}, ..., Z_{n}, their sum of squares is a Chi-square distribution with n degrees of freedom.

Its probability density function is f(\chi)=\frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n}{2}-1)!} for \chi > 0 and 0 otherwise.

Since we have \frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n-1}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n-1}{2}-1)!}

Depending on the degrees of freedom, the distribution of \frac{(n-1)s^{2}}{\sigma^{2}} looks like this.

Smaller sample sizes imply lower degrees of freedom, and the distribution will be highly skewed (asymmetric).

Larger sample sizes, i.e., higher degrees of freedom, pull the distribution toward symmetry.
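A small Monte Carlo experiment can confirm the claim: draw many samples of size n from a normal distribution, compute \frac{(n-1)s^{2}}{\sigma^{2}} for each, and check that the simulated mean and variance match those of a Chi-square distribution with n-1 degrees of freedom (a Chi-square variable with k degrees of freedom has mean k and variance 2k). The choice of \sigma below is arbitrary.

```python
import random
import statistics

random.seed(7)

n, sigma = 20, 2.5
reps = 5000

vals = []
for _ in range(reps):
    sample = [random.gauss(0.0, sigma) for _ in range(n)]
    s2 = statistics.variance(sample)            # n-1 denominator
    vals.append((n - 1) * s2 / sigma ** 2)

m = statistics.mean(vals)       # should be close to k = n-1 = 19
v = statistics.variance(vals)   # should be close to 2k = 38
```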


We can now apply the probability equation and define a 100(1-\alpha)\% confidence interval for the true variance \sigma^{2}.

P(\chi_{l,n-1} \le \frac{(n-1)s^{2}}{\sigma^{2}} \le \chi_{u,n-1}) = 1-\alpha

\alpha is between 0 and 1. You know that a 95% confidence interval means \alpha = 0.05, and a 99% confidence interval means \alpha = 0.01.

\chi_{l,n-1} and \chi_{u,n-1} are the lower and upper critical values from the Chi-square distribution with n-1 degrees of freedom.

The main difference here is that the Chi-square distribution is not symmetric. We should calculate both the lower limit and the upper limit that correspond to a certain level of probability.

Take, for example, a 95% confidence interval. We need \chi_{l,n-1} and \chi_{u,n-1}, the quantiles that leave a 2.5% probability in the left tail and a 2.5% probability in the right tail, respectively.

P(\chi \le \chi_{l,n-1}) = 0.025

and

P(\chi > \chi_{u,n-1}) = 0.025
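If you prefer computing these quantiles to reading them off a table, SciPy users can call `scipy.stats.chi2.ppf`. Below is a standard-library-only sketch: the Chi-square CDF with df degrees of freedom equals the regularized lower incomplete gamma function P(df/2, x/2), which we evaluate by its series expansion and then invert by bisection.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF: the regularized lower incomplete gamma
    function P(df/2, x/2), computed via its series expansion."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > total * 1e-15:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

def chi2_ppf(p, df):
    """Chi-square quantile (inverse CDF) by bisection."""
    lo, hi = 0.0, df + 100.0 * math.sqrt(df)   # generous bracket
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For df = 19, `chi2_ppf(0.025, 19)` and `chi2_ppf(0.975, 19)` reproduce the table values near 8.91 and 32.85.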

Going back to the probability equation
P(\chi_{l,n-1} \le \frac{(n-1)s^{2}}{\sigma^{2}} \le \chi_{u,n-1}) = 1-\alpha

With some rearrangement within the inequality, we get

P(\frac{(n-1)s^{2}}{\chi_{u,n-1}} \le \sigma^{2} \le \frac{(n-1)s^{2}}{\chi_{l,n-1}}) = 1-\alpha

The interval [\frac{(n-1)s^{2}}{\chi_{u,n-1}}, \frac{(n-1)s^{2}}{\chi_{l,n-1}}] is called the 100(1-\alpha)\% confidence interval of the population variance \sigma^{2}.

We can take the square roots of the confidence limits to get the confidence interval on the true standard deviation.

The interval [\sqrt{\frac{(n-1)s^{2}}{\chi_{u,n-1}}}, \sqrt{\frac{(n-1)s^{2}}{\chi_{l,n-1}}}] is called the 100(1-\alpha)\% confidence interval of the population standard deviation \sigma.

Let’s now go back to the data and construct the confidence interval on the variance and the standard deviation.

Officer Jones has a sample of 20 vehicles. n=20. The sample variance s^{2} is 5.9 and the sample standard deviation s is 2.43.

Since n = 20, \frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with 19 degrees of freedom. The lower and upper critical values at the 95% confidence interval \chi_{l,19} and \chi_{u,19} are 8.90 and 32.85.

You must be wondering how tedious it would be to compute these quantiles for every new sample.

That problem is solved for us. Like the t-table, the upper tail probabilities of the Chi-square distribution for various degrees of freedom are tabulated in the Chi-square table.

You can get them from any statistics textbook, or a simple internet search for “Chi-square table” will do. Here is an example.

The first column is the degrees of freedom. The subsequent columns are the computed values for P(\chi > \chi_{u}), the right tail probability.

For our example, since we are interested in the 95% confidence interval, we look under df = 19 and identify \chi^{2}_{0.975} = 8.90 as the lower limit that yields a probability of 2.5% on the left tail, and \chi^{2}_{0.025} = 32.85 as the upper limit that yields a probability of 2.5% on the right tail.

Substituting these back into the confidence interval equation, we get

P(\frac{19*5.90}{32.85} \le \sigma^{2} \le \frac{19*5.90}{8.90}) = 0.95

The 95% confidence interval for the true variance is [3.41, 12.59].

If we take the square root of these limits, we get the 95% confidence interval for the true standard deviation.

P(1.85 \le \sigma \le 3.54) = 0.95

The 95% confidence interval for the true standard deviation is [1.85, 3.54].
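The arithmetic above is easy to script. The sketch below uses the critical values to one more decimal place (8.907 and 32.852, from a fuller Chi-square table), which explains any small discrepancy in the last digit of the rounded limits.

```python
import math

n, s2 = 20, 5.9                  # Officer Jones's sample size and variance
chi_l, chi_u = 8.907, 32.852     # Chi-square critical values, df = 19, alpha = 0.05

var_lo = (n - 1) * s2 / chi_u    # lower confidence limit on sigma^2
var_hi = (n - 1) * s2 / chi_l    # upper confidence limit on sigma^2
sd_lo, sd_hi = math.sqrt(var_lo), math.sqrt(var_hi)

print(round(var_lo, 2), round(var_hi, 2))   # 3.41 12.59
print(round(sd_lo, 2), round(sd_hi, 2))     # 1.85 3.55
```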

So, to finally answer Officer Jones's question: at the 95% confidence level, the deviation in vehicle speed can be as large as 3.54 mph and as low as 1.85 mph. The interval itself is random. As you know by now, the real interpretation is:

If we have a lot of different samples, and if we compute the 95% confidence intervals for these samples using the sample variance s^{2} and the critical limits from the Chi-square distribution, in the long run, 95% of these intervals will contain the true value of \sigma^{2}.
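This long-run interpretation can be checked by simulation: pick a true \sigma^{2}, generate many samples of size 20, build the interval from each, and count how often the interval covers the truth. In the sketch below, the true variance of 4.0 is an arbitrary choice for the experiment.

```python
import math
import random
import statistics

random.seed(42)

n = 20
sigma2_true = 4.0                # arbitrary "truth" for the experiment
chi_l, chi_u = 8.907, 32.852     # Chi-square critical values, df = 19, alpha = 0.05

reps = 20000
hits = 0
for _ in range(reps):
    sample = [random.gauss(0.0, math.sqrt(sigma2_true)) for _ in range(n)]
    s2 = statistics.variance(sample)
    lo = (n - 1) * s2 / chi_u
    hi = (n - 1) * s2 / chi_l
    if lo <= sigma2_true <= hi:
        hits += 1

coverage = hits / reps   # should settle near 0.95 in the long run
```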

A few vehicles must have sneaked by in the short run.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

