Lesson 75 – Fiducia on the variance

This morning, Officer Jones is deployed to NY State Highway 17A to monitor speeding vehicles. It's been a rather damp morning; his computer has recorded the speeds of only 20 vehicles so far.

He is nicely tucked away in a corner where jammers cannot find him. With no vehicles in sight, he starts playing with the data.

He computes the average vehicle speed.

\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}

Mean vehicle speed \bar{x} = 50.10 miles per hour.

But officer Jones is a variance freak. He computes the sample variance and the sample standard deviation.

s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}

The sample variance is 5.9 mph^{2}, and the sample standard deviation is 2.43 mph.
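These two formulas are easy to verify in code. The lesson works in R, but here is a quick Python sketch; the 20 recorded speeds are not listed in the lesson, so the sample below is purely hypothetical.

```python
import numpy as np

# Hypothetical sample of 20 vehicle speeds (mph) -- the actual data
# from the lesson are not listed, so these numbers are for illustration.
speeds = np.array([48.2, 51.5, 49.9, 53.1, 50.4, 47.8, 52.6, 50.0, 49.1, 51.9,
                   48.7, 50.8, 46.9, 52.2, 49.5, 50.6, 51.2, 48.4, 53.5, 49.0])

n = len(speeds)
x_bar = speeds.sum() / n                         # sample mean
s2 = ((speeds - x_bar) ** 2).sum() / (n - 1)     # sample variance, (n-1) divisor
s = s2 ** 0.5                                    # sample standard deviation

print(round(x_bar, 2), round(s2, 2), round(s, 2))
```

Note the (n-1) divisor in the sample variance; it matches numpy's `ddof=1` option.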

“I wonder how large this deviation can get,” thought Officer Jones.

Can we answer his question?

The answer lies in knowing the confidence interval on the variance.

We now know how to construct the confidence interval on the true mean based on the sample.

The interval [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}] is the 100(1-\alpha)\% confidence interval of the population mean \mu.

The basis for this is the idea that we can find an interval that has a certain probability of containing the truth.

In other words, a 100(1-\alpha)\% confidence interval for a parameter \theta is an interval [l, u] that has the property P(l \le \theta \le u) = 1-\alpha.

We can apply the same logic to the variance. Officer Jones is wondering about the true standard deviation \sigma, or the true variance \sigma^{2}. He is wondering if he can derive the interval for the true variance given his limited data.

How can we derive the confidence interval on the variance?

Remember what we did for deriving the confidence interval on the mean. We investigated the limiting distribution of \bar{x}, the sample mean, and used the probability statement P(l \le \theta \le u) = 1-\alpha on this distribution to derive the end-points, the upper and lower confidence limits.

Let’s do the same thing here.

What is the limiting distribution for sample variance?

.

.

.

Those who followed lesson 73 carefully will jump up to say that it is related to the Chi-square distribution.

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

Okay, no problem. We will do it again.

Let’s take the equation of the sample variance s^{2} and find a pattern in it.

s^{2} = \frac{1}{n-1} \sum(x_{i}-\bar{x})^{2}

Move the n-1 over to the left-hand side and do some algebra.

(n-1)s^{2} = \sum(x_{i}-\bar{x})^{2}

(n-1)s^{2} = \sum(x_{i} - \mu -\bar{x} + \mu)^{2}

(n-1)s^{2} = \sum((x_{i} - \mu) -(\bar{x} - \mu))^{2}

(n-1)s^{2} = \sum[(x_{i} - \mu)^{2} + (\bar{x} - \mu)^{2} -2(x_{i} - \mu)(\bar{x} - \mu)]

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + \sum (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)\sum(x_{i} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(\sum x_{i} - \sum \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2(\bar{x} - \mu)(n\bar{x} - n \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)(\bar{x} - \mu)

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} + n (\bar{x} - \mu)^{2} -2n(\bar{x} - \mu)^{2}

(n-1)s^{2} = \sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2}

Let’s divide both sides of the equation by \sigma^{2}.

\frac{(n-1)s^{2}}{\sigma^{2}} = \frac{1}{\sigma^{2}}(\sum(x_{i} - \mu)^{2} - n (\bar{x} - \mu)^{2})

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - \frac{n}{\sigma^{2}} (\bar{x} - \mu)^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} = \sum(\frac{x_{i} - \mu}{\sigma})^{2} - (\frac{\bar{x} - \mu}{\sigma/\sqrt{n}})^{2}

The right-hand side is now a sum of squared standard normal random variables.

\frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2} ; sum of squares of (n - 1) standard normal random variables.

As we learned in lesson 53, if there are n standard normal random variables, Z_{1}, Z_{2}, ..., Z_{n}, their sum of squares is a Chi-square distribution with n degrees of freedom.

Its probability density function is f(\chi)=\frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n}{2}-1)!} for \chi > 0 and 0 otherwise.

Since we have \frac{(n-1)s^{2}}{\sigma^{2}} = Z_{1}^{2} + Z_{2}^{2} + Z_{3}^{2} + ... + Z_{n}^{2} - Z^{2}

\frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with (n-1) degrees of freedom.

f(\frac{(n-1)s^{2}}{\sigma^{2}}) = \frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n-1}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n-1}{2}-1)!}

Depending on the degrees of freedom, the distribution of \frac{(n-1)s^{2}}{\sigma^{2}} looks like this.

Smaller sample sizes imply lower degrees of freedom, and the distribution will be highly skewed (asymmetric).

Larger sample sizes, i.e., higher degrees of freedom, push the distribution toward symmetry.
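If you would like to convince yourself of this result by simulation, here is a minimal Monte Carlo sketch in Python with numpy (the lesson itself does not do this; the population parameters below are made up). Drawing many samples of size n from a normal population and computing \frac{(n-1)s^{2}}{\sigma^{2}} each time should reproduce the mean (n-1) and variance 2(n-1) of a Chi-square distribution with (n-1) degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(42)
n, mu, sigma = 20, 50.0, 2.5          # hypothetical population, for illustration
trials = 200_000

samples = rng.normal(mu, sigma, size=(trials, n))
s2 = samples.var(axis=1, ddof=1)      # sample variance for each trial
stat = (n - 1) * s2 / sigma**2        # should follow chi-square with (n-1) dof

# A chi-square with (n-1) dof has mean (n-1) and variance 2(n-1)
print(stat.mean(), stat.var())
```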


We can now apply the probability equation and define a 100(1-\alpha)\% confidence interval for the true variance \sigma^{2}.

P(\chi_{l,n-1} \le \frac{(n-1)s^{2}}{\sigma^{2}} \le \chi_{u,n-1}) = 1-\alpha

\alpha is between 0 and 1. You know that a 95% confidence interval means \alpha = 0.05, and a 99% confidence interval means \alpha = 0.01.

\chi_{l,n-1} and \chi_{u,n-1} are the lower and upper critical values from the Chi-square distribution with n-1 degrees of freedom.

The main difference here is that the Chi-square distribution is not symmetric. We should calculate both the lower limit and the upper limit that correspond to a certain level of probability.

Take for example, a 95% confidence interval. We need \chi_{l,n-1} and \chi_{u,n-1}, the quantiles that yield a 2.5% probability in the right tail and 2.5% probability in the left tail.

P(\chi \le \chi_{l,n-1}) = 0.025

and

P(\chi > \chi_{u,n-1}) = 0.025

Going back to the probability equation
P(\chi_{l,n-1} \le \frac{(n-1)s^{2}}{\sigma^{2}} \le \chi_{u,n-1}) = 1-\alpha

With some rearrangement within the inequality, we get

P(\frac{(n-1)s^{2}}{\chi_{u,n-1}} \le \sigma^{2} \le \frac{(n-1)s^{2}}{\chi_{l,n-1}}) = 1-\alpha

The interval [\frac{(n-1)s^{2}}{\chi_{u,n-1}}, \frac{(n-1)s^{2}}{\chi_{l,n-1}}] is called the 100(1-\alpha)\% confidence interval of the population variance \sigma^{2}.

We can get the square roots of the confidence limits to get the confidence interval on the true standard deviation.

The interval [\sqrt{\frac{(n-1)s^{2}}{\chi_{u,n-1}}}, \sqrt{\frac{(n-1)s^{2}}{\chi_{l,n-1}}}] is called the 100(1-\alpha)\% confidence interval of the population standard deviation \sigma.

Let’s now go back to the data and construct the confidence interval on the variance and the standard deviation.

Officer Jones has a sample of 20 vehicles. n=20. The sample variance s^{2} is 5.9 and the sample standard deviation s is 2.43.

Since n = 20, \frac{(n-1)s^{2}}{\sigma^{2}} follows a Chi-square distribution with 19 degrees of freedom. The lower and upper critical values at the 95% confidence interval \chi_{l,19} and \chi_{u,19} are 8.90 and 32.85.
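These critical values come from a Chi-square table or software. The lesson uses R; a quick sketch with Python's scipy (an alternative tool the lesson does not use) gives essentially the same numbers, up to rounding.

```python
from scipy.stats import chi2

df = 19                       # n - 1, with n = 20
lower = chi2.ppf(0.025, df)   # 2.5% probability in the left tail
upper = chi2.ppf(0.975, df)   # 2.5% probability in the right tail

# Close to the 8.90 and 32.85 quoted in the lesson
print(round(lower, 2), round(upper, 2))
```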

You must be wondering about the tedium of computing these quantiles every time there is a new sample.

That problem is solved for us. Like the t-table, the upper tail probabilities of the Chi-square distribution for various degrees of freedom are tabulated in the Chi-square table.

You can get them from any statistics textbook, or a simple internet search on “Chi-square table” will do. Here is an example.

The first column is the degrees of freedom. The subsequent columns are the computed values for P(\chi > \chi_{u}), the right tail probability.

For our example, since we are interested in the 95% confidence interval, we look under df = 19 and identify \chi^{2}_{0.975} = 8.90 as the lower limit that yields a probability of 2.5% in the left tail, and \chi^{2}_{0.025} = 32.85 as the upper limit that yields a probability of 2.5% in the right tail.

Substituting these back into the confidence interval equation, we get

P(\frac{19*5.90}{32.85} \le \sigma^{2} \le \frac{19*5.90}{8.90}) = 0.95

The 95% confidence interval for the true variance is [3.41, 12.59].

If we take the square root of these limits, we get the 95% confidence interval for the true standard deviation.

P(1.85 \le \sigma \le 3.54) = 0.95

The 95% confidence interval for the true standard deviation is [1.85, 3.54].

So, to finally answer Officer Jones's question: at the 95% confidence level, the deviation in vehicle speed can be as large as 3.54 mph and as low as 1.85 mph. The interval itself is random. As you know by now, the real interpretation is:

If we have a lot of different samples, and if we compute the 95% confidence intervals for these samples using the sample variance s^{2} and the critical limits from the Chi-square distribution, in the long-run, 95% of these intervals will contain the true value of \sigma^{2}.

A few vehicles must have sneaked by in the short run.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 74 – Deriving confidence from t


What we know so far

The 100(1-\alpha)\% confidence interval for the true mean \mu is [\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}].

If the sample size n is very large, we can substitute the sample standard deviation s in place of the unknown \sigma.

However, for small sample sizes, the sample standard deviation s is itself subject to error. In other words, it may be far from the true value of \sigma. Hence, we cannot assume that \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} will tend to a normal distribution, which was the basis for deriving the confidence intervals in the first place.

Last week’s mathematical excursion took us back in time, introduced us to “Student” and the roots of his famous contribution, the Student’s t-distribution.

“Student” derived the frequency distribution of \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} to be a t-distribution with (n-1) degrees of freedom. We paddled earnestly through a stream of functions to arrive at this.

f(t) = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}}

The probability of t within any limits is fully known if we know n, the sample size of the experiment. The function is symmetric with an expected value and variance of: E[T] = 0 and V[T] = \frac{n-1}{n-3}.

While the t-distribution resembles the standard normal distribution Z, it has heavier tails, i.e., it has more probability in the tails than the normal distribution. As the sample size increases (n \to \infty) the t-distribution approaches Z.
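You can see this convergence numerically. In the Python sketch below (scipy is an assumption; the lesson works with R), the upper-tail 2.5% critical value of the t-distribution starts well above the normal's 1.96 for small degrees of freedom and shrinks toward it as the degrees of freedom grow.

```python
from scipy.stats import norm, t

# Upper-tail 2.5% critical values: heavier t tails mean larger
# critical values, shrinking toward the normal's 1.96 as df grows
dfs = (5, 10, 30, 100, 1000)
t_crits = [t.ppf(0.975, df) for df in dfs]
z_crit = norm.ppf(0.975)   # 1.96

for df, tc in zip(dfs, t_crits):
    print(df, round(tc, 3))
print("normal:", round(z_crit, 3))
```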


Can we derive the confidence interval from the t-distribution?

Let’s follow the same logic used to derive the confidence interval from the normal distribution. The only difference now will be that

\frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \sim t

Suppose we are interested in deriving the 95% confidence interval for the true mean \mu, we can use the probability rule

P(-t_{0.025,(n-1)} \le \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \le t_{0.025,(n-1)}) = 0.95

It is equivalent to saying that there is a 95% probability that the variable \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} is between -t_{0.025,(n-1)} and t_{0.025,(n-1)}, where t_{0.025,(n-1)} is the value with a 2.5% probability in the upper tail of the t-distribution.

t_{0.025,(n-1)} is like Z_{0.025}. While Z_{0.025} = 1.96 regardless of the sample, t_{0.025,(n-1)} will depend on the sample size n.

Notice I am using (n-1), the degrees of freedom in the subscript for t to denote the fact that the value will be different for a different sample size.

To generalize, we can define a 100(1-\alpha)\% confidence interval for the true mean \mu using the probability equation

P(-t_{\frac{\alpha}{2},(n-1)} \le \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \le t_{\frac{\alpha}{2},(n-1)}) = 1-\alpha

\alpha is between 0 and 1. For a 95% confidence interval, \alpha = 0.05, and for a 99% confidence interval, \alpha = 0.01. Like how the Z-critical value is denoted using Z_{\frac{\alpha}{2}}, the t-critical value can be denoted using t_{\frac{\alpha}{2},(n-1)}.

We can modify the inequality in the probability equation to arrive at the confidence interval.

P(-t_{\frac{\alpha}{2},(n-1)} \le \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \le t_{\frac{\alpha}{2},(n-1)}) = 1-\alpha

Multiplying throughout by \frac{s}{\sqrt{n}}, we get

P(-t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}} \le \bar{x}-\mu \le t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}) = 1-\alpha

Subtracting \bar{x} and multiplying by -1 throughout, we get

P(\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}) = 1-\alpha

This interval [\bar{x} - t_{\frac{\alpha}{2},(n-1)}\frac{s}{\sqrt{n}}, \bar{x} + t_{\frac{\alpha}{2},(n-1)} \frac{s}{\sqrt{n}}] is called the 100(1-\alpha)\% confidence interval of the population mean.

As we discussed before in lesson 72, the interval itself is random since it is derived from \bar{x} and s. A different sample will have a different \bar{x} and s, and hence a different interval or range.


Solving Jenny’s problem

Let us develop the 95% confidence intervals of the mean water quality at Rockaway Beach. Jenny was using the data from the ROCKAWAY BEACH 95TH – 116TH. She identified 48 data points (n = 48) from 2005 to 2018 that exceeded the detection limit.

The sample mean \bar{x} is 7.9125 counts per 100 ml. The sample standard deviation s is 6.96 counts per 100 ml.

The 95% confidence interval of the true mean water quality (\mu) is

\bar{x} - t_{0.025,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.025,(n-1)} \frac{s}{\sqrt{n}}

How can we get the value for t_{0.025,(n-1)}, the critical value with a 2.5% probability in the upper tail of the t-distribution with (n-1) degrees of freedom?

.

.

.

You must have started integrating the function f(t) = \frac{(\frac{n-2}{2})!}{\sqrt{\pi(n-1)}} \frac{1}{(\frac{n-3}{2})!} \frac{1}{(1+\frac{t^{2}}{n-1})^{\frac{n}{2}}} into its cumulative distribution function F(t).

Save the effort. These are calculated already and are available in a table. It is popular as the t-table. You can find it in any statistics textbook, or simply type “t-table” in any search engine and you will get it. There may be slight differences in how the table is presented. Here is an example.

This table shows the right-sided t-distribution critical value t_{\frac{\alpha}{2},(n-1)}. Since the t-distribution is symmetric, the left tail critical values are -t_{\frac{\alpha}{2},(n-1)}. You must have noticed that the last row is indicating the confidence level 100(1-\alpha)\%.

Take, for instance, a 95% confidence interval, \frac{\alpha}{2}=0.025. The upper tail probability p is 0.025. From the table, you look into the sixth column under 0.025 and slide down to df=n-1. For instance, if we had a sample size of 10, the degrees of freedom are df = 9 and the t-critical (t_{0.025,9}) will be 2.262; like this:

Since our sample size is 48, the degrees of freedom df = 47.

In the sixth column under upper tail probability 0.025, we should slide down to df = 47. Since there is no value for df = 47, we should interpolate from the values 2.021 (df = 40) and 2.009 (df = 50).

The t-critical value for 95% confidence interval and df = 47 is 2.011. I got it from R. We will see how in an R lesson later.
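Until we get to that R lesson, a one-liner in Python's scipy (a stand-in for the R call) gives the same critical value, and the neighbouring table entries bracket it.

```python
from scipy.stats import t

df = 47                  # n - 1, with n = 48
tc = t.ppf(0.975, df)    # upper-tail 2.5% critical value
print(round(tc, 3))      # close to the 2.011 quoted above

# The table entries used for interpolation bracket this value
print(round(t.ppf(0.975, 40), 3), round(t.ppf(0.975, 50), 3))
```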

This table also provides Z_{\frac{\alpha}{2}} in the last row. See z^{*}=1.96 for a 95% confidence interval.

Let’s compute the 95% confidence interval for the mean water quality.

\bar{x} - 2.011\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + 2.011\frac{s}{\sqrt{n}}

7.9125 - 2.011\frac{6.96}{\sqrt{48}} \le \mu \le 7.9125 + 2.011\frac{6.96}{\sqrt{48}}

5.892 \le \mu \le 9.933
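The same three lines of arithmetic in code (a Python sketch with scipy, mirroring the lesson's R computation), using the summary numbers n = 48, \bar{x} = 7.9125, and s = 6.96:

```python
from scipy.stats import t

n, x_bar, s = 48, 7.9125, 6.96
tc = t.ppf(0.975, n - 1)            # about 2.011
margin = tc * s / n ** 0.5          # half-width of the interval

lo, hi = x_bar - margin, x_bar + margin
print(round(lo, 3), round(hi, 3))   # about 5.892 and 9.933
```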

Like in the case of the confidence interval derived from the normal distribution, if we have a lot of different samples, and if we compute the 95% confidence intervals for these samples using the sample mean (\bar{x}), sample standard deviation (s) and the t-critical from the t-distribution, in the long-run, 95% of these intervals will contain the true value of \mu.

Here is how eight different 95% confidence intervals look relative to the truth. These eight intervals are constructed based on the samples from eight different locations. In the long-run, 5% of the intervals will not contain the true mean for 95% confidence intervals.

I am also showing the confidence intervals derived from the normal distribution and the known-variance assumption (\sigma=9.95). They are shown in green.

Can you spot anything?

How do they compare?

What can we learn about the width of the intervals derived from the normal distribution (Z) and the t-distribution?

Is there anything that is related to the sample size?

Think about it until we come back with a lesson in R for confidence intervals.

There are other applications of the t-distribution that we will learn in due course of time.

Remember that you will cross a “t” whenever there is error.

I will end with these notes from Fisher in his paper “Student” written in 1939 in remembrance of W.S. Gosset.

“Five years, however, passed, without the writers in Biometrika, the journal in which he had published, showing any sign of appreciating the significance of his work. This weighty apathy must greatly have chilled his enthusiasm.”

“The fruition of his work was, therefore, greatly postponed by the lack of appreciation of others. It would not be too much to ascribe this to the increasing dissociation of theoretical statistics from the practical problems of scientific research.”

It is now 110 years since he published his famous work using a pseudo name “Student.” Suffice it to say that “Student” and his work will still be appreciated 110 years from now, and people will derive confidence from the “t.”

