Lesson 72 – Jenny’s confidence, on the average

Meet Jenny. Jenny is bright and intelligent and is known as “the problem solver” among her friends. She usually goes unnoticed in the crowd due to her calm and composed nature, but by god, she is assertive when it is most needed. Her analytical neurons are razor sharp and you cannot mumbo-jumbo with her. She loves science fiction, history, and occasional Woody Allen. Oh, and she likes swimming, surfing, summer, and beaches.

Her second summer Rockaway beach trip is coming up and she is excited. With all the surfing kit packed, she pulled up the NYC Beach Water Quality website to monitor the status. “The Enterococci Bacteria Count is within limit!” she thought.

She got busy with other things, but the thought of beach samples bothered her. “Where do they take these samples from? They show Rockaway beach, but does the sample represent the whole beach? How many samples do they take? If they use one sample, how do we know what the truth is? I have been to this place several times, I wonder what the true water quality of the beach is?” These questions kept her awake. I told you she is a problem solver.

The next morning she met Joe to discuss the problem. Friends of this classroom know who Joe is. He is now the resident expert on statistical topics.

So, you want to know the true water quality from the sample data. Why don’t you develop the confidence interval of the mean water quality? This will at least give you a range, an interval where the true water quality will be.

How can we do that using the sample? Or, maybe I should ask, can you explain what an interval is and how to derive it?

Okay. Let’s get some simple/sample data first. NYC Open Data should definitely have data on beaches. Here, they have a link for DOHMH Beach Water Quality Data. The description says that this is the data of water quality sample results collected by the Department of Health and Mental Hygiene from all New York City Beaches. Amazing! Which Rockaway beach are you going to?

I usually go to the one on the 95th Street and stay north.

We can take the data from ROCKAWAY BEACH 95TH – 116TH then. Let me show you how to write code in R to extract this subset of the data.

Hey, I can do that with ease.

I forgot to tell you that Jenny is a member of the local girls who code club. She pulls up her Macbook and types a few lines. Meanwhile, Joe does some coding too on the same data.

There are samples that had a result below the detection limit. If I remove those from the ROCKAWAY BEACH 95TH – 116TH data, we are left with 48 data points collected at various times from 2005 to 2018. Look.

Wonderful. I am guessing that the blue color triangle in the histogram is the sample mean ( $\bar{x}$ ).

Yes. This is an estimate (good guess) for the average Enterococci Bacteria count for this part of the beach based on 48 samples over several years.

We can describe the uncertainty in this estimate using confidence intervals. The key is to realize that this estimate, i.e., the sample mean ( $\bar{x}$ ), is a random variable. For instance, if we were to take another data sample from different times of the year or over different parts of the beach, we would get a slightly different sample mean. The value of the estimate changes from sample to sample, so we can think of the estimate as a range of values or a probability distribution.

Yes, that makes perfect sense, but what probability distribution does it follow?

The sample mean is given by the equation $\bar{x} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}$ .
The $x_{i}$ are independent and identically distributed random samples. Look at the equation carefully: it is a summation of random variables, and the distribution of a sum of independent random variables is the convolution of their distributions. We learned in Lesson 48 that for a large enough sample size, this summation converges to the normal distribution. This is the Central Limit Theorem. As the sample size grows (n becomes large), repeated convolution yields a smooth, center-heavy, thin-tailed bell function: the normal density.

Ah, I see. So it is reasonable to assume a normal distribution for the sample mean $\bar{x}$ .

Yes, $\bar{x}$ follows a normal distribution with an expected value $E[\bar{x}]$ and variance $V[\bar{x}]$ .

$E[\bar{x}]=\mu$ . The sample mean is an unbiased estimate of the true mean, so the expected value of the sample mean is equal to the truth. This is in Lesson 67.

The variance of the sample mean $V[\bar{x}] =\frac{\sigma^{2}}{n}$ . We derived this in Lesson 68. Variance tells us how widely the estimate is distributed around the center of the distribution.

The standard deviation of the sample mean, or the standard error of the estimate is $\frac{\sigma}{\sqrt{n}}$ .

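So, $\bar{x} \sim N(\mu, \frac{\sigma^{2}}{n})$ , that is, a normal distribution with mean $\mu$ and variance $\frac{\sigma^{2}}{n}$ (standard deviation $\frac{\sigma}{\sqrt{n}}$ ).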
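A quick way to convince yourself of this is to simulate it. The sketch below is in Python (my choice for illustration; it is not the code Jenny or Joe wrote) and draws many samples of size 48 from a deliberately skewed population, an Exponential distribution with mean 1 and standard deviation 1. The population, the number of trials, and the function name `simulate_sample_means` are all assumptions made for the example.

```python
import random
import statistics

def simulate_sample_means(n, trials, seed=0):
    """Draw `trials` samples of size n from Exponential(rate=1); return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(trials)
    ]

# Assumed population: Exponential(rate=1), so mu = 1 and sigma = 1.
mu, sigma, n = 1.0, 1.0, 48
means = simulate_sample_means(n=n, trials=5000)

standard_error = sigma / n ** 0.5
# Share of sample means that fall within 1.96 standard errors of mu.
within = sum(abs(m - mu) <= 1.96 * standard_error for m in means) / len(means)

print("mean of sample means:", round(statistics.fmean(means), 3))  # compare with mu
print("sd of sample means:  ", round(statistics.stdev(means), 3))  # compare with sigma/sqrt(n)
print("sigma / sqrt(n):     ", round(standard_error, 3))
print("share within 1.96 SE:", round(within, 3))                   # compare with 0.95
```

Even though the individual observations are far from normal, the simulated sample means should center on $\mu$ with a spread close to $\frac{\sigma}{\sqrt{n}}$ .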

Jenny looks at the equation for a bit. She takes a piece of paper, draws on it and instantly types a few lines of code.

Here, I am showing this visually.

$\bar{x}$ follows a normal distribution. It is centered on $\mu$ with a standard deviation of $\frac{\sigma}{\sqrt{n}}$ . The one standard deviation range is $\mu \pm \frac{\sigma}{\sqrt{n}}$ , the two standard deviations range is $\mu \pm 2\frac{\sigma}{\sqrt{n}}$ and the three standard deviations range is $\mu \pm 3\frac{\sigma}{\sqrt{n}}$ .

Yup. The standard normal way of saying this is $Z = \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0,1)$ .

The Z-score!

Yes, let me ask you something. What is the area under the normal density curve between -1.96 and 1.96?

Looking up the standard normal tables, we get $P(Z \le -1.96) = 0.025$ . By symmetry, the area in the right tail is the same, $P(Z \ge 1.96) = 0.025$ . The area between -1.96 and 1.96 is therefore $1 - 0.025 - 0.025 = 0.95$ . 95%.
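If you do not have a table handy, the same areas can be checked numerically. This is a small Python sketch using `statistics.NormalDist` from the standard library; the variable names are just for illustration.

```python
from statistics import NormalDist

z = NormalDist(mu=0.0, sigma=1.0)      # the standard normal distribution

left_tail = z.cdf(-1.96)               # P(Z <= -1.96)
right_tail = 1.0 - z.cdf(1.96)         # P(Z >= 1.96), same as the left tail by symmetry
middle = z.cdf(1.96) - z.cdf(-1.96)    # P(-1.96 <= Z <= 1.96)

print(round(left_tail, 3), round(right_tail, 3), round(middle, 3))
# prints: 0.025 0.025 0.95
```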

There is a 95% probability that the standard normal variable $\frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}}$ is between -1.96 and 1.96.

$P(-1.96 \le \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} \le 1.96) = 0.95$

I will modify the inequality in this equation.

Multiplying throughout by $\frac{\sigma}{\sqrt{n}}$ , we get $P(-1.96 \frac{\sigma}{\sqrt{n}} \le \bar{x}-\mu \le 1.96 \frac{\sigma}{\sqrt{n}}) = 0.95$

Subtracting $\bar{x}$ and multiplying by -1 throughout, we get

$P(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}) = 0.95$
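The two steps behind this last line are worth spelling out, since the inequality signs flip along the way. Subtracting $\bar{x}$ from all three parts of the previous inequality gives

$-\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le -\mu \le -\bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$

Multiplying by -1 reverses the direction of both inequalities:

$\bar{x} + 1.96\frac{\sigma}{\sqrt{n}} \ge \mu \ge \bar{x} - 1.96\frac{\sigma}{\sqrt{n}}$

Read from right to left, this is the same statement with $\mu$ in the middle and the smaller end-point on the left. The probability stays at 0.95 because each step only rewrites the same event.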

We derived that the probability of the true population mean $\mu$ lying between two end-points $\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}$ and $\bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}$ is 0.95.

This interval $[\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}]$ is called the 95% confidence interval of the population mean. The interval itself is random since it is derived from $\bar{x}$ . As we discussed before, a different sample will have a different $\bar{x}$ and hence a different interval or range.

There is a 95% probability that this random interval $[\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}]$ contains the true value of $\mu$ .

Neat. So to generalize this to any confidence interval, we can replace 1.96 with a Z-critical value.

Yes. For instance, if we want a 99% confidence interval, we should find the Z-critical value that gives 99% area under the normal density curve.

That would be 2.58. So the 99% confidence interval for true mean $\mu$ is $\bar{x} - 2.58\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 2.58 \frac{\sigma}{\sqrt{n}}$ .

Usually, the confidence level is described using a confidence coefficient $1-\alpha$ where $\alpha$ is between 0 and 1. For a 95% confidence interval, $\alpha = 0.05$ , and for a 99% confidence interval, $\alpha = 0.01$ . The Z-critical value is denoted using $Z_{\frac{\alpha}{2}}$ .

Yes yes. For 95% confidence interval, $Z_{\frac{\alpha}{2}} = Z_{0.025} = 1.96$ .
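The Z-critical value for any $\alpha$ can be read off the inverse of the standard normal distribution function instead of a table. Here is a minimal Python sketch; the helper name `z_critical` is made up for this example.

```python
from statistics import NormalDist

def z_critical(alpha):
    """Return Z_{alpha/2}: the point with area alpha/2 to its right under N(0, 1)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for alpha in (0.10, 0.05, 0.01):
    print(f"{100 * (1 - alpha):.0f}% confidence: Z = {z_critical(alpha):.2f}")
```

For $\alpha = 0.05$ and $\alpha = 0.01$ this should reproduce the 1.96 and 2.58 used in the conversation.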

In summary, we can define a $100(1-\alpha)$ % confidence interval for the true mean $\mu$ as

$\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$

Correct. $\bar{x}$ is the mean of a random sample of size n. The assumption is that the sample is drawn from a population with a true mean $\mu$ and true standard deviation $\sigma$ . The end-points $[\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}]$ are called the lower and upper confidence limits.
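Put together, the recipe is only a few lines of code. The sketch below is one way to write it in Python, assuming $\sigma$ is known; the function name `z_confidence_interval` and the input numbers are hypothetical, picked so the arithmetic is easy to follow by hand.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, alpha=0.05):
    """Return the (lower, upper) limits of the 100(1 - alpha)% interval for mu,
    assuming the population standard deviation sigma is known."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # Z_{alpha/2}
    margin = z * sigma / sqrt(n)                  # Z_{alpha/2} times the standard error
    return xbar - margin, xbar + margin

# Hypothetical inputs: the standard error is 4 / sqrt(64) = 0.5.
lower, upper = z_confidence_interval(xbar=10.0, sigma=4.0, n=64, alpha=0.05)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
# prints: 95% CI: [9.02, 10.98]
```

Passing `alpha=0.01` widens the interval, since a higher confidence level needs a larger Z-critical value.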

Let us develop the confidence intervals of the mean water quality. The sample I have has n = 48 data points. The sample mean is $\bar{x}=7.9125$ counts per 100 ml.

.
.
.
Wait, we don’t know the value of $\sigma$ , the true standard deviation. There is still an unknown in the equation.

Assume it is 9.95 counts per 100 ml.

Where did that come from?

Take my word for now and get the confidence interval.

Jenny reluctantly does some calculations on a piece of paper.

The 95% confidence interval for the mean water quality is

$\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$ .

$7.9125 - 1.96\frac{9.95}{\sqrt{48}} \le \mu \le 7.9125 + 1.96\frac{9.95}{\sqrt{48}}$ .

$5.098 \le \mu \le 10.727$

There is a 95% probability that the true mean water quality will be between 5.098 and 10.727. Now tell me where the 9.95 came from.

Well, strictly speaking, your statement should have been, “there is a 95% probability of selecting a sample for which the confidence interval will contain the true value of $\mu$ .”

Let me explain. $5.098 \le \mu \le 10.727$ can be true or can be false depending on the sample we obtained. $\bar{x}$ is a random variable with a probability distribution whose mean is $\mu$ . Our sample mean represents a random draw from this distribution. So, if we got a sample whose mean is close to the truth, the interval $[\bar{x} - Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \bar{x} + Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}]$ will contain the truth. If we got a sample whose mean is somewhat far away from the truth, its confidence interval may not contain the truth. This graphic should make it clearer. The green interval does not contain the true value. The purple interval does. Either sample mean is a possible draw; which one we get depends on the sample data.

Hmm. So, we should think of this phenomenon as a long-run relative frequency outcome. If we have a lot of different samples, and compute the 95% confidence intervals for these samples, in the long-run, 95% of these intervals will contain the true value of $\mu$ .

Exactly. The 95% confidence level is what would happen if a large number of random intervals were constructed; not for any particular interval.
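This long-run reading can be checked with a simulation. The Python sketch below repeatedly draws samples from a population whose mean we know because we made it up, builds the 95% interval for each sample, and counts how often the interval contains the true mean. The normal population with mean 10 and standard deviation 5 is a hypothetical stand-in, not the beach data, and `coverage` is a name invented for this example.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, trials, alpha=0.05, seed=1):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sigma / sqrt(n)   # same half-width for every sample, since sigma is known
    hits = 0
    for _ in range(trials):
        xbar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / trials

# Hypothetical population: normal with mean 10 and standard deviation 5.
frac = coverage(mu=10.0, sigma=5.0, n=48, trials=10000)
print(f"fraction of intervals containing mu: {frac:.3f}")
```

The fraction should land close to 0.95, while any single interval either contains $\mu$ or it does not.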

So while you were coding to get the subset of data from the ROCKAWAY BEACH 95TH – 116TH, I downloaded all the data that had ROCKAWAY BEACH in the file. There are eight locations along the beach where these samples are taken. Assuming that these are eight different samples, I developed the 95% confidence intervals for each of these samples. Here is how they look relative to the truth. One of them does not contain the true $\mu$ . Like this, in the long-run, 5% of the 95% confidence intervals will not contain the true mean.

How did you get the true value for $\mu$ ?

Based on the principle of consistency.

$\displaystyle{\lim_{n\to\infty} P(|T_{n}(\theta)-\theta|>\epsilon)} = 0$ .

As n approaches infinity, the sample estimate approaches the true parameter. I took data from the eight beach locations and computed the overall mean and overall standard deviation. While they are not exactly the true values, based on the idea of consistency, we can assume that they are.

$\mu = 8.97$ and $\sigma = 9.95$ counts per 100 ml.
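The idea of consistency that Joe leans on can also be seen in a small simulation. The Python sketch below draws from a made-up population with a known mean (an Exponential distribution with mean 10, chosen only for illustration) and shows the sample mean computed from more and more observations.

```python
import random
from statistics import fmean

rng = random.Random(42)
true_mean = 10.0   # hypothetical population mean, known only because we simulate

# Exponential population with mean 10 (rate = 1 / mean).
draws = [rng.expovariate(1.0 / true_mean) for _ in range(100_000)]

for n in (10, 100, 1_000, 10_000, 100_000):
    estimate = fmean(draws[:n])   # sample mean of the first n observations
    print(f"n = {n:>6}: sample mean = {estimate:.3f}, "
          f"error = {abs(estimate - true_mean):.3f}")
```

The error does not shrink in a perfectly steady way, but it tends toward zero as n grows, which is why a large pooled sample is a reasonable stand-in for the truth.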

There you go. I see why you insisted on using $\sigma = 9.95$ .

You are welcome!

But we will not always have such a large sample. Most often, the data is limited. How can we know what the true standard deviation is?

In that case, you can use the sample standard deviation. In your sample data case it was 6.96 counts per 100 ml, I think.

Hmm, but that is very different from the true value. Aren’t we inducing more error into the estimation of the intervals?
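To get a feel for how much this choice matters for Jenny's sample, here is a rough Python comparison of the 95% half-widths using the two values mentioned in the conversation, 9.95 and 6.96 counts per 100 ml, with n = 48. It only swaps one number for the other in the same formula; whether 1.96 is still the right multiplier when the sample standard deviation is used is not addressed here.

```python
from math import sqrt

n = 48
z = 1.96   # 95% Z-critical value used in the lesson

# Standard deviations quoted in the conversation (counts per 100 ml).
candidates = {
    "sigma (overall, eight locations)": 9.95,
    "s (Jenny's sample)": 6.96,
}

# Half-width of the interval: Z times the standard error.
half_widths = {label: z * sd / sqrt(n) for label, sd in candidates.items()}

for label, hw in half_widths.items():
    print(f"{label}: half-width = {hw:.3f} counts per 100 ml")
```

The smaller standard deviation gives a noticeably narrower interval, which is exactly what Jenny is worried about.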

Perhaps. Hey, do you want to take a break? Can I buy you a drink? Is Guinness okay?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
