Lesson 78 – To Err is Human: Beware of simplicity

Mumble used the t-distribution to estimate the 99% confidence interval of the true vehicle speed on I95.

\bar{x} - t_{0.005,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.005,(n-1)} \frac{s}{\sqrt{n}}

As he mentioned to his boss, the sample mean (\bar{x}) is 65 mph, the sample standard deviation (s) is 6 mph. For a sample of 50 vehicles (n = 50) the t-critical value denoted using t_{\frac{\alpha}{2},(n-1)} for the 99% confidence interval is 2.679.

Using these values in the formula, we get

65 - 2.679\frac{6}{\sqrt{50}} \le \mu \le 65 + 2.679\frac{6}{\sqrt{50}}

62.73 mph \le \mu \le 67.27 mph
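
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are mine, and the t-critical value is simply the 2.679 quoted in the lesson rather than something computed from the t-distribution.

```python
import math

# Values quoted in the lesson; t_crit is the tabulated t_{0.005, 49}.
x_bar = 65.0    # sample mean, mph
s = 6.0         # sample standard deviation, mph
n = 50          # number of vehicles in the sample
t_crit = 2.679

# Half-length of the interval, i.e. the margin of error.
margin = t_crit * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(f"margin of error: {margin:.2f} mph")
print(f"99% confidence interval: [{lower:.2f}, {upper:.2f}] mph")
```

The margin printed here is the half-length of the interval, which is the quantity Mumble's boss is concerned about.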

His boss immediately identified that the length of the confidence interval is 67.27 – 62.73 = 4.54 mph and that the margin of error is greater than 2 mph.

You know from last week’s lesson 77 that the margin of error e is half of the length of the confidence interval. Since \bar{x} is the estimate of the true mean \mu, the error in estimating \mu using \bar{x} can be defined as e=|\bar{x}-\mu|.

I am using z_{\frac{\alpha}{2}} and \sigma instead of t_{\frac{\alpha}{2},(n-1)} and s, as Mumble will probably make the same simplifying assumption that many estimators make, i.e., approximate the interval using the normal distribution instead of the t-distribution to estimate the sample size for a selected precision.

Half-length of the confidence interval is z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}. The margin of error e should be within this for a 100(1-\alpha)\% confidence interval.

With e = z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} as the bound, Mumble will solve this equation for n.

n = (z_{\frac{\alpha}{2}}\frac{\sigma}{e})^{2}

n is a function of the selected margin of error e, the standard deviation \sigma and the confidence level 100(1-\alpha)\%.

For greater margins of error, the required sample size n is smaller for a selected confidence level and standard deviation \sigma.

For larger variability in the data (i.e., larger values of \sigma), the required sample size n increases for a fixed margin of error and specified confidence level.

For higher confidence levels (100(1-\alpha)\%), the required sample size increases for a fixed margin of error and specified standard deviation.

For a chosen level of reliability of the estimate, i.e., the chosen confidence level, if we know how precise we need the interval to be, i.e., if we know the margin of error, we can come up with the required sample size.

Now, since his boss gave him an acceptable margin of error of 1 mph, Mumble will use e = 1 in this equation.

He will make two main assumptions. First, as mentioned above, he will use z_{\frac{\alpha}{2}} instead of t_{\frac{\alpha}{2},(n-1)}. In other words, he will use a z-critical of 2.58 instead of a t-critical of 2.68. Second, he will use the same value of 6 mph that he found originally as the sample standard deviation for \sigma.

With these assumptions, his new sample size is 240 vehicles. He has to go and collect data from 190 more vehicles in order to keep the margin of error within 1 mph.
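
Here is a minimal sketch of that calculation in Python. The function name `required_sample_size` is my own, and rounding up to the next whole vehicle is an assumption about how Mumble would round. The first call uses the rounded z-critical of 2.58 from the lesson; the second uses the unrounded value from the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def required_sample_size(z_crit, sigma, e):
    """Smallest whole n for which z_crit * sigma / sqrt(n) is at most e."""
    return math.ceil((z_crit * sigma / e) ** 2)

# z-critical rounded to 2.58, sigma = 6 mph, e = 1 mph, as in the lesson.
n_rounded_z = required_sample_size(2.58, 6.0, 1.0)

# Unrounded z-critical for a 99% confidence level (alpha = 0.01).
z_exact = NormalDist().inv_cdf(1 - 0.01 / 2)
n_exact_z = required_sample_size(z_exact, 6.0, 1.0)

print(n_rounded_z, n_rounded_z - 50)   # new sample size, extra vehicles needed
print(round(z_exact, 4), n_exact_z)
```

With z rounded to 2.58 this reproduces the 240 vehicles above (190 more than the original 50). With the unrounded z of about 2.5758 the answer is 239, so the one-vehicle difference comes purely from rounding the critical value.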

The main reason he made these assumptions is that the problem needs a trial and error solution if he goes with t-critical and s since s and t-critical are dependent on n.

Look at how his sample size will vary with different levels of error for three different values of standard deviation, 5 mph, 6 mph, and 7 mph, still using z-critical instead of t-critical.
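
A rough way to tabulate this comparison is sketched below. The three standard deviations are the ones named above; the list of margins of error is a hypothetical choice of mine, and the z-critical is the rounded 2.58 used earlier.

```python
import math

Z_CRIT = 2.58  # z-critical for 99% confidence, rounded as in the lesson

def required_sample_size(sigma, e, z_crit=Z_CRIT):
    """Sample size from n = (z * sigma / e)^2, rounded up to a whole vehicle."""
    return math.ceil((z_crit * sigma / e) ** 2)

sigmas = [5.0, 6.0, 7.0]           # standard deviations from the lesson, mph
margins = [0.5, 1.0, 1.5, 2.0]     # hypothetical margins of error, mph

# One row per margin of error, one column per standard deviation.
table = {e: [required_sample_size(s, e) for s in sigmas] for e in margins}

print("e (mph)  " + "  ".join(f"sigma={s:g}" for s in sigmas))
for e, row in table.items():
    print(f"{e:7.1f}  " + "  ".join(f"{n:7d}" for n in row))
```

Reading down a column shows n shrinking as the allowed error grows; reading across a row shows n growing with the standard deviation.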

An increase in the sample size decreases the standard deviation. So by fixing the standard deviation at 6 mph, the one he obtained with 50 vehicles (smaller sample size), he is overestimating the new required sample size. Of course, some error still remains here since the t-critical value is larger than the z-critical value. Moreover, if the data are skewed with outliers or heavy tails, the estimate of the sample standard deviation will be inflated and the resulting n will be large. Now, if the cost of collecting more data is not an issue, Mumble will probably not be that worried about this level of error.

Nevertheless, beware of the assumptions. A lot of simplification goes into them.

I hope there is a way of computing the confidence intervals of different parameters without making a lot of assumptions, z, t, \chi^{2}.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 76 – What is your confidence in polls?

Looking at these two recent polls, one wonders which one is truer and what confidence we can have in polls.

Of course, that is a rhetorical question. The real question is if these polls are measuring proportions (or probability) of people agreeing to something, what is the confidence interval of the true proportion or probability.

For example, the first poll tells us that 16% of Americans overall said that they would like to permanently move to another country if they could. The second poll tells us that 59% of adults in the US believe that 2019 will be a year of full or increasing employment. These two proportions are estimated from a selected sample of approximately 1000 U.S. adults. If so, what would be the confidence interval for the true proportion of all Americans who would answer these questions?

In lessons 71 to 75, we learned how to derive the confidence interval of the true mean, true variance, and true standard deviation, i.e., the population mean and population variance and standard deviation.

There are many applications, such as these example polls, that require estimation of proportions or the probability of occurrence of events. We already know that the maximum likelihood estimator of p, the probability of an event, is \hat{p}=\frac{r}{n} where r is the number of successes (times the event happened) in a sample of size n. In other words, the probability can be estimated as the proportion of occurrence in a Bernoulli sequence.

If \hat{p} is the estimate of the proportion, with a few assumptions, we can derive the confidence interval of the true proportion p.

Let’s learn how to do this using a simple polling exercise.

Assume that we want to estimate through a poll, the proportion of people who want to move out of the U.S. It will not be possible to ask everyone whether or not they will move out. However, we can take a sample, i.e., select a subset of the population and ask them for their preference.

In a sample of size n, the preference of the people can be represented by Bernoulli random variables X_{1}, X_{2}, X_{3}, …, X_{n} where X_{i} = 1 if a person wants to move out and 0 otherwise. If S_{n} = X_{1} + X_{2} + X_{3} + … + X_{n}, the proportion of people who wish to move out can be estimated as \hat{p} = \frac{S_{n}}{n}.

By now, you must be familiar with the idea that \hat{p} is a random variable since the estimate can change with a change in the sample. What assumption can we make for the distribution function of this random variable?

Since S_{n} is the sum of n independent random variables, for a large enough sample size n, the distribution function of S_{n} can be well-approximated by the normal distribution. Further, since \hat{p} is a linear function of S_{n}, the random variable \hat{p} can also be assumed to be normally distributed.

\hat{p} \sim N(E[\hat{p}], V[\hat{p}])

We can standardize \hat{p} and relate it to the standard normal distribution Z.

Z = \frac{\hat{p}-E[\hat{p}]}{\sqrt{V[\hat{p}]}} \sim N(0,1)

Before we proceed to derive the confidence interval, we should first derive the expected value and the variance of \hat{p}.

Expected Value E[\hat{p}]

\hat{p} = \frac{1}{n}\sum_{i=1}^{n}X_{i}

E[\hat{p}] = E[\frac{1}{n}\sum_{i=1}^{n}X_{i}] = \frac{1}{n}\sum_{i=1}^{n}E[X_{i}]

Since X_{i} is a Bernoulli random variable, E[X_{i}]=1(p) + 0(1-p) = p

E[\hat{p}]  = \frac{1}{n}\sum_{i=1}^{n}p

E[\hat{p}]  = \frac{1}{n}np = p

Variance V[\hat{p}]

V[\hat{p}] = V[\frac{1}{n}\sum_{i=1}^{n}X_{i}]

V[\hat{p}] = \frac{1}{n^{2}}\sum_{i=1}^{n}V[X_{i}]

V[X_{i}] = E[X_{i}^{2}] - (E[X_{i}])^{2}

E[X_{i}^{2}] = 1^{2}(p) + 0^{2}(1-p) = p

V[X_{i}] = p - p^{2} = p(1-p)

So,

V[\hat{p}] = \frac{1}{n^{2}}\sum_{i=1}^{n}p(1-p)

V[\hat{p}] = \frac{1}{n^{2}}np(1-p)

V[\hat{p}] = \frac{p(1-p)}{n}
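
As a sanity check on these two results, the sketch below computes the mean and variance of \hat{p} directly from the binomial distribution of S_{n} and compares them with p and \frac{p(1-p)}{n}. The function name and the values n = 50 and p = 0.16 are illustrative choices of mine, not part of the derivation.

```python
import math

def p_hat_moments(n, p):
    """Exact mean and variance of p_hat = S_n / n, where S_n ~ Binomial(n, p)."""
    # Probability of observing exactly k successes in n trials.
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum((k / n) * w for k, w in enumerate(pmf))
    var = sum((k / n - mean) ** 2 * w for k, w in enumerate(pmf))
    return mean, var

n, p = 50, 0.16   # hypothetical small poll
mean, var = p_hat_moments(n, p)

print("E[p_hat]:", mean, "vs p:", p)
print("V[p_hat]:", var, "vs p(1-p)/n:", p * (1 - p) / n)
```

The two pairs of numbers agree up to floating-point error, whatever n and p you try.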

Confidence Interval of p

Now, if we are interested in the 95% confidence interval of the true proportion p, we can use the standardized version of \hat{p} to say that there is a 95% probability that the standard normal variable \frac{\hat{p}-E[\hat{p}]}{\sqrt{V[\hat{p}]}} is between -1.96 and 1.96.

P(-1.96 \le \frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}} \le 1.96) = 0.95

We can rearrange this to obtain,

P(\hat{p} -1.96\sqrt{\frac{p(1-p)}{n}} \le p \le \hat{p} + 1.96\sqrt{\frac{p(1-p)}{n}}) = 0.95

Since the true p is unknown, we can use \hat{p} in place of p for the variance term.

This interval [\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}] is called the 95% confidence interval of the population proportion p. The interval itself is random since it is derived from \hat{p}. A different sample will have a different \hat{p} and hence a different interval or range.

There is a 95% probability that this random interval [\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}] contains the true value of p.

Put another way, if we use this method to estimate the confidence interval of p for a large number of samples we can expect that in about 95% of the samples the true value of p will be within the confidence interval obtained from the sample.
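
This repeated-sampling interpretation can be illustrated with a small simulation. The sketch below is only that, a sketch: the true proportion of 0.16, the sample size of 400, the 2000 repetitions and the seed are all hypothetical settings, and the function names are mine.

```python
import math
import random

def interval_95(p_hat, n, z=1.96):
    """95% interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def coverage(p_true, n, trials, seed=0):
    """Fraction of simulated polls whose interval contains p_true."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # One poll: n independent Bernoulli(p_true) answers.
        successes = sum(rng.random() < p_true for _ in range(n))
        low, high = interval_95(successes / n, n)
        hits += low <= p_true <= high
    return hits / trials

cov = coverage(p_true=0.16, n=400, trials=2000)
print(f"fraction of intervals containing the true p: {cov:.3f}")
```

The printed fraction should land close to 0.95, though not exactly on it, since both the normal approximation and the finite number of repetitions introduce some slack.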

Let’s now compute the 95% confidence interval for the proportions we saw in the two polls. 16% of Americans overall said that they would like to permanently move to another country if they could. 59% of adults in the US believe that 2019 will be a year of full or increasing employment. These two proportions are estimated from a selected sample of approximately 1000 U.S. adults.

95% confidence interval for the proportion of people who want to move out of the U.S.

[\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}]

[0.16 - 1.96\sqrt{\frac{(0.16)(0.84)}{1000}}, 0.16 + 1.96\sqrt{\frac{(0.16)(0.84)}{1000}}]

[0.137 \le p \le 0.183]

is the 95% confidence interval of the true proportion.

95% confidence interval for the proportion of people that believe that 2019 will be a year of full or increasing employment.

[0.59 - 1.96\sqrt{\frac{(0.59)(0.41)}{1000}}, 0.59 + 1.96\sqrt{\frac{(0.59)(0.41)}{1000}}]

[0.56 \le p \le 0.62]

is the 95% confidence interval of the true proportion.

We can generalize this to any confidence level by defining a 100(1-\alpha)% confidence interval for the true proportion p as [\hat{p} - Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}]
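
A minimal sketch of this general formula in Python follows. The function name `proportion_ci` is my own; the z-critical value comes from `statistics.NormalDist` in the standard library, and the sample size of 1000 is the approximate figure quoted for the two polls.

```python
import math
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """100(1 - alpha)% interval for p, using the normal approximation."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)          # z_{alpha/2}
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

polls = [("would move to another country", 0.16),
         ("expect full or increasing employment", 0.59)]

for label, p_hat in polls:
    for confidence in (0.95, 0.99):
        low, high = proportion_ci(p_hat, 1000, confidence)
        print(f"{label}, {confidence:.0%}: [{low:.3f}, {high:.3f}]")
```

At the 95% level this reproduces the two intervals computed above; at the 99% level the intervals are wider, as expected.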

Keep in mind that the main assumption behind this is that the estimate \hat{p} can be approximated by a normal distribution for a reasonably large sample size.

How do we know what size of the sample is sufficient? In the first graphic that showed the polls, I highlighted the margins of error. Can you guess what they are?

