Lesson 78 – To Err is Human: Beware of simplicity

Mumble used the t-distribution to estimate the 99% confidence interval of the true vehicle speed on I95.

$\bar{x} - t_{0.005,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.005,(n-1)} \frac{s}{\sqrt{n}}$

As he mentioned to his boss, the sample mean ( $\bar{x}$ ) is 65 mph and the sample standard deviation (s) is 6 mph. For a sample of 50 vehicles (n = 50), the t-critical value, denoted by $t_{\frac{\alpha}{2},(n-1)}$ , for the 99% confidence interval is 2.679.

Using these values in the formula, we get

$65 - 2.679\frac{6}{\sqrt{50}} \le \mu \le 65 + 2.679\frac{6}{\sqrt{50}}$

$62.73 \text{ mph} \le \mu \le 67.27 \text{ mph}$
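The arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are mine, and the t-critical value of 2.679 is taken from the text as given rather than recomputed.

```python
import math

# Values quoted in the lesson; t_crit is taken as given, not recomputed here.
x_bar = 65.0     # sample mean, mph
s = 6.0          # sample standard deviation, mph
n = 50           # number of vehicles
t_crit = 2.679   # t_{0.005, 49} as quoted in the text

# Half-length of the interval, i.e. the margin of error
margin = t_crit * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(f"{lower:.2f} mph <= mu <= {upper:.2f} mph")   # 62.73 mph <= mu <= 67.27 mph
print(f"margin of error = {margin:.2f} mph")         # margin of error = 2.27 mph
```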

His boss immediately identified that the length of the confidence interval is 67.27 – 62.73 = 4.54 mph and that the margin of error, half of this length or 2.27 mph, is greater than 2 mph.

You know from last week’s lesson 77 that the margin of error e is half of the length of the confidence interval. Since $\bar{x}$ is the estimate of the true mean $\mu$ , the error in estimating $\mu$ using $\bar{x}$ can be defined as $e=|\bar{x}-\mu|$ .

I am using $z_{\frac{\alpha}{2}}$ and $\sigma$ instead of $t_{\frac{\alpha}{2},(n-1)}$ and $s$ because Mumble will probably make the same simplifying assumption that many estimators make, i.e., approximate the interval using the normal distribution instead of the t-distribution when estimating the sample size for a selected precision.

Half-length of the confidence interval is $z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$ . The margin of error e should be within this for a $100(1-\alpha)\%$ confidence interval.

With $e = z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$ as the bound, Mumble will solve this equation for n.

$n = (z_{\frac{\alpha}{2}}\frac{\sigma}{e})^{2}$

n is a function of the selected margin of error e, the standard deviation $\sigma$ and the confidence level $100(1-\alpha)\%$ .

For greater margins of error, the required sample size n is smaller for a selected confidence level and standard deviation $\sigma$ .

For larger variability in the data (i.e., larger values of $\sigma$ ), the required sample size n increases for a fixed margin of error and specified confidence level.

For greater confidence levels ( $100(1-\alpha)\%$ ), the required sample size increases for a fixed margin of error and standard deviation $\sigma$ .
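These three relationships can be seen by changing one input at a time. The sketch below is illustrative only: `required_sample_size` is a hypothetical helper name, the comparison values (2 mph, 7 mph, 95%) are my own choices, and the z-critical value comes from Python's `statistics.NormalDist` without rounding, so results can differ by a vehicle or so from hand calculations with a rounded table value.

```python
import math
from statistics import NormalDist

def required_sample_size(e, sigma, confidence):
    """Hypothetical helper: n = (z_{alpha/2} * sigma / e)^2, rounded up."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # z_{alpha/2}
    return math.ceil((z * sigma / e) ** 2)

# Baseline, then change one input at a time (comparison values are illustrative)
base = required_sample_size(e=1.0, sigma=6.0, confidence=0.99)
wider_error = required_sample_size(e=2.0, sigma=6.0, confidence=0.99)
more_spread = required_sample_size(e=1.0, sigma=7.0, confidence=0.99)
less_confidence = required_sample_size(e=1.0, sigma=6.0, confidence=0.95)

print("baseline (e=1, sigma=6, 99%):", base)
print("larger margin of error (e=2):", wider_error)
print("larger variability (sigma=7):", more_spread)
print("lower confidence level (95%):", less_confidence)
```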

For a chosen level of reliability of the estimate, i.e., the chosen confidence level, if we know how precise we need the interval to be, i.e., if we know the margin of error, we can come up with the required sample size.

Now, since his boss gave him an acceptable margin of error of 1 mph, Mumble will use e = 1 in this equation.

He will make two main assumptions. First, as mentioned above, he will use $z_{\frac{\alpha}{2}}$ instead of $t_{\frac{\alpha}{2},(n-1)}$ . In other words, he will use a z-critical of 2.58 instead of a t-critical of 2.68. Second, for $\sigma$ he will use the same value of 6 mph that he found originally as the sample standard deviation.

With these assumptions, $n = (2.58 \times \frac{6}{1})^{2} \approx 239.6$ , so his new sample size, rounded up, is 240 vehicles. He has to go and collect data from 190 more vehicles in order to keep the margin of error within 1 mph.
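As a quick check of these numbers, here is a minimal sketch using the rounded z-critical of 2.58 from the text. The variable names are mine, and the second calculation with an unrounded z from `statistics.NormalDist` is my addition for comparison.

```python
import math
from statistics import NormalDist

sigma = 6.0      # mph, the original sample standard deviation used for sigma
e = 1.0          # mph, acceptable margin of error
n_current = 50   # vehicles already observed

# Using the rounded z-critical quoted in the text
z_rounded = 2.58
n_required = math.ceil((z_rounded * sigma / e) ** 2)
additional = n_required - n_current
print(n_required, additional)   # 240 190

# Using an unrounded z-critical for a 99% confidence level
z_exact = NormalDist().inv_cdf(0.995)
n_exact = math.ceil((z_exact * sigma / e) ** 2)
print(round(z_exact, 4), n_exact)   # 2.5758 239
```

Rounding z up to 2.58 costs one extra vehicle here, which is harmless next to the other approximations in the calculation.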

The main reason he made these assumptions is that the problem needs a trial-and-error solution if he goes with t-critical and s, since both s and t-critical depend on n.
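To give a feel for what that trial and error looks like, here is a rough sketch that iterates on n with the t-critical value. It rests on assumptions that are mine, not the lesson's: s is held fixed at 6 mph (in practice it would change with new data), the iteration starts from the z-based 240, and the t-quantile is computed with a simple hand-rolled Simpson's rule plus bisection so that only the standard library is needed. All function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, integrating the density on [0, x] by Simpson's rule."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(alpha, df):
    """Upper alpha/2 critical value of the t-distribution, found by bisection."""
    target = 1 - alpha / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def sample_size_t(e, s, alpha, n_start):
    """Repeat n = ceil((t_{alpha/2, n-1} * s / e)^2) until n stops changing."""
    n = n_start
    for _ in range(50):   # cap the number of trials
        n_next = math.ceil((t_critical(alpha, n - 1) * s / e) ** 2)
        if n_next == n:
            break
        n = n_next
    return n

n_z = 240   # z-based sample size from the text
n_t = sample_size_t(e=1.0, s=6.0, alpha=0.01, n_start=n_z)
print("z-based n:", n_z, "| t-based n (s held at 6 mph):", n_t)
```

Because t-critical is larger than z-critical for any finite sample, the t-based answer should come out somewhat above the z-based one.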

Look at how his sample size will vary with different levels of error for three different values of standard deviation, 5 mph, 6 mph, and 7 mph, still using z-critical instead of t-critical.
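The same comparison can be tabulated with a short script. This is a sketch under stated assumptions: the grid of margins of error (0.5 to 2 mph) is my own illustrative choice, and the rounded z-critical of 2.58 is used throughout, as in the text.

```python
import math

z_crit = 2.58                   # rounded z-critical for 99%, as used in the text
errors = [0.5, 1.0, 1.5, 2.0]   # margins of error in mph (illustrative grid, my assumption)
sigmas = [5.0, 6.0, 7.0]        # standard deviations in mph, from the text

# Required sample size n = (z * sigma / e)^2, rounded up, for every combination
table = {
    sigma: [math.ceil((z_crit * sigma / e) ** 2) for e in errors]
    for sigma in sigmas
}

print("e (mph)  " + "".join(f"{e:>8}" for e in errors))
for sigma in sigmas:
    print(f"sigma={sigma:<3}" + "".join(f"{n:>8}" for n in table[sigma]))
```

Reading across a row shows n falling as the allowed error grows; reading down a column shows n rising with the standard deviation.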

The sample standard deviation is itself an estimate that will change as more vehicles are added; it does not shrink systematically with sample size the way the standard error $\frac{s}{\sqrt{n}}$ does. So by fixing the standard deviation at 6 mph, the one he obtained with 50 vehicles (smaller sample size), he may be overestimating or underestimating the new required sample size. Some error also remains because t-critical is larger than z-critical, so the z-based formula slightly understates n. Moreover, if the data is skewed with outliers or heavy tails, the estimate of the sample standard deviation will be inflated and the resulting n will be large. Now, if the cost of collecting more data is not an issue, Mumble will probably not be that worried about this level of error.

Nevertheless, beware of the assumptions. A lot of simplification goes into them.
I hope there is a way of computing the confidence intervals of different parameters without making a lot of assumptions ( $z, t, \chi^{2}$ ).

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
