Lesson 84 – Beyond a reasonable doubt: Introducing hypothesis tests

Tom grew up in the City of Ruritania. He went to high school there, met his wife there, and has a happy home. Ruritania, the land of natural springs, is known for its pristine water. Lately, he has been sensing a decline in the quality of his tap water. It is a scary feeling, as the consequences are slow to show. He has started attributing the poor water quality to a new factory in his neighborhood.

You are Tom’s best friend. Seeing his concern, you took it upon yourself to evaluate whether the new factory in Tom’s neighborhood has reduced the water quality compared to the historical water quality in the area. How would you test whether there is a significant deviation from the historical average water quality?

Tom’s house is alongside the west branch of the Mohawk River and is downstream of this factory. You checked the EPA specifications for dissolved oxygen concentration in the river; the EPA requires a minimum average concentration of 2 mg/L. Over the next 10 days, you collected 10 water samples from the west branch. In mg/L, your data reads like this.

1.8, 2, 2.1, 1.7, 1.2, 2.3, 2.5, 2.9, 1.9, 2.2.

Is the river water quality satisfactory by the EPA standards?
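
Before any formal test, it helps to summarize the ten readings. The snippet below is a minimal sketch in Python using only the standard library; the variable names are made up for illustration, and it only describes the sample. It does not answer the question on its own, because a sample average close to 2 mg/L could still be the result of chance.

```python
import math
import statistics

# The ten dissolved oxygen readings from the west branch, in mg/L.
do_mg_per_l = [1.8, 2.0, 2.1, 1.7, 1.2, 2.3, 2.5, 2.9, 1.9, 2.2]

# EPA minimum average concentration quoted in the lesson, in mg/L.
epa_minimum = 2.0

n = len(do_mg_per_l)
sample_mean = statistics.mean(do_mg_per_l)   # average of the readings
sample_sd = statistics.stdev(do_mg_per_l)    # sample standard deviation (n - 1)
standard_error = sample_sd / math.sqrt(n)    # variability of the sample mean

# How many individual readings fall below the EPA minimum?
below_minimum = sum(1 for x in do_mg_per_l if x < epa_minimum)

print(f"n = {n}")
print(f"sample mean = {sample_mean:.2f} mg/L")
print(f"sample standard deviation = {sample_sd:.2f} mg/L")
print(f"standard error of the mean = {standard_error:.2f} mg/L")
print(f"readings below {epa_minimum} mg/L: {below_minimum}")
```

The sample mean sits just above the minimum, while several individual readings fall below it. Whether that is enough evidence either way is exactly what a hypothesis test is meant to settle.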

During this investigation, you also collected water samples from the east branch of the Mohawk. The east branch is farther from the factory. How would you determine whether the concentrations of the contaminant (if any) in the two branches are similar or different?

I think you are grasping the significance of the issues here. You need evidence beyond a reasonable doubt.

For Tom, you have collected data to make inferences about the underlying process. The data sample represents that process. You may have some prior idea of what happened or what changed. Let’s say you have a hypothesis.

You can test your hypothesis or prior assumption/idea by performing hypothesis tests using the data collected.

A hypothesis is a statement about something you are observing. In our language, it is a statement about one or more parameters of the population.

The hypothesis has to be substantiated with evidence provided from the data.

Statistical hypothesis tests provide a quantitative way of substantiating a belief, rejecting it, or modifying the original hypothesis. In other words, a hypothesis test is a quantitative approach to determine whether your speculation can be substantiated.

The strength of the evidence can be measured and you can decide on whether or not to reject the hypothesis based on some risk measure, the risk that your decision may be incorrect.

Take Tom’s issue for instance. You may collect data on the current water quality and test it against the historical average water quality. You may start with a prior belief, a hypothesis that the water quality did not change since the inception of the new factory, i.e., there is no difference between current water quality and historical water quality in Ruritania. Any odd differences you see in the data are purely by chance. The non-existence of any difference is your null hypothesis.

You set this up against the alternate hypothesis that there is a change in the water quality since the inception of the factory, and it may not be purely due to chance. If there is evidence of a significant change, you can reject the prior belief, i.e., you can reject the null hypothesis for the alternate.

If there is no significant change, we cannot reject the null hypothesis. We continue to believe that the odd differences that were observed may be due to chance. This way, a change is linked to its cause indirectly, by first ruling out chance as the explanation. Anyone who uses this data and the testing method should arrive at the same result.
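
To make "purely by chance" concrete, here is a rough sketch in Python (standard library only) that reuses the ten dissolved oxygen readings from earlier. It rests on several assumptions that are not part of the lesson: a reference average of 2.0 mg/L standing in for the null hypothesis, normally distributed readings, a spread equal to the sample standard deviation, and a risk level of 0.05. It pretends the null hypothesis is true, generates many imaginary sets of ten readings, and counts how often chance alone produces a gap from the reference at least as large as the one observed. It is an illustration of the reasoning, not the formal tests covered in the coming lessons.

```python
import random
import statistics

random.seed(84)  # fixed seed so the sketch gives the same answer on every run

samples = [1.8, 2.0, 2.1, 1.7, 1.2, 2.3, 2.5, 2.9, 1.9, 2.2]

# ASSUMPTION: reference average under the null hypothesis (no change), in mg/L.
reference_mean = 2.0
# ASSUMPTION: the process spread equals the sample standard deviation.
assumed_sd = statistics.stdev(samples)
# ASSUMPTION: acceptable risk of wrongly rejecting the null hypothesis.
risk_level = 0.05

# Gap between what we observed and what the null hypothesis claims.
observed_gap = abs(statistics.mean(samples) - reference_mean)

n_sim = 10_000
as_extreme = 0
for _ in range(n_sim):
    # One imaginary set of readings from a world where nothing changed
    # (ASSUMPTION: readings are normally distributed).
    fake = [random.gauss(reference_mean, assumed_sd) for _ in samples]
    if abs(statistics.mean(fake) - reference_mean) >= observed_gap:
        as_extreme += 1

# Share of imaginary samples whose gap is at least as large as the observed one.
chance_fraction = as_extreme / n_sim

# Reject the null hypothesis only if chance rarely produces such a gap.
reject_null = chance_fraction < risk_level

print(f"observed gap = {observed_gap:.2f} mg/L")
print(f"fraction of chance-only samples with a gap this large = {chance_fraction:.2f}")
print("reject the null hypothesis" if reject_null else "cannot reject the null hypothesis")
```

Under these assumptions, chance alone produces a gap this large in well over half of the imaginary samples, so the sketch cannot reject the null hypothesis. A different reference value, spread, or risk level could change that conclusion.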

Over the next few weeks, we will learn the main concepts of hypothesis tests. We will learn how to test sample data against a true value and how to test whether or not there are differences in two or more groups of data. We will also learn the various types of hypothesis tests: parametric, where the data are assumed to follow a particular distribution, and non-parametric, where we go distribution-free and let the data explain the underlying differences.

So brace yourself for this exciting investigative journey where we try to disprove or refine our beliefs amidst uncertainty. “Beyond a reasonable doubt” is our credo. Whether you want to be Mr. Sherlock Holmes or Dr. Watson or Brother William of Baskerville is up to you.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
