Lesson 65 – Most Likelihood

“Maybe there is one estimate.”

Jill is staring at the plot on his desktop. The 6×6 cubicle and the gloomy weather outside are certainly not helping his concentration. Jill, the new intern at the Department of Environmental Protection, is working on developing estimates for the probability of water quality violations in summer at Pier 84 in NYC’s Hudson River Park. To toy around with the idea, his boss gave him ‘sample data’ for seven days, August 1 – August 7, 2016.

“I should perhaps develop two estimates; the probability of violation during the day and probability of violation during the night.”

Dissolved oxygen (DO) is used as an indicator by the New York State Department of Environmental Conservation to measure ecological health. A typical standard for acceptable water quality is a DO greater than 5 mg/L.

Jill saved the Excel file, unplugged the pen drive from his computer, tossed it in his backpack along with a few other folders, and did something on his phone. He took a deep breath, got up, and walked out the front door of his office.

It was around 6 pm when he stepped out.

♦♦♦

A black Ford with an Uber sticker on the front pane stopped in front of an apartment building at 96th Street and Amsterdam Avenue. Jill is back at his apartment. Joe, his roommate, has already ordered Chinese. They plan to work on Jill’s problem over the evening, along with writing their end-of-semester project reports. The floor in the living room is covered with books, papers of all sorts, pens, and markers. Along the glazed window facing Amsterdam Avenue, there are half-burnt candles. It is their finals week.

“I think two separate estimates will be appropriate,” said Jill to Joe as they both look at the plot on Jill’s laptop.

“The sample data I have has 672 data points of DO measured every 15 minutes. I have to develop estimates for p = P(DO < 5). I am thinking of dividing the data into daytime observations and nighttime observations and providing p_{day} and p_{night}. But a single estimate for all the data, p_{full}, may also be okay.”

“Is there reason to believe that there can be two separate estimates?” asked Joe.

“Yes. Did you see that DO is higher during the day and low during the night? I have done some reading on this and found that such a daily cycle is due to the day-night change in phytoplankton growth. The DO increases at sunrise as phytoplankton begin photosynthesis and produce oxygen. During the late afternoon and evening, the chlorophyll starts decreasing, and during the night, it is low.”

“I started with some simple assumptions on the day and night measurements: 6 am to 6 pm is daytime, and 6 pm to 6 am is nighttime. In the ‘sample data,’ there are 364 daytime measurements and 308 nighttime measurements. During the daytime, there are 185 violations, i.e., readings when DO is less than 5 mg/L. During the nighttime, I found 233 violations. So,

p_{day} = \frac{185}{364} = 0.51

p_{night} = \frac{233}{308} = 0.76.

However, a small part of me is saying I should count all the data together and provide

p_{full} = \frac{418}{672} = 0.62,

since the overall violations are 418 for the 672 measurements.

I am not able to decide.”

“Why don’t you test both your assumptions using the maximum likelihood approach?” Joe suggested. “Since the maximum likelihood estimation method computes the parameters of the population distribution that agree most closely with the sample observations, you can test your data with these two formulations. The formulation that gives the most likelihood will win. What say?”

“That is an interesting thought. How do we set up such a test?” asked Jill as he cracks open his fortune cookie.

“Simple. Since you are looking for violations, it is safe to assume a Bernoulli distribution for the data: X = 1 if DO < 5 mg/L and X = 0 otherwise, with a parameter p, the probability of a violation. Your sample data [x_{1}, x_{2}, ..., x_{n}] can be used to find the optimal parameter that maximizes the likelihood (joint probability of occurrence) of observing the sample.”
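(If you want to see Joe’s point on a computer, here is a minimal R sketch. The 0/1 data are made up, not Jill’s sample; optimize() simply searches for the p that maximizes the Bernoulli log-likelihood, and the peak lands at r/n.)

# A minimal sketch with made-up 0/1 data (not the Pier 84 sample).
# The Bernoulli log-likelihood is ln(L) = r*ln(p) + (n - r)*ln(1 - p),
# and its maximum sits at p = r/n.
set.seed(1)
x <- rbinom(20, size = 1, prob = 0.6)    # hypothetical violation indicators
r <- sum(x)
n <- length(x)

loglik <- function(p) r * log(p) + (n - r) * log(1 - p)

# numerical maximization of the log-likelihood over p in (0, 1)
p_hat <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum

p_hat    # close to r/n, the closed-form maximum likelihood estimate
r / n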

As Joe stretches to grab a piece of paper, Jill speaks up.

“Okay, let’s say there are two different estimators of the probability of violation: p_{full} assumes that all the data come from the same Bernoulli distribution, and p_{day}, p_{night} assume that there are different Bernoulli distributions for daytime samples and nighttime samples. You mean to say we can test these two theories?”

“Precisely,” said Joe as he shows his scribble on the paper.

Assumption 1: Full data comes from the same population.

Assumption 2: Partition the data into night and day.

Jill seems to be excited. He is already visualizing himself presenting this testing framework to his boss. “Let’s compute the log-likelihood for the first assumption.” He pulls up his laptop to open lesson 63 of the data analysis classroom where Joe and Devine discussed the likelihood function for the Bernoulli distribution.

“The log-likelihood function for the Bernoulli distribution is ln(L) = rln(p)+(n-r)ln(1-p). n is the number of samples, 672 in our study; r is the number of 1s in the sample data, the 418 total violations; and p is the parameter, the probability of violation. p_{full}=0.62.”

“Yes, and the log-likelihood is -445.59,” said Joe as he completes the equation on the paper.

Assumption 1: Full data comes from the same population.

ln(L_{1}) = r_{full}ln(p_{full})+(n_{full}-r_{full})ln(1-p_{full})

ln(L_{1}) = 418*ln(0.62)+(672 - 418)ln(0.38)

ln(L_{1}) = -445.59
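(For anyone following along in R, a quick check of this number using only the counts from the sample would look something like this.)

# Assumption 1: one Bernoulli population for all the data
# counts from the sample: n = 672 measurements, r = 418 violations
n_full <- 672
r_full <- 418
p_full <- r_full / n_full                                   # about 0.62

logL1 <- r_full * log(p_full) + (n_full - r_full) * log(1 - p_full)
logL1                                                       # about -445.6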

Jill gives an agreeable nod. “How can we write the log-likelihood equation for the second assumption?” he asks immediately.

Joe wastes no time. He pens down a few quick lines.

L_{2} = {\displaystyle \prod_{i=1}^{n_{day}}f(x_{i}|p_{day})}{\displaystyle \prod_{j=1}^{n_{night}}f(x_{j}|p_{night})}

L_{2} = p_{day}^{r_{day}}*(1-p_{day})^{n_{day}-r_{day}}*p_{night}^{r_{night}}*(1-p_{night})^{n_{night}-r_{night}}

“If the data is partitioned into night and day, there will be two functions, one for the day and one for the night. If the sample is independent and identically distributed, the overall likelihood is the product of the two equations,” he says. “Be watchful with the assumption of independence, though. You have to check for this.”

Jill stares at it for a bit, looks up at the ceiling, and seems to concur with the thought.

“Let me write the log-likelihood function for this,” he says as he starts writing the equation on the paper.

ln(L_{2}) = r_{day}ln(p_{day}) + (n_{day}-r_{day})ln(1 - p_{day}) + r_{night}ln(p_{night}) + (n_{night}-r_{night})ln(1 - p_{night}) 

ln(L_{2}) = 185ln(0.51) + (364 - 185)ln(0.49) + 233ln(0.76) + (308 - 233)ln(0.24)

ln(L_{2}) = -423.24

“So, ln(L_{1}) = -445.59 and ln(L_{2}) = -423.24. ln(L_{2}) > ln(L_{1}). Clearly, Assumption 2 seems to be more appropriate based on the data and the maximum likelihood method.”
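(A similar check in R for the second assumption, again using only the day and night counts, lands near the same value and confirms the comparison.)

# Assumption 2: separate Bernoulli populations for daytime and nighttime
n_day   <- 364; r_day   <- 185
n_night <- 308; r_night <- 233

p_day   <- r_day / n_day             # about 0.51
p_night <- r_night / n_night         # about 0.76

logL2 <- r_day * log(p_day) + (n_day - r_day) * log(1 - p_day) +
  r_night * log(p_night) + (n_night - r_night) * log(1 - p_night)

logL2                                # about -423.2, larger than logL1 of about -445.6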

“Yes. You can show this work tomorrow and ask for more data from your boss,” said Joe as he grabs some papers from the floor.

“Absolutely. I will prepare a few graphics with data and equations and show it to him. I think he will like it.”

So, as the night grows late, Joe starts his project report, and Jill prepares for his meeting tomorrow.

♦♦♦

A thought about the precision of the estimate comes to Jill’s mind as he types up his equations.

“The maximum likelihood estimation method seems to be an elegant way to come up with parameters for probability distributions. The likelihood itself can be used to test assumptions, as we did here. But, are we confident enough that the maximization produces precise estimates? I mean, maximization is like finding the peak of a function. How can we be confident that the value where we find the peak is very precise?” asked Jill.

“Well, since you have more than 600 data points, your estimate will be precise. Precision is a function of the sample size. Smaller sample sizes will yield less precise estimates,” Joe said without looking up from his writing.

“How do you mean?”

“Okay. Let’s take your example, the log-likelihood function for the Bernoulli distribution: ln(L) = rln(p)+(n-r)ln(1-p). \hat{p}=\frac{r}{n} is the maximum likelihood estimate of p. For a given sample X = [x_{1}, x_{2}, ..., x_{n}], \hat{p} maximizes the likelihood (joint probability of occurrence) of observing the sample. However, for smaller sample sizes, the parameter is not estimated very precisely.”

“Can you illustrate it?”

“Sure. Give me a second.” Joe types up a few lines of code in his RStudio console and creates a figure. “Look at this,” he says.

“This graphic shows the log-likelihood for various values of p and a sample size of n = 10. Imagine you had a sample of 10 observations with 6 or 7 violations. The peak of the function is around p = 0.6, but the function is flat in that region. So pinpointing the peak at 0.62 is not possible; the estimate is less precise.”

“To see this better, I computed the drop of the log-likelihood at every point from its maximum value, i.e., the difference in the log-likelihood at every point from the peak. This difference (slope) is another function that shows the precision. At the best estimate, around 0.6, the slope function should be very steep to say that the estimate is precise; the function is peaky at the best estimate. If the slope function is flat, the estimate is less precise.”

“I created this difference function for n = 20, 50, and 200 to see how the slope changes with increasing sample size. This is the graphic you see here with four different function lines. Can you see any pattern?”
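(Joe’s few lines of code might have looked something like the sketch below. The violation counts here are assumed, roughly 62 percent at each sample size, so the curves are only illustrative.)

# Bernoulli log-likelihood over a grid of p values
p <- seq(0.01, 0.99, by = 0.01)
bern_loglik <- function(p, n, r) r * log(p) + (n - r) * log(1 - p)

# log-likelihood for a small sample: n = 10 with 6 violations (assumed)
plot(p, bern_loglik(p, n = 10, r = 6), type = "l",
     xlab = "p", ylab = "log-likelihood",
     main = "Bernoulli log-likelihood, n = 10")

# drop from the peak (difference from the maximum) for increasing sample sizes
plot(NULL, xlim = c(0, 1), ylim = c(-40, 0),
     xlab = "p", ylab = "log-likelihood minus its maximum",
     main = "The peak gets sharper as the sample size grows")
ns <- c(10, 20, 50, 200)
for (i in seq_along(ns)) {
  r <- round(0.62 * ns[i])                  # assumed violation count
  ll <- bern_loglik(p, ns[i], r)
  lines(p, ll - max(ll), col = i, lty = i)
}
legend("bottomright", legend = paste("n =", ns), col = 1:4, lty = 1:4)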

Jill squints as he looks at the figures. “Yes, I see the slope function is flat for n = 10 and gets peaky as n increases. It is very steep around 0.62 for n = 200. The more samples we have, the more precise estimates we can get.”

“Should I include this too when I talk tomorrow?”

“Why not. The more information you present, the better. It shows your thought process and rigor.”

“Also, show that the maximum likelihood estimation method has an invariance property.”

“What is that?”

“If \hat{\theta} is the maximum likelihood estimator for a parameter \theta, then the maximum likelihood estimator of any function g(\theta) is g(\hat{\theta}), the same function g of the estimator.”

“Think about variance and standard deviation, for example. We know that the maximum likelihood estimator for the variance is \hat{\sigma}^{2}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}. The standard deviation, \sigma = \sqrt{\sigma^{2}}, is a function of the variance. So, the maximum likelihood estimator for the standard deviation is \hat{\sigma}=\sqrt{\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}}.”
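(A small numeric illustration of the invariance property, using simulated normal data rather than anything from Jill’s file: the square root of the variance MLE matches what you get by maximizing the likelihood over the standard deviation directly.)

# simulated normal data, nothing to do with the DO sample
set.seed(2)
x <- rnorm(500, mean = 7, sd = 2)
n <- length(x)

# MLE of the variance (using the sample mean in place of mu)
var_hat <- sum((x - mean(x))^2) / n
sd_hat  <- sqrt(var_hat)                 # MLE of sigma, by invariance

# direct check: maximize the normal log-likelihood over sigma
negloglik <- function(sigma) -sum(dnorm(x, mean = mean(x), sd = sigma, log = TRUE))
sd_direct <- optimize(negloglik, interval = c(0.1, 10))$minimum

c(sd_hat, sd_direct)                     # the two agree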

“Since you have large sample sizes, while you discuss properties, you can also throw in the fact that the maximum likelihood method produces estimates (\hat{\theta}) that are approximately normally distributed and approximately unbiased, with minimum variance.”

“Wait, you just threw in a few new words like bias and minimum variance, and you are asking me to throw them at my boss!”

Jill looks clearly lost. Maybe it is just too late in the night. They both go back to their work, hoping to sail through hurricane FINALS.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

