Lesson 65 – Most Likelihood

“Maybe there is one estimate.”

Jill is staring at the plot on his desktop. The 6×6 cubicle and the gloomy weather outside are certainly not helping his concentration. Jill, the new intern at the Department of Environmental Protection, is working on developing estimates for the probability of water quality violations in summer at Pier 84 in NYC’s Hudson River Park. To toy around with the idea, his boss gave him ‘sample data’ for seven days, August 1 – August 7, 2016.

“I should perhaps develop two estimates; the probability of violation during the day and probability of violation during the night.”

Dissolved oxygen (DO) is used as an indicator by the New York State Department of Environmental Conservation to measure ecological health. A typical standard for acceptable water quality is DO greater than 5 mg/L.

Jill saved the Excel file, unplugged the pen drive from his computer, tossed it into his backpack along with a few other folders, and did something on his phone. He took a deep breath, got up, and walked out the front door of his office.

It was around 6 pm when he stepped out.

♦♦♦

A black Ford with an Uber sticker on the front pane stopped in front of an apartment building at 96th Street and Amsterdam Avenue. Jill is back at his apartment. Joe, his roommate, has already ordered Chinese. They plan to work on Jill’s problem over the evening, along with writing their end-of-semester project reports. The floor in the living room is covered with books, papers of all sorts, pens, and markers. Along the glazed window facing Amsterdam Avenue, there are half-burnt candles. It is their finals week.

“I think two separate estimates will be appropriate,” said Jill to Joe as they both look at the plot on Jill’s laptop.

“The sample data I have has 672 data points of DO measured every 15 minutes. I have to develop estimates for p = P(DO < 5). I am thinking of dividing the data into daytime observations and nighttime observations and providing p_{day} and p_{night}. But a single estimate for all the data, p_{full}, may also be okay.”

“Is there reason to believe that there can be two separate estimates?” asked Joe.

“Yes. Did you see that DO is higher during the day and lower during the night? I have done some reading on this and found that such a daily cycle is due to the day-night change in phytoplankton growth. The DO increases at sunrise as phytoplankton begin photosynthesis and produce oxygen. During the late afternoon and evening, the chlorophyll starts decreasing, and during the night, it is low.”

“I started making some simple assumptions on the day and night measurements: 6 am to 6 pm is daytime, and 6 pm to 6 am is nighttime. In the ‘sample data,’ there are 364 daytime measurements and 308 nighttime measurements. During the daytime, there are 185 violations, i.e., measurements where DO is less than 5 mg/L. During the nighttime, I found 233 violations. So,

p_{day} = \frac{185}{364} = 0.51

p_{night} = \frac{233}{308} = 0.76.

However, a small part of me is saying I should count all the data together and provide

p_{full} = \frac{418}{672} = 0.62,

since the overall violations are 418 for the 672 measurements.

I am not able to decide.”
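In R, a minimal sketch of these three candidate estimates, using only the counts described above (the step of reading the Excel file is left out), might look like this:

# counts from the 'sample data' described in the story
n_day   <- 364; r_day   <- 185    # daytime measurements and violations
n_night <- 308; r_night <- 233    # nighttime measurements and violations

p_day   <- r_day / n_day                           # ~ 0.51
p_night <- r_night / n_night                       # ~ 0.76
p_full  <- (r_day + r_night) / (n_day + n_night)   # ~ 0.62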

“Why don’t you test both your assumptions using the maximum likelihood approach?” Joe suggested. “Since the maximum likelihood estimation method computes the parameters of the population distribution which agree most closely with the sample observations, you can test your data with these two formulations. The formulation that gives the most likelihood will win. What say?”

“That is an interesting thought. How do we set up such a test?” asked Jill as he cracks open his fortune cookie.

“Simple. Since you are looking for violations, it is safe to assume a Bernoulli distribution for the data: X = 1 if DO < 5 mg/L and X = 0 if DO > 5 mg/L, with a parameter p, the probability of a violation. Your sample data [x_{1}, x_{2}, ..., x_{n}] can be used to find the optimal parameter that maximizes the likelihood (joint probability of occurrence) of observing the sample.”

As Joe stretches to grab a piece of paper, Jill speaks up.

“Okay, let’s say there are two different estimators to estimate the probability of violation; p_{full} assumes that all the data comes from the same Bernoulli distribution, and p_{day},p_{night} assumes that there are different Bernoulli distributions for nighttime samples and daytime samples. You mean to say, we can test these two theories?”

“Precisely,” said Joe as he shows his scribble on the paper.

Assumption 1: Full data comes from the same population.

Assumption 2: Partition the data into night and day.

Jill seems to be excited. He is already visualizing himself presenting this testing framework to his boss. “Let’s compute the log-likelihood for the first assumption.” He pulls up his laptop to open lesson 63 of the data analysis classroom where Joe and Devine discussed the likelihood function for the Bernoulli distribution.

“The log-likelihood function for Bernoulli distribution is ln(L) = rln(p)+(n-r)ln(1-p). n is the number of samples, 672 in our study, r is the number of 1s in the sample data, 418 total violations, and p is the parameter, the probability of violation. p_{full}=0.62.”

“Yes, and the log-likelihood is -445.59,” said Joe as he completes the equation on the paper.

Assumption 1: Full data comes from the same population.

ln(L_{1}) = r_{full}ln(p_{full})+(n_{full}-r_{full})ln(1-p_{full})

ln(L_{1}) = 418*ln(0.62)+(672 - 418)ln(0.38)

ln(L_{1}) = -445.59
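In R, a quick sketch that reproduces this number from the counts above:

# Log-likelihood under Assumption 1: one Bernoulli parameter for all the data
n_full <- 672; r_full <- 418
p_full <- r_full / n_full
logL1  <- r_full * log(p_full) + (n_full - r_full) * log(1 - p_full)
logL1   # about -445.6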

Jill gives an agreeable nod. “How can we write the log-likelihood equation for the second assumption?” he asks immediately.

Joe wastes no time. He pens down a few quick lines.

L_{2} = {\displaystyle \prod_{i=1}^{n_{day}}f(x_{i}|p_{day})}{\displaystyle \prod_{i=1}^{n_{night}}f(x_{i}|p_{night})}

L_{2} = p_{day}^{r_{day}}*(1-p_{day})^{n_{day}-r_{day}}*p_{night}^{r_{night}}*(1-p_{night})^{n_{night}-r_{night}}

“If the data is partitioned into night and day, there will be two functions, one for the day and one for the night. If the sample is independent and identically distributed, the overall likelihood is the product of the two equations,” he says. “Be watchful with the assumption of independence, though. You have to check for it.”

Jill stares at it for a bit, looks up at the ceiling, and seems to concur with the thought.

“Let me write the log-likelihood function for this,” he says as he starts writing the equation on the paper.

ln(L_{2}) = r_{day}ln(p_{day}) + (n_{day}-r_{day})ln(1 - p_{day}) + r_{night}ln(p_{night}) + (n_{night}-r_{night})ln(1 - p_{night}) 

ln(L_{2}) = 185ln(0.51) + (364 - 185)ln(0.49) + 233ln(0.76) + (308 - 233)ln(0.24)

ln(L_{2}) = -423.24
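And the corresponding sketch in R for the partitioned data:

# Log-likelihood under Assumption 2: separate Bernoulli parameters for day and night
n_day <- 364; r_day <- 185; n_night <- 308; r_night <- 233
p_day <- r_day / n_day; p_night <- r_night / n_night
logL2 <- r_day * log(p_day) + (n_day - r_day) * log(1 - p_day) +
         r_night * log(p_night) + (n_night - r_night) * log(1 - p_night)
logL2   # about -423.2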

“So, ln(L_{1}) = -445.59 and ln(L_{2}) = -423.24. ln(L_{2}) > ln(L_{1}). Clearly, Assumption 2 seems more appropriate based on the data and the maximum likelihood method.”

“Yes. You can show this work tomorrow and ask for more data from your boss,” said Joe as he grabs some papers from the floor.

“Absolutely. I will prepare a few graphics with data and equations and show it to him. I think he will like it.”

So, as the night grows late, Joe starts on his project report, and Jill prepares for his meeting tomorrow.

♦♦♦

A thought about the precision of the estimate comes to Jill’s mind as he types up his equations.

“The maximum likelihood estimation method seems to be an elegant way to come up with parameters for probability distributions. The likelihood itself can be used to test assumptions, as we did here. But, are we confident enough that the maximization produces precise estimates? I mean, maximization is like finding the peak of a function. How can we be confident that the value where we find the peak is very precise?” asked Jill.

“Well, since you have 672 data points, your estimate will be precise. Precision is a function of sample size; smaller sample sizes will yield less precise estimates,” Joe said without pausing from his writing.

“How do you mean?”

“Okay. Let’s take your example: the log-likelihood function for the Bernoulli distribution, ln(L) = rln(p)+(n-r)ln(1-p). \hat{p}=\frac{r}{n} is the maximum likelihood estimate of p. For a given sample X = [x_{1}, x_{2}, ..., x_{n}], \hat{p} maximizes the likelihood (joint probability of occurrence) of observing the sample. However, for smaller sample sizes, the parameter is not estimated very precisely.”

“Can you illustrate it?”

“Sure. Give me a second.” Joe types up a few lines of code in his RStudio console and creates a figure. “Look at this,” he said.

“This graphic shows the log-likelihood, ln(L), for various values of p and a sample size n = 10. Imagine you had a sample of 10 observations with 6 or 7 violations. The peak of the function is at p = 0.6 or so, but the function is flat around that region. So, pinpointing the peak at exactly 0.62 is not possible; the estimate is less precise.”

“To see this better, I computed the difference between the log-likelihood at every point and its maximum value, i.e., the drop from the peak. This difference function is another way to see the precision. At the best estimate, around 0.6, the difference function should fall off steeply for the estimate to be precise; the function is peaky at the best estimate. If the difference function is flat, the estimate is less precise.”

“I created this difference function for n = 20, 50, and 200 to see how the slope changes with increasing sample size. This is the graphic you see here with four different function lines. Can you see any pattern?”
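A sketch of the kind of R code Joe might have typed (hypothetical, not the original script) to build these figures:

# Bernoulli log-likelihood curves and their drop from the peak for
# increasing sample sizes, keeping the violation rate near 0.62
p <- seq(0.01, 0.99, by = 0.01)
loglik <- function(p, n, r) r * log(p) + (n - r) * log(1 - p)

# n = 10 with 6 violations: the peak is flat around p = 0.6
plot(p, loglik(p, n = 10, r = 6), type = "l",
     xlab = "p", ylab = "log-likelihood", main = "n = 10")

# drop from the peak for n = 10, 20, 50, 200
plot(NULL, xlim = c(0, 1), ylim = c(-40, 0),
     xlab = "p", ylab = "log-likelihood minus its maximum")
ns <- c(10, 20, 50, 200)
for (k in seq_along(ns)) {
  n <- ns[k]; r <- round(0.62 * n)
  ll <- loglik(p, n, r)
  lines(p, ll - max(ll), col = k)
}
legend("bottomright", legend = paste("n =", ns), col = seq_along(ns), lty = 1)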

Jill squints as he looks at the figures. “Yes, I see the difference function is flat for n = 10 and gets peaky as n increases. It is very steep at 0.62 for n = 200. The more samples we have, the more precise the estimates we can get.”

“Should I include this too when I talk tomorrow?”

“Why not. The more information you present, the better. It shows your thought process and rigor.”

“Also, show that the maximum likelihood estimation method has an invariance property.”

“What is that?”

“If \hat{\theta} is the maximum likelihood estimator for a parameter \theta, then the maximum likelihood estimator of any function g(\theta) is g(\hat{\theta}), the same function g of the estimator.”

“Think about variance and standard deviation for example. We know that the maximum likelihood estimator for variance \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}. Standard deviation \sigma = \sqrt{\sigma^{2}}, is a function of the variance. So, the maximum likelihood estimator for standard deviation is \hat{\sigma}=\sqrt{\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}}.”
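A small numerical illustration of the invariance property, a sketch with simulated data: maximizing the normal log-likelihood over \sigma directly lands on the square root of the closed-form variance estimate.

set.seed(1)
x  <- rnorm(100, mean = 3, sd = 2)
mu <- mean(x)

sigma2_hat <- mean((x - mu)^2)    # maximum likelihood estimate of the variance
sqrt(sigma2_hat)                  # g(theta-hat)

# maximize the log-likelihood over sigma directly (optimize minimizes, so negate)
negloglik <- function(sigma) -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
optimize(negloglik, interval = c(0.01, 10))$minimum   # matches sqrt(sigma2_hat)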

“Since you have large sample sizes, while you discuss properties, you can throw in the fact that the maximum likelihood method produces estimates (\hat{\theta}) that are approximately normally distributed and approximately unbiased, with minimum variance.”

“Wait, you just threw in a few new words like bias and minimum variance, and you are asking me to throw them at my boss!”

Jill looks clearly lost. It may also be that it is late in the night. They both go back to their work, hoping to sail through Hurricane FINALS.


Lesson 64 – More Likelihood

J: The maximum likelihood estimation method provides the formulation for computing the parameters of the population distribution which agree most closely with the sample observations. For a given sample X = [x_{1}, x_{2}, ..., x_{n}], the optimal parameter maximizes the likelihood (joint probability of occurrence) of observing the sample. Last week, we derived the parameters for the Binomial and the Poisson, both discrete distributions. I wonder if we can use the same logic for continuous distributions?

D: Hello Joe. You look recharged and ready to go. So let’s cut to the chase. Yes, we can apply the same joint probability logic to the continuous probability functions. Let’s take a random sample X = [x_{1}, x_{2}, ..., x_{n}] from a population with a probability density function f_{X}(x) as in this figure.

Assume we can divide the space into equal intervals of length dx. Can you tell me the probability that a given sample data point x_{1} falls in the range dx?

J: That would be the integral of f_{X}(x_{1})dx over the range.

D: Correct. For an infinitesimally small range dx, we can approximate this area as f_{X}(x_{1})dx. Since the sample observations are independent and identically distributed, the joint probability of observing them, or the likelihood function is

L = f_{X}(x_{1})dx*f_{X}(x_{2})dx*...*f_{X}(x_{n})dx

L = {\displaystyle \prod_{i=1}^{n}f_{X}(x_{i})}(dx)^{n}

Maximizing the above function is equivalent to maximizing L = {\displaystyle \prod_{i=1}^{n}f_{X}(x_{i})} since we have equal intervals of length dx.

J: I see. So, it reduced to the same likelihood we did last week; the product of the probability density function for different values of the sample. We can choose the value of the parameter(s) that maximizes this quantity.

D: Precisely. Here are Fisher’s own words from his paper.

J:  Shall we try some continuous distributions today?

D: We shall. Begin with Exponential Distribution.

J: Let me solve it. The probability function for the exponential distribution is f_{X}(x) = \lambda e^{- \lambda x}. So the likelihood function is

L = {\displaystyle \prod_{i=1}^{n}\lambda e^{- \lambda x_{i}}}

L = \lambda^{n}e^{-\lambda (x_{1}+x_{2}+...+x_{n})}

The log-likelihood will be

ln(L) = nln(\lambda) - \lambda (x_{1}+x_{2}+...+x_{n})

\lambda is the parameter we are trying to estimate by maximum likelihood. So take the derivative of the function ln(L) with respect to \lambda and equate it to 0 to solve for \lambda.

\frac{d ln(L)}{d \lambda}=0

\frac{n}{\lambda} - \sum_{i=1}^{n}x_{i}=0

\lambda = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}x_{i}} = \frac{1}{\bar{x}} = \frac{1}{E[X]}

The maximum likelihood estimator for the exponential distribution is \hat{\lambda} = \frac{1}{\bar{x}}.
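A sketch that checks this result numerically in R on simulated data (the simulation rate of 1.5 is an arbitrary choice):

set.seed(2)
x <- rexp(500, rate = 1.5)

# negative exponential log-likelihood, to be minimized
negloglik <- function(lambda) -(length(x) * log(lambda) - lambda * sum(x))
optimize(negloglik, interval = c(0.001, 10))$minimum   # close to 1 / mean(x)
1 / mean(x)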

D: Excellent. Another continuous function you can try is the Rayleigh density. It is the probability distribution of the distance r = \sqrt{x^{2}+y^{2}} from the origin to a point with coordinates (x, y) in an x-y plane, where X and Y are independent, zero-mean normal random variables with a common variance.

f(r) = \frac{r}{\theta}e^{-\frac{r^{2}}{2 \theta}}

\theta is the parameter.

Do you recognize this function? We learned a probability distribution which is similar to this.

J: Aaah… Chi-square? Sum of squares of normals.

D: Correct. Rayleigh is the distribution of the distance. The square root of the Chi-square distribution.

J: Let me go through the steps of estimating \theta using the maximum likelihood method.

L = {\displaystyle \prod_{i=1}^{n}\frac{r_{i}}{\theta}e^{-\frac{r_{i}^{2}}{2\theta}}}

L = \frac{1}{\theta^{n}} e^{- \frac{1}{2 \theta}\sum_{i=1}^{n}r_{i}^{2}} {\displaystyle \prod_{i=1}^{n}r_{i}}

ln(L) = -nln(\theta) + {\displaystyle \sum_{i=1}^{n}ln(r_{i})}-\frac{1}{2 \theta}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}

We have one parameter to estimate, so \frac{d ln(L)}{d \theta}=0.

-\frac{n}{\theta}+\frac{1}{2 \theta^{2}}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}=0

\theta = \frac{1}{2n}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}

The maximum likelihood estimator for the Rayleigh distribution is \hat{\theta} = \frac{1}{2n}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}
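A sketch that checks the estimator on simulated distances, assuming X and Y are zero-mean normals with variance \theta = 4 (so that r follows this Rayleigh density):

set.seed(3)
theta <- 4
x <- rnorm(1000, sd = sqrt(theta))
y <- rnorm(1000, sd = sqrt(theta))
r <- sqrt(x^2 + y^2)   # distances from the origin

sum(r^2) / (2 * length(r))   # close to 4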

D: Very well done. Now let’s flex our muscles. How can we estimate two or more parameters?

J: Like for example, the parameters of the normal distribution, \mu and \sigma?

D: Yes.

J:

D: Simple extension. Since there will be two parameters in the likelihood function of a normal distribution, we have to solve the set of equations derived through the partial derivatives. Look at these steps for the normal distribution.

f_{X}(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}}

L = {\displaystyle \prod_{i=1}^{n}} \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2}(\frac{x_{i}-\mu}{\sigma})^{2}}

L = (2\pi)^{-n/2}\sigma^{-n}e^{-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_{i}-\mu)^{2}}

ln(L)=-\frac{n}{2}ln(2\pi) - nln(\sigma) - \frac{1}{2\sigma^{2}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}

Now we can have two equations based on the partial derivatives of the log-likelihood function.

\frac{\partial ln(L)}{\partial \mu}=0

\frac{\partial ln(L)}{\partial \sigma}=0

Let’s take the first partial derivative.

\frac{\partial ln(L)}{\partial \mu}=0

\frac{\partial}{\partial \mu}(- \frac{1}{2\sigma^{2}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}) =0

\frac{\partial}{\partial \mu}(- \frac{1}{2\sigma^{2}}({\displaystyle \sum_{i=1}^{n} x_{i}^{2}} + {\displaystyle \sum_{i=1}^{n}\mu^{2}} - 2\mu{\displaystyle \sum_{i=1}^{n}x_{i}})) = 0

-\frac{1}{2\sigma^{2}} {\displaystyle \sum_{i=1}^{n}2\mu} + \frac{1}{\sigma^{2}} {\displaystyle \sum_{i=1}^{n}x_{i}}=0

n\mu = {\displaystyle \sum_{i=1}^{n}x_{i}}

\mu = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

Now, take the second partial derivative.

\frac{\partial ln(L)}{\partial \sigma}=0

\frac{\partial}{\partial \sigma}(-\frac{n}{2}ln(2\pi) - nln(\sigma) - \frac{1}{2\sigma^{2}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}) = 0

-\frac{n}{\sigma}+\frac{1}{\sigma^{3}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}=0

\sigma^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}

The maximum likelihood estimators for the Normal distribution are \hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} and \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}
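A sketch contrasting this variance estimator with R's built-in var(), which divides by n - 1 instead of n (simulated data, arbitrary true values):

set.seed(4)
x <- rnorm(50, mean = 10, sd = 3)

mu_hat     <- mean(x)
sigma2_mle <- mean((x - mu_hat)^2)   # maximum likelihood estimate, divides by n
sigma2_mle
var(x)                               # sample variance, divides by n - 1
sigma2_mle * 50 / 49                 # equals var(x) after rescaling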

J: That makes it very clear. There is some work involved in solving, but the procedure is straightforward.

D: You can try estimating the parameters of the lognormal distribution for practice.

J: Well, the lognormal has a very similar probability density function. Let me try the Gumbel instead. It is the double exponential distribution. It will be fun to solve.

D: Okay. Go for it.

J: The cumulative function for the Gumbel distribution is F(z)=e^{-e^{-\frac{z-\alpha}{\beta}}}. \alpha and \beta are the parameters.

Taking the derivative of this function, we can get the probability density function for the Gumbel distribution as

f(z)=\frac{1}{\beta}e^{-\frac{z-\alpha}{\beta}}e^{-e^{-\frac{z-\alpha}{\beta}}}

The likelihood function is

L = {\displaystyle \prod_{i=1}^{n} \frac{1}{\beta}e^{-\frac{z_{i}-\alpha}{\beta}}e^{-e^{-\frac{z_{i}-\alpha}{\beta}}}}

L = \beta^{-n} e^{-{\displaystyle \sum_{i=1}^{n} (\frac{z_{i}-\alpha}{\beta} + e^{-\frac{z_{i}-\alpha}{\beta}}) }}

ln(L) = -nln(\beta) -{\displaystyle \sum_{i=1}^{n} (\frac{z_{i}-\alpha}{\beta} + e^{-\frac{z_{i}-\alpha}{\beta}}) }

We can get the partial derivatives with respect to \alpha and \beta and solve the system of equations.

\frac{\partial ln(L)}{\partial \alpha} = \frac{n}{\beta} - \frac{1}{\beta}{\displaystyle \sum_{i=1}^{n}e^{-\frac{z_{i}-\alpha}{\beta}}}=0

\frac{\partial ln(L)}{\partial \beta} = -\frac{n}{\beta} + \frac{1}{\beta^{2}}{\displaystyle \sum_{i=1}^{n}(z_{i}-\alpha)} - \frac{1}{\beta^{2}}{\displaystyle \sum_{i=1}^{n}(z_{i}-\alpha) e^{-\frac{z_{i}-\alpha}{\beta}}}=0

D: Yes, I know. There is no easy closed-form solution for this system. You have to use iterative numerical techniques. Do you remember the Newton-Raphson method?

J:

D: It’s okay if you do not remember. We will let R do it for us. The moral is that the MLE method is not a panacea. We also run into issues of discontinuity in the likelihood function sometimes; think of the uniform distribution, for example. As you know, we cannot use calculus methods on discontinuous functions. A combination of the MLE method with numerical techniques and approximations is a good approach.
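A sketch of what letting R do it might look like, with simulated Gumbel data and a numerical search via optim() (the starting values are rough guesses; the notation follows the equations above):

set.seed(5)
alpha <- 10; beta <- 2
z <- alpha - beta * log(-log(runif(1000)))   # inverse-CDF sampling of a Gumbel

# negative Gumbel log-likelihood: n*ln(beta) + sum((z - a)/b + exp(-(z - a)/b))
negloglik <- function(par) {
  a <- par[1]; b <- par[2]
  if (b <= 0) return(Inf)
  length(z) * log(b) + sum((z - a) / b + exp(-(z - a) / b))
}

optim(par = c(mean(z), sd(z)), fn = negloglik)$par   # close to c(10, 2)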

J: I can see how things are building up. Are there any such pitfalls I need to pay attention to?

D: Yes, there are some properties, limitations and real-world applications we can learn. Do you want to do it now?

J: Let’s do it next week. I am afraid these equations are going to be my sweet dreams tonight.

To be continued…


Lesson 63 – Likelihood: modus operandi for estimation

D: Often used interchangeably with probability (guilty as charged), the term likelihood is properly reserved for inferring the parameters of the population.

J: Are we talking about using a sample to compute the true value of the parameter? You said last week that there are methods of estimation of these parameters. I was able to grasp that there are sample records, i.e., the data we have at hand, and there is a hypothetical population from where these samples originate.

D: That is correct. As you remember, there was an example of extreme wind speed data (sample). Our goal is to understand this bulk data using some probability function (GEV in this case) with few control parameters. The probability function represents the entire information contained in the sample data. It is the hypothetical population you mentioned, and the data we observe is a random sample from this population.

If you compute the mean (\bar{x}) and variance (s^{2}) of the sample data, they are good guesses (estimates or estimators) of the mean (\mu) and variance (\sigma^{2}) of the population. The formulation used to get the estimate is called an estimator.

J: Then what is likelihood? You are saying that it is a well-known method for estimation.

D: It is the joint probability of occurrence of the observations in any distribution. But before we delve into the mathematical expose, I think it will help if we take a simple example and work it out step by step using some graphics.

J: I like that idea. It helps if we know how the concept builds up, so we can go back to its roots when required.

D: So let’s begin. Imagine you ordered ten shirts from Amazon. Just imagine. Don’t even think of actually doing it.

J:

D: On arrival, you find that the fourth (red shirt), fifth (black shirt), and eighth (white shirt) are defective.

J:

D: Can you convert these ten observations into a Bernoulli sequence?

J: Sure. I will use 1 for defective shirts and 0 for shirts that are not defective. Then, we have

X = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0].

D: Great. Now, if these numbers (sample shirts) were coming from a hypothetical population (Bernoulli trials), we can say that there is a certain probability of defective items. Let’s call it p.

J: So, p will be the true probability of occurrence for this Bernoulli trial. What we have is a sample of ten values of 0’s and 1’s from this population.

D: Correct. Can you estimate p using the sample data?

J: Based on the ten data points, one guess would be 0.3 since there are three defective items in ten. But how do we know if 0.3 is the best estimate of the true probability?

D: Good question. For that, we need to know about likelihood. The joint probability of observing the sample we have.

J: Can you elaborate?

D: You told me that the Bernoulli sequence of these ten observations is X = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]. Can you tell me what the joint probability of observing this sequence is?

J: Sure. If we assume p for the probability of success, P(X = 1) = p and P(X = 0) = 1-p. Then, the joint probability of seeing X = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0] is

P(X=0)*P(X=0)*P(X=0)*P(X=1)*P(X=1)*P(X=0)*P(X=0)*P(X=1)*P(X=0)*P(X=0)=p^{3}(1-p)^{7}.

They are independent and identically distributed, so the joint probability is the product of the individual probabilities.

D: Great. Let’s write it concisely as L = p^{3}(1-p)^{7}.

I am using L to describe the likelihood.

We can now choose a value for p that will maximize L, i.e., the value of p that gives the highest likelihood. The brute-force way of doing it is to substitute possible values of p, compute L, and find the value of p where L is the highest.

J: Yes. Let me try that first.

L is maximum at \hat{p}=0.3.
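A sketch of that brute-force search in R:

# evaluate L = p^3 (1 - p)^7 on a grid of candidate values and pick the
# p with the highest likelihood
p <- seq(0.01, 0.99, by = 0.01)
L <- p^3 * (1 - p)^7
p[which.max(L)]                 # 0.3
plot(p, L, type = "l", xlab = "p", ylab = "likelihood L")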

D: That is correct. Now let’s try solving it as a maximization problem. You have the equation L = p^{3}(1-p)^{7}. How do you find p that maximizes L?

J: We can differentiate L. At the maximum or minimum, the slope \frac{dL}{dp}=0. We can use this property to solve for p.

D: Yes. But it is easier to differentiate ln(L).

J: Why is that?

D: Because ln(L) is a monotonic function of L; the value of p where ln(L) is largest is also the value where L is largest.

Finding p that maximizes ln(L) is equivalent to finding p that maximizes L. Moreover, taking the logarithm of the likelihood function converts the product terms into summation terms.

ln(L) = 3ln(p)+7ln(1-p)

Now we can find the derivative of the function and equate it to 0 to find the answer.

\frac{3}{p} - \frac{7}{1-p}=0

3 - 3p = 7p

p = \frac{3}{10}

Can you generalize this for any number of samples?

J: Yes, I can do that. Assuming there are r 1’s in a sample of n, the likelihood function will be L = p^{r}(1-p)^{n-r}

Log-likelihood function is ln(L) = rln(p)+(n-r)ln(1-p)

\frac{dln(L)}{dp} = 0

\frac{r}{p}-\frac{n-r}{1-p}=0

p=\frac{r}{n}

\frac{d^{2}ln(L)}{dp^{2}} = -[\frac{r}{p^{2}} + \frac{n-r}{(1-p)^{2}}]

A negative second derivative is sufficient to ensure maxima of the function.

D: Very nice. \hat{p}=\frac{r}{n} is called the maximum likelihood estimate of p. For a given sample X = [x_{1}, x_{2}, ..., x_{n}], \hat{p} maximizes the likelihood (joint probability of occurrence) of observing the sample.

R. A. Fisher pioneered this method in the 1920s. The opening image on likelihood is from his 1922 paper “On the Mathematical Foundations of Theoretical Statistics” in the Philosophical Transactions of the Royal Society of London where he first proposed this method.

Here is how L and ln(L) compare when finding the value of p that gives the most likelihood or log-likelihood.
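A sketch of such a comparison for the ten-shirt sample, with both curves peaking at the same p:

p <- seq(0.01, 0.99, by = 0.01)
L <- p^3 * (1 - p)^7

par(mfrow = c(1, 2))
plot(p, L, type = "l", ylab = "L")
abline(v = 0.3, lty = 2)
plot(p, log(L), type = "l", ylab = "ln(L)")
abline(v = 0.3, lty = 2)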

J: That makes it very clear. Some form of visualization always helps better grasp the abstractions. So, this is for Bernoulli sequence, or by extension for the Binomial distribution. Shall we try this maximum likelihood estimation method for one more distribution?

D: The floor is yours. Try it for the Poisson distribution.

J: I will start with the fact that there is a sample of n observations x_{1}, x_{2}, ..., x_{n} that is assumed to have originated from a Poisson distribution with the probability function f_{X}(x) = P(X=x) = \frac{e^{-\lambda t}(\lambda t)^{x}}{x!}. Our sample is counts, i.e., the number of times an event occurs in an interval. Let’s assume a unit time interval, t = 1.

The likelihood L of observing the sample x_{1}, x_{2}, ..., x_{n} is P(X=x_{1})*P(X=x_{2}) ... P(X=x_{n}). We can represent this concisely using the product operator \prod.

L = {\displaystyle \prod_{i=1}^{n} \frac{e^{-\lambda}(\lambda)^{x_{i}}}{x_{i}!}}

L = e^{-n\lambda}{\displaystyle \prod_{i=1}^{n} \frac{(\lambda)^{x_{i}}}{x_{i}!}}

ln(L) = -n\lambda + {\displaystyle \sum_{i=1}^{n} (x_{i}ln(\lambda) - ln(x_{i}!))}

\frac{dln(L)}{d\lambda} = 0

-n + {\displaystyle \sum_{i=1}^{n}\frac{x_{i}}{\lambda}} = 0

\lambda = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

\frac{d^{2}ln(L)}{d\lambda^{2}} < 0

The maximum likelihood estimator is \hat{\lambda} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}. It is the mean of the n observations in the sample. Ah, I see a connection. The expected value of the Poisson distribution E[X] is its parameter \lambda.
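A sketch that checks this result numerically in R on simulated counts (the simulation rate 3.2 is an arbitrary choice):

set.seed(6)
x <- rpois(400, lambda = 3.2)

# negative Poisson log-likelihood, to be minimized
negloglik <- function(lambda) -sum(dpois(x, lambda, log = TRUE))
optimize(negloglik, interval = c(0.01, 20))$minimum   # close to mean(x)
mean(x)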

D: Very clearly described. Do you have the energy for one more?

J: Most likely no.

To be continued…

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

