Lesson 63 – Likelihood: modus operandi for estimation

D: Often used interchangeably with probability (guilty as charged), the term likelihood refers to the tool we use for inferring the parameters of the population.

J: Are we talking about using a sample to estimate the true value of a parameter? You said last week that there are methods for estimating these parameters. I was able to grasp that there are sample records, i.e., the data we have at hand, and a hypothetical population from which these samples originate.

D: That is correct. As you remember, there was an example of extreme wind speed data (the sample). Our goal is to understand this bulk data using some probability function (GEV in this case) with a few control parameters. The probability function represents all the information contained in the sample data. It is the hypothetical population you mentioned, and the data we observe is a random sample from this population.

If you compute the mean (\bar{x}) and variance (s^{2}) of the sample data, they are good guesses (estimates) of the mean (\mu) and variance (\sigma^{2}) of the population. The formula used to compute the estimate is called an estimator.
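A minimal numerical sketch of this idea in Python (the numbers below are made up purely for illustration; NumPy is assumed):

```python
import numpy as np

# Hypothetical sample data (made-up numbers, for illustration only)
x = np.array([62.0, 75.0, 68.0, 81.0, 70.0, 66.0, 73.0])

x_bar = x.mean()       # estimate of the population mean (mu)
s2 = x.var(ddof=1)     # estimate of the population variance (sigma^2), using n-1

print(f"sample mean x_bar = {x_bar:.2f}")
print(f"sample variance s^2 = {s2:.2f}")
```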

J: Then what is likelihood? You are saying that it is a well-known method for estimation.

D: It is the joint probability of observing the data under an assumed distribution. But before we delve into the mathematical exposition, I think it will help if we take a simple example and work it out step by step using some graphics.

J: I like that idea. It helps if we know how the concept builds up, so we can go back to its roots when required.

D: So let’s begin. Imagine you ordered ten shirts from Amazon. Just imagine. Don’t even think of actually doing it.

D: On arrival, you find that the fourth (red shirt), fifth (black shirt), and eighth (white shirt) are defective.

D: Can you convert these ten observations into a Bernoulli sequence?

J: Sure. I will use 1 for defective shirts and 0 for shirts that are not defective. Then, we have

X = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0].

D: Great. Now, if these numbers (the sample of shirts) come from a hypothetical population (Bernoulli trials), we can say that there is a certain probability of a defective item. Let’s call it p.

J: So, p will be the true probability of occurrence for this Bernoulli trial. What we have is a sample of ten values of 0’s and 1’s from this population.

D: Correct. Can you estimate p using the sample data?

J: Based on the ten data points, one guess would be 0.3, since three of the ten items are defective. But how do we know that 0.3 is the best estimate of the true probability?

D: Good question. For that, we need to know about likelihood: the joint probability of observing the sample we have.

J: Can you elaborate?

D: You told me that the Bernoulli sequence of these ten observations is X = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]. Can you tell me what the joint probability of observing this sequence is?

J: Sure. If we assume p for the probability of success, P(X = 1) = p and P(X = 0) = 1-p. Then, the joint probability of seeing X = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0] is

P(X=0)*P(X=0)*P(X=0)*P(X=1)*P(X=1)*P(X=0)*P(X=0)*P(X=1)*P(X=0)*P(X=0)=p^{3}(1-p)^{7}.

They are independent and identically distributed, so the joint probability is the product of the individual probabilities.
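As a quick sanity check (a sketch in Python; the value p = 0.3 below is just a placeholder, not yet the estimate), the product of the individual Bernoulli probabilities matches p^{3}(1-p)^{7}:

```python
import numpy as np

X = np.array([0, 0, 0, 1, 1, 0, 0, 1, 0, 0])
p = 0.3   # any candidate value of p; chosen here only for illustration

# Joint probability = product of individual Bernoulli probabilities p^x_i (1-p)^(1-x_i)
joint = np.prod(p**X * (1 - p)**(1 - X))

print(joint, p**3 * (1 - p)**7)   # the two numbers agree
```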

D: Great. Let’s write it concisely as L = p^{3}(1-p)^{7}.

I am using L to describe the likelihood.

We can now choose the value of p that maximizes L, i.e., the value of p that gives the largest L. The brute-force way of doing it is to substitute possible values of p, compute L for each, and find the value of p where L is the highest.

J: Yes. Let me try that first.

L is maximum at \hat{p}=0.3.
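Here is a minimal sketch of that brute-force search (NumPy is assumed; the grid spacing of 0.01 is an arbitrary choice):

```python
import numpy as np

# Evaluate L = p^3 (1 - p)^7 on a grid of candidate values of p
p = np.linspace(0.01, 0.99, 99)
L = p**3 * (1 - p)**7

p_hat = p[np.argmax(L)]
print(f"L is largest at p = {p_hat:.2f}")   # 0.30
```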

D: That is correct. Now let’s try solving it as a maximization problem. You have the equation L = p^{3}(1-p)^{7}. How do you find p that maximizes L?

J: We can differentiate L. At the maximum or minimum, the slope \frac{dL}{dp}=0. We can use this property to solve for p.

D: Yes. But it is easier to differentiate ln(L).

J: Why is that?

D: Because ln(L) is a monotonically increasing function of L. The largest value of ln(L) occurs at the same value of p as the largest value of L.

Finding p that maximizes ln(L) is equivalent to finding p that maximizes L. Moreover, taking the logarithm of the likelihood function converts the product terms into summation terms.

ln(L) = 3ln(p)+7ln(1-p)

Now we can find the derivative of the function and equate it to 0 to find the answer.

\frac{3}{p} - \frac{7}{1-p}=0

3 - 3p = 7p

p = \frac{3}{10}

Can you generalize this for any number of samples?

J: Yes, I can do that. Assuming there are r 1’s in a sample of n, the likelihood function will be L = p^{r}(1-p)^{n-r}

The log-likelihood function is ln(L) = r\,ln(p)+(n-r)\,ln(1-p)

\frac{dln(L)}{dp} = 0

\frac{r}{p}-\frac{n-r}{1-p}=0

p=\frac{r}{n}

\frac{d^{2}ln(L)}{dp^{2}} = -[\frac{r}{p^{2}} + \frac{n-r}{(1-p)^{2}}]

A negative second derivative is sufficient to ensure that this is a maximum of the function.

D: Very nice. \hat{p}=\frac{r}{n} is called the maximum likelihood estimate of p. For a given sample X = [x_{1}, x_{2}, ..., x_{n}], \hat{p} maximizes the likelihood (the joint probability of occurrence) of observing the sample.
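For readers who want to double-check the algebra, here is a sketch that repeats the derivative-and-solve steps symbolically (SymPy is assumed; any computer algebra system would do):

```python
import sympy as sp

p, r, n = sp.symbols('p r n', positive=True)

# Log-likelihood for r ones in a sample of n Bernoulli trials
logL = r * sp.log(p) + (n - r) * sp.log(1 - p)

# Set d ln(L)/dp = 0 and solve for p
p_hat = sp.solve(sp.diff(logL, p), p)
print(p_hat)   # [r/n]

# The second derivative is negative for 0 < p < 1, confirming a maximum
print(sp.simplify(sp.diff(logL, p, 2)))
```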

R. A. Fisher pioneered this method in the 1920s. The opening image on likelihood is from his 1922 paper “On the Mathematical Foundations of Theoretical Statistics” in the Philosophical Transactions of the Royal Society of London, where he first proposed this method.

Here is how L and ln(L) compare when searching for the value of p that gives the largest likelihood or log-likelihood.
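A minimal plotting sketch (matplotlib is assumed) that produces this kind of side-by-side comparison:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.01, 0.99, 99)
L = p**3 * (1 - p)**7
lnL = 3 * np.log(p) + 7 * np.log(1 - p)

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(p, L)
ax[0].axvline(0.3, linestyle="--")
ax[0].set(xlabel="p", ylabel="L", title="Likelihood")
ax[1].plot(p, lnL)
ax[1].axvline(0.3, linestyle="--")
ax[1].set(xlabel="p", ylabel="ln(L)", title="Log-likelihood")
plt.tight_layout()
plt.show()   # both curves peak at p = 0.3
```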

J: That makes it very clear. Some form of visualization always helps in grasping the abstractions. So this is for a Bernoulli sequence, or, by extension, for the Binomial distribution. Shall we try this maximum likelihood estimation method for one more distribution?

D: The floor is yours. Try it for the Poisson distribution.

J: I will start with the fact that there is a sample of n observations x_{1}, x_{2}, ..., x_{n} that is assumed to have originated from a Poisson distribution with the probability function f_{X}(x) = P(X=x) = \frac{e^{-\lambda t}(\lambda t)^{x}}{x!}. Our sample is counts, i.e., the number of times an event occurs in an interval. Let’s assume a unit time interval, t = 1.

The likelihood L of observing the sample x_{1}, x_{2}, ..., x_{n} is P(X=x_{1})*P(X=x_{2})*...*P(X=x_{n}). We can represent this concisely using the product operator \prod.

L = {\displaystyle \prod_{i=1}^{n} \frac{e^{-\lambda}(\lambda)^{x_{i}}}{x_{i}!}}

L = e^{-n\lambda}{\displaystyle \prod_{i=1}^{n} \frac{(\lambda)^{x_{i}}}{x_{i}!}}

ln(L) = -n\lambda + {\displaystyle \sum_{i=1}^{n} (x_{i}ln(\lambda) - ln(x_{i}!))}

\frac{dln(L)}{d\lambda} = 0

-n + {\displaystyle \sum_{i=1}^{n}\frac{x_{i}}{\lambda}} = 0

\lambda = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

\frac{d^{2}ln(L)}{d\lambda^{2}} = -\frac{1}{\lambda^{2}}{\displaystyle \sum_{i=1}^{n}x_{i}} < 0

The maximum likelihood estimator is \hat{\lambda} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}. It is the mean of the n observations in the sample. Ah, I see a connection. The expected value of the Poisson distribution E[X] is its parameter \lambda.
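A quick numerical check of this result (a sketch; the simulated counts, the true \lambda = 4, and the seed are arbitrary choices for illustration, and SciPy is assumed):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(63)
x = rng.poisson(lam=4.0, size=50)   # hypothetical counts per unit time interval

lam_hat = x.mean()                  # maximum likelihood estimate: the sample mean

def loglik(lam):
    # Poisson log-likelihood of the sample at a given lambda
    return poisson.logpmf(x, lam).sum()

print(f"lambda_hat = {lam_hat:.3f}")
# The log-likelihood at lam_hat is at least as large as at nearby values
print(loglik(lam_hat) >= max(loglik(lam_hat - 0.1), loglik(lam_hat + 0.1)))   # True
```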

D: Very clearly described. Do you have the energy for one more?

J: Most likely no.

To be continued…

