Lesson 64 – More Likelihood

J: The maximum likelihood estimation method provides the formulation for computing the parameters of the population distribution that agree most closely with the sample observations. For a given sample X = [x_{1}, x_{2}, ..., x_{n}], the optimal parameter maximizes the likelihood (the joint probability of occurrence) of observing the sample. Last week, we derived the parameters for the Binomial and the Poisson, both discrete distributions. I wonder if we can use the same logic for continuous distributions?

D: Hello, Joe. You look recharged and ready to go, so let’s cut to the chase. Yes, we can apply the same joint probability logic to continuous probability distributions. Let’s take a random sample X = [x_{1}, x_{2}, ..., x_{n}] from a population with a probability density function f_{X}(x), as in this figure.

Assume we can divide the space into equal intervals of length dx. Can you tell me the probability that a given sample data point x_{1} falls in the range dx?

J: That would be the area under f_{X}(x) over that range, i.e., the integral of f_{X}(x)dx over the interval.

D: Correct. For an infinitesimally small range dx, we can approximate this area as f_{X}(x_{1})dx. Since the sample observations are independent and identically distributed, the joint probability of observing them, i.e., the likelihood function, is

L = f_{X}(x_{1})dx*f_{X}(x_{2})dx*...*f_{X}(x_{n})dx

L = {\displaystyle \prod_{i=1}^{n}f_{X}(x_{i})}(dx)^{n}

Maximizing the above function is equivalent to maximizing L = {\displaystyle \prod_{i=1}^{n}f_{X}(x_{i})} since we have equal intervals of length dx.
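To see this numerically, we can evaluate the likelihood, the product of the density at each observation, over a grid of candidate parameter values and pick the value where it peaks. Here is a minimal sketch in R, assuming a small simulated sample from an exponential distribution, chosen purely for illustration (we will derive the exponential case properly in a moment).

set.seed(1)
x <- rexp(50, rate = 2)                      # toy sample; true rate is 2

# Likelihood: product of the density evaluated at every observation,
# viewed as a function of the candidate parameter lambda
likelihood <- function(lambda) prod(dexp(x, rate = lambda))

lambda_grid <- seq(0.1, 5, by = 0.01)
L_values <- sapply(lambda_grid, likelihood)

lambda_grid[which.max(L_values)]             # close to 1 / mean(x)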

J: I see. So, it reduces to the same form of likelihood we worked with last week: the product of the probability density function evaluated at the sample values. We can choose the value of the parameter(s) that maximizes this quantity.

D: Precisely. Here are Fisher’s own words from his paper.

J:  Shall we try some continuous distributions today?

D: We shall. Let’s begin with the exponential distribution.

J: Let me solve it. The probability density function for the exponential distribution is f_{X}(x) = \lambda e^{- \lambda x}. So the likelihood function is

L = {\displaystyle \prod_{i=1}^{n}\lambda e^{- \lambda x_{i}}}

L = \lambda^{n}e^{-\lambda (x_{1}+x_{2}+...+x_{n})}

The log-likelihood will be

ln(L) = nln(\lambda) - \lambda (x_{1}+x_{2}+...+x_{n})

\lambda is the parameter we are trying to estimate by maximum likelihood. So take the derivative of the function ln(L) with respect to \lambda and equate it to 0 to solve for \lambda.

\frac{d ln(L)}{d \lambda}=0

\frac{n}{\lambda} - \sum_{i=1}^{n}x_{i}=0

\lambda = \frac{1}{\frac{1}{n} \sum_{i=1}^{n}x_{i}} = \frac{1}{\bar{x}}

The maximum likelihood estimator for the exponential distribution is \hat{\lambda} = \frac{1}{\bar{x}}.
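As a quick check, here is a short sketch in R, with a simulated sample and a made-up true rate, comparing the closed-form estimate 1/\bar{x} with a direct numerical maximization of the log-likelihood derived above.

set.seed(64)
x <- rexp(1000, rate = 1.5)                  # simulated sample; true rate is 1.5

lambda_hat <- 1 / mean(x)                    # closed-form MLE

# Numerical check: maximize ln(L) = n*ln(lambda) - lambda*sum(x)
loglik <- function(lambda) length(x) * log(lambda) - lambda * sum(x)
lambda_num <- optimize(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum

c(closed_form = lambda_hat, numerical = lambda_num)   # both near 1.5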

D: Excellent. Another continuous function you can try is the Rayleigh density. It is the probability distribution of the distance r = \sqrt{x^{2}+y^{2}} from the origin to a point with coordinates (x, y) in the x-y plane, where X and Y are independent, zero-mean normal random variables with equal variance.

f(r) = \frac{r}{\theta}e^{-\frac{r^{2}}{2 \theta}}

\theta is the parameter.

Do you recognize this function? We learned a probability distribution which is similar to this.

J: Aaah… Chi-square? Sum of squares of normals.

D: Correct. The Rayleigh is the distribution of the distance, the square root of a Chi-square distributed quantity.

J: Let me go through the steps of estimating \theta using the maximum likelihood method.

L = {\displaystyle \prod_{i=1}^{n}\frac{r_{i}}{\theta}e^{-\frac{r_{i}^{2}}{2\theta}}}

L = \frac{1}{\theta^{n}} e^{- \frac{1}{2 \theta}\sum_{i=1}^{n}r_{i}^{2}} {\displaystyle \prod_{i=1}^{n}r_{i}}

ln(L) = -nln(\theta) + {\displaystyle \sum_{i=1}^{n}ln(r_{i})}-\frac{1}{2 \theta}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}

We have one parameter to estimate, so \frac{d ln(L)}{d \theta}=0.

-\frac{n}{\theta}+\frac{1}{2 \theta^{2}}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}=0

\theta = \frac{1}{2n}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}

The maximum likelihood estimator for the Rayleigh distribution is \hat{\theta} = \frac{1}{2n}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}
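Here is a minimal simulation sketch in R to check this estimator, assuming the distances are generated from two independent zero-mean normals as in the setup above (the true \theta value is made up for illustration).

set.seed(64)
theta_true <- 4                              # made-up true value (variance of X and Y)
x <- rnorm(1000, mean = 0, sd = sqrt(theta_true))
y <- rnorm(1000, mean = 0, sd = sqrt(theta_true))
r <- sqrt(x^2 + y^2)                         # Rayleigh-distributed distances

theta_hat <- sum(r^2) / (2 * length(r))      # MLE derived above
theta_hat                                    # should be close to 4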

D: Very well done. Now let’s flex our muscles. How can we estimate two or more parameters?

J: Like for example, the parameters of the normal distribution, \mu and \sigma?

D: Yes.

J:

D: It is a simple extension. Since there will be two parameters in the likelihood function of a normal distribution, we have to solve the set of equations derived through the partial derivatives. Look at these steps for the normal distribution.

f_{X}(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}}

L = {\displaystyle \prod_{i=1}^{n}} \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2}(\frac{x_{i}-\mu}{\sigma})^{2}}

L = (2\pi)^{-n/2}\sigma^{-n}e^{-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_{i}-\mu)^{2}}

ln(L)=-\frac{n}{2}ln(2\pi) - nln(\sigma) - \frac{1}{2\sigma^{2}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}

Now we can have two equations based on the partial derivatives of the log-likelihood function.

\frac{\partial ln(L)}{\partial \mu}=0

\frac{\partial ln(L)}{\partial \sigma}=0

Let’s take the first partial derivative.

\frac{\partial ln(L)}{\partial \mu}=0

\frac{\partial}{\partial \mu}(- \frac{1}{2\sigma^{2}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}) =0

\frac{\partial}{\partial \mu}(- \frac{1}{2\sigma^{2}}({\displaystyle \sum_{i=1}^{n} x_{i}^{2}} + {\displaystyle \sum_{i=1}^{n}\mu^{2}} - 2\mu{\displaystyle \sum_{i=1}^{n}x_{i}})) = 0

-\frac{1}{2\sigma^{2}} {\displaystyle \sum_{i=1}^{n}2\mu} + \frac{1}{\sigma^{2}} {\displaystyle \sum_{i=1}^{n}x_{i}}=0

n\mu = {\displaystyle \sum_{i=1}^{n}x_{i}}

\mu = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

Now, take the second partial derivative.

\frac{\partial ln(L)}{\partial \sigma}=0

\frac{\partial}{\partial \sigma}(-\frac{n}{2}ln(2\pi) - nln(\sigma) - \frac{1}{2\sigma^{2}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}) = 0

-\frac{n}{\sigma}+\frac{1}{\sigma^{3}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}=0

\sigma^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}

The maximum likelihood estimators for the Normal distribution are \hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} and \hat{\sigma}^{2}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\hat{\mu})^{2}}
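A short sketch in R, with simulated data and made-up true values, can confirm these estimators. Note that the MLE of the variance divides by n, so it is slightly smaller than R's var(), which divides by n - 1.

set.seed(64)
x <- rnorm(500, mean = 10, sd = 3)           # simulated sample; true mu = 10, sigma^2 = 9

mu_hat     <- mean(x)                        # MLE of mu
sigma2_hat <- mean((x - mu_hat)^2)           # MLE of sigma^2 (divides by n)

c(mu_hat, sigma2_hat)
var(x) * (length(x) - 1) / length(x)         # equals sigma2_hat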

J: That makes it very clear. There is some work involved in solving, but the procedure is straightforward.

D: You can try estimating the parameters of the lognormal distribution for practice.

J: Well, the lognormal has a very similar probability density function. Let me try the Gumbel instead. It is a double exponential distribution; it will be fun to solve.

D: Okay. Go for it.

J: The cumulative distribution function for the Gumbel distribution is F(z)=e^{-e^{-\frac{z-\alpha}{\beta}}}. \alpha and \beta are the parameters.

Taking the derivative of this function, we can get the probability density function for the Gumbel distribution as

f(z)=\frac{1}{\beta}e^{-\frac{z-\alpha}{\beta}}e^{-e^{-\frac{z-\alpha}{\beta}}}

The likelihood function is

L = {\displaystyle \prod_{i=1}^{n} \frac{1}{\beta}e^{-\frac{z_{i}-\alpha}{\beta}}e^{-e^{-\frac{z_{i}-\alpha}{\beta}}}}

L = \beta^{-n} e^{-{\displaystyle \sum_{i=1}^{n} (\frac{z_{i}-\alpha}{\beta} + e^{-\frac{z_{i}-\alpha}{\beta}}) }}

ln(L) = -nln(\beta) -{\displaystyle \sum_{i=1}^{n} (\frac{z_{i}-\alpha}{\beta} + e^{-\frac{z_{i}-\alpha}{\beta}}) }

We can get the partial derivatives with respect to \alpha and \beta and solve the system of equations.

\frac{\partial ln(L)}{\partial \alpha} = \frac{n}{\beta} - \frac{1}{\beta}{\displaystyle \sum_{i=1}^{n}e^{-\frac{z_{i}-\alpha}{\beta}}}=0

\frac{\partial ln(L)}{\partial \beta} = -\frac{n}{\beta} + \frac{1}{\beta^{2}}{\displaystyle \sum_{i=1}^{n}(z_{i}-\alpha)} - \frac{1}{\beta^{2}}{\displaystyle \sum_{i=1}^{n}(z_{i}-\alpha) e^{-\frac{z_{i}-\alpha}{\beta}}}=0

D: Yes, I know. There is no closed-form solution for this system of equations. You have to use iterative numerical techniques. Do you remember the Newton-Raphson method?

J:

D: It’s okay if you do not remember. We will let R do it for us. The moral is that the MLE method is not a panacea. We also run into issues of discontinuity in likelihood functions sometimes; think of the uniform distribution, for example. As you know, we cannot use calculus methods on such functions. A combination of the MLE method with numerical techniques and approximations is a good approach.
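Since the two Gumbel equations have no closed-form solution, here is a minimal sketch of how we could let R do the numerical work, using the general-purpose optimizer optim() on the negative log-likelihood derived above. The sample is simulated through the inverse of the cumulative function, and the true parameter values are made up for illustration.

set.seed(64)
alpha_true <- 10; beta_true <- 2             # made-up true parameters
u <- runif(2000)
z <- alpha_true - beta_true * log(-log(u))   # Gumbel sample via the inverse CDF

# Negative log-likelihood: n*ln(beta) + sum((z - alpha)/beta + exp(-(z - alpha)/beta))
negloglik <- function(par) {
  alpha <- par[1]; beta <- par[2]
  if (beta <= 0) return(Inf)                 # keep the scale parameter positive
  length(z) * log(beta) + sum((z - alpha) / beta + exp(-(z - alpha) / beta))
}

fit <- optim(par = c(mean(z), sd(z)), fn = negloglik)
fit$par                                      # estimates of alpha and beta, near 10 and 2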

J: I can see how things are building up. Are there any such pitfalls I need to pay attention to?

D: Yes, there are some properties, limitations and real-world applications we can learn. Do you want to do it now?

J: Let’s do it next week. I am afraid these equations are going to be my sweet dreams tonight.

To be continued…

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

