Maximum Likelihood Estimation (MLE) is a method for estimating the parameters of a probability model.

Given a probability model \(P(x|\theta)\) and a data set \(\{x_i\}_{i=1}^n\) sampled *independently and identically* from some underlying *true model* \(P(x|\theta^*)\), the goal of MLE is to produce an estimate \(\hat{\theta}\) of the *true parameter* \(\theta^*\).

# Algorithm

Define the *likelihood* of the given data set with respect to the model \(P(x|\theta)\) as

\[ \prod_{i=1}^n P(x_i|\theta) \]

The MLE estimate of the parameter is obtained as

\[ \hat{\theta} = \operatorname*{arg max}_\theta \prod_{i=1}^n P(x_i|\theta) \]

To avoid numerical underflow during computation, the maximization is usually carried out in log space. Since the \(\log\) function is strictly monotonically increasing, the following is equivalent to the objective above:

\[ \hat{\theta} = \operatorname*{arg max}_\theta \sum_{i=1}^n \log P(x_i|\theta) \]
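As a concrete sketch, consider a hypothetical Bernoulli coin-flip model, with a simple grid search standing in for a real optimizer (the data and grid here are illustrative assumptions, not part of the algorithm itself):

```python
import math

# Hypothetical data set: 10 coin flips (1 = heads), assumed i.i.d. Bernoulli.
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def log_likelihood(theta, xs):
    """Sum of log P(x_i | theta) for a Bernoulli model.

    Working with sums of logs avoids the underflow that a direct
    product of many small probabilities would cause for large n.
    """
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in xs)

# Maximize over a grid of candidate parameters.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, data))

# For the Bernoulli model the MLE also has a closed form: the sample mean.
assert abs(theta_hat - sum(data) / len(data)) < 1e-9
```

Since \(\log\) is monotone, maximizing the log-likelihood over the grid gives the same \(\hat{\theta}\) as maximizing the raw likelihood product.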

# Rationale

## KL Divergence

MLE can be interpreted as minimizing the KL divergence from the true model \(P(x|\theta^*)\) to the estimated model \(P(x|\hat{\theta})\). The Kullback-Leibler (KL) divergence is a non-symmetric measure of the difference between two probability distributions, defined as

\[ D\left[ P(\cdot|\theta^*) \Big\Vert P(\cdot|\hat{\theta}) \right] = \mathbb{E}_{x\sim P(\cdot|\theta^*)} \left[ \log\left( \frac{P(x|\theta^*)}{P(x|\hat{\theta})}\right) \right] \]
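This definition is easy to compute directly for discrete distributions. A minimal sketch (the two example distributions are arbitrary assumptions) that also illustrates the non-symmetry:

```python
import math

def kl_divergence(p, q):
    """D[p || q] = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Two hypothetical distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# KL divergence is non-negative, zero iff the distributions are equal,
# and non-symmetric: D[p || q] != D[q || p] in general.
assert kl_divergence(p, p) == 0.0
assert kl_divergence(p, q) != kl_divergence(q, p)
```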

It is easy to show that

\[ \begin{aligned} D\left[ P(\cdot|\theta^*) \Big\Vert P(\cdot|\hat{\theta}) \right] &= \mathbb{E}_{x\sim P(\cdot|\theta^*)}\left[ \log P(x|\theta^*) \right] - \mathbb{E}_{x\sim P(\cdot|\theta^*)}\left[ \log P(x|\hat{\theta})\right] \\ &= \color{blue}{-\mathbb{H}\left[ P(x|\theta^*)\right]} - \color{red}{\mathbb{E}_{x\sim P(\cdot|\theta^*)}\left[ \log P(x|\hat{\theta})\right]} \end{aligned} \]

The blue part is the negative entropy of the true distribution \(P(x|\theta^*)\), which is constant with respect to our estimate \(\hat{\theta}\) and can therefore be safely ignored. So to minimize the KL divergence, we can simply maximize the red part, the expected log-likelihood under the true distribution. Finally, we approximate this expectation with respect to the unknown distribution \(P(x|\theta^*)\) with the empirical expectation defined by the data set \(\{x_i\}_{i=1}^n\):

\[ \mathbb{E}_{x\sim P(\cdot|\theta^*)}\left[ \log P(x|\hat{\theta})\right] \approx \frac{1}{n} \sum_{i=1}^n \log P(x_i|\hat{\theta}) \]

Since the constant factor \(\frac{1}{n}\) does not change the maximizer, maximizing this empirical average over \(\hat{\theta}\) is exactly the MLE objective.
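This argument can be checked numerically. In the sketch below, we simulate a hypothetical Bernoulli true model with \(\theta^* = 0.7\) (the parameter, sample size, and grid are all illustrative assumptions) and verify that maximizing the empirical average log-likelihood recovers \(\theta^*\) as \(n\) grows:

```python
import math
import random

random.seed(0)

# Simulate a large i.i.d. sample from a "true" Bernoulli model.
theta_star = 0.7
n = 100_000
data = [1 if random.random() < theta_star else 0 for _ in range(n)]

def avg_log_likelihood(theta, xs):
    """Empirical estimate of E_{x ~ P(.|theta*)}[log P(x|theta)]."""
    return sum(math.log(theta if x else 1.0 - theta) for x in xs) / len(xs)

# Maximizing the empirical average log-likelihood (equivalently, minimizing
# the KL divergence up to the constant entropy term) recovers theta*.
grid = [i / 100 for i in range(1, 100)]
theta_hat = max(grid, key=lambda t: avg_log_likelihood(t, data))
assert abs(theta_hat - theta_star) <= 0.01
```

For large \(n\) the empirical average concentrates around the true expectation, which is why the grid maximizer lands at (or next to) \(\theta^*\).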