Shrinkage with Penalized Likelihood
By setting \(|\Sigma| = 1\) we ensure that the covariance matrix is invertible at each iteration of the EM algorithm. However, invertibility does not guarantee that the inversion is well-conditioned. The formula for \(\Sigma_{k+1}\) in (3) suffers from the same problem as the sample covariance: the condition number of \(\Sigma_{k+1}\) can be huge when the sample size is small relative to the dimension. Shrinkage estimators based on penalized likelihood can improve the conditioning.
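The conditioning problem, and the linear shrinkage remedy developed below, can be illustrated numerically. The dimensions, shrinkage weight, and identity target here are illustrative choices, not values from the text:

```python
import numpy as np

# With p = 50 dimensions and only n = 60 samples, the sample covariance
# is full rank but has tiny smallest eigenvalues, so its condition number
# is large even though the true covariance is the identity.
rng = np.random.default_rng(0)
p, n = 50, 60
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)              # sample covariance

# Linear shrinkage toward a well-conditioned target (here the identity)
# lifts every eigenvalue to at least alpha, bounding the condition number.
alpha = 0.2                              # illustrative shrinkage weight
S_shrunk = (1 - alpha) * S + alpha * np.eye(p)

print(np.linalg.cond(S))                 # large
print(np.linalg.cond(S_shrunk))          # much smaller
```

Every eigenvalue of `S_shrunk` is at least \((1-\alpha)\lambda_{\min}(S) + \alpha \geq \alpha\), which is what caps the condition number regardless of how degenerate \(S\) is.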
Penalized Likelihood Framework
Consider a general exponential family with density:
\[
f(x, y | \theta) = h(x, y)\, \exp\bigl(\langle \theta, t(x, y) \rangle - \psi(\theta)\bigr),
\]
where
\[
\psi(\theta) = \log \iint h(x, y)\, \exp\bigl(\langle \theta, t(x, y) \rangle\bigr)\, dx\, dy
\]
is the log-partition function and \(t(x, y)\) is the vector of sufficient statistics.
The Kullback-Leibler divergence between two members of this family is:
\[
D(\theta_1 \,\|\, \theta_2) = \psi(\theta_2) - \psi(\theta_1) - \langle \theta_2 - \theta_1,\, \eta_1 \rangle,
\]
where \(\eta_1 = E[t(X, Y) | \theta_1] = \nabla\psi(\theta_1)\). This is also the Bregman divergence with potential function \(\psi\).
Assume that \(x\) is observable while \(y\) is hidden. Given a sample \(x_1, \ldots, x_n\) and a prior parameter \(\theta_0\), the penalized likelihood maximization is:
\[
\max_{\theta} \; \sum_{i=1}^{n} \log f(x_i | \theta) \;-\; \tau\, D(\theta_0 \,\|\, \theta),
\]
where \(f(x|\theta) = \int f(x,y|\theta) \, dy\) is the marginal density and \(\tau \geq 0\) controls the amount of shrinkage. When \(\tau = 0\) this reduces to standard maximum likelihood; as \(\tau\) increases, the solution is pulled toward \(\theta_0\).
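The pull toward \(\theta_0\) is easiest to see in a toy case not taken from the text: a univariate Gaussian with known unit variance, where \(D(\theta_0 \,\|\, \theta) = (\mu - \mu_0)^2 / 2\) and the penalized maximizer has the closed form \(\hat{\mu} = (\sum_i x_i + \tau \mu_0)/(n + \tau)\):

```python
import numpy as np

def penalized_mean(x, mu0, tau):
    """Penalized MLE of a Gaussian mean with known unit variance:
    maximize sum_i log N(x_i; mu, 1) - tau * KL(N(mu0,1) || N(mu,1)).
    Setting the derivative sum_i (x_i - mu) - tau*(mu - mu0) to zero
    gives a convex combination of the sample mean and the prior mean."""
    n = len(x)
    return (np.sum(x) + tau * mu0) / (n + tau)

x = np.array([1.0, 2.0, 3.0, 4.0])
print(penalized_mean(x, mu0=0.0, tau=0.0))   # 2.5, the ordinary MLE
print(penalized_mean(x, mu0=0.0, tau=1e6))   # ~0, pulled to the prior
```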
Penalized EM Algorithm
This problem can be solved iteratively by the following EM algorithm: starting from an initial \(\theta_0\), iterate
\[
\theta_{k+1} = \arg\max_{\theta} \; \sum_{i=1}^{n} E\bigl[\log f(x_i, Y | \theta) \,\big|\, x_i, \theta_k\bigr] \;-\; \tau\, D(\theta_0 \,\|\, \theta).
\]
One can show that the penalized likelihood is non-decreasing:
\[
\sum_{i=1}^{n} \log f(x_i | \theta_{k+1}) - \tau\, D(\theta_0 \,\|\, \theta_{k+1}) \;\geq\; \sum_{i=1}^{n} \log f(x_i | \theta_k) - \tau\, D(\theta_0 \,\|\, \theta_k),
\]
since the difference between the two sides includes a non-negative KL divergence between the conditional distributions \(f(y|x,\theta_k)\) and \(f(y|x,\theta_{k+1})\).
Using the exponential family representation, the penalized M-step reduces to:
\[
\nabla\psi(\theta_{k+1}) = \frac{1}{n + \tau}\left(\sum_{i=1}^{n} E\bigl[t(x_i, Y) \,\big|\, x_i, \theta_k\bigr] + \tau\, \eta_0\right),
\]
where \(\eta_0 = E[t(X, Y) | \theta_0]\) are the expectation parameters at the prior.
Shrunk Sufficient Statistics
As a result, the E-step computes shrunk sufficient statistics by taking a convex combination of the conditional expectations and the prior expectations:
\[
\hat{\eta}^{(k)} = (1 - \alpha)\, \frac{1}{n}\sum_{i=1}^{n} E\bigl[t(x_i, Y) \,\big|\, x_i, \theta_k\bigr] + \alpha\, \eta_0, \qquad \alpha = \frac{\tau}{n + \tau},
\]
where the prior expectations are computed from the GH expectation parameters (see (6)):
The penalized maximum likelihood thus amounts to a linear shrinkage of the conditional expectation parameters toward the prior \(\theta_0\). The M-step is identical to the standard EM algorithm (see (3)).
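The convex-combination step can be sketched generically; the variable names and numbers below are illustrative, not the paper's:

```python
import numpy as np

def shrink_stats(eta_bar, eta0, n, tau):
    """Shrink averaged conditional-expectation statistics (eta_bar)
    toward the prior expectations (eta0) with weight alpha = tau/(n+tau)."""
    alpha = tau / (n + tau)
    return (1 - alpha) * eta_bar + alpha * eta0

# The combination is algebraically identical to the penalized M-step
# solution (n * eta_bar + tau * eta0) / (n + tau).
eta_bar, eta0 = np.array([2.0, 5.0]), np.array([0.0, 1.0])
n, tau = 100, 25
lhs = shrink_stats(eta_bar, eta0, n, tau)
rhs = (n * eta_bar + tau * eta0) / (n + tau)
print(lhs, np.allclose(lhs, rhs))   # [1.6 4.2] True
```

Because the M-step is untouched, the only change to a standard EM implementation is this one line applied to the expectation parameters before re-estimating \(\theta\).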
Furthermore, the linear relationship between \(\Sigma_{k+1}\) and \(\hat{\eta}_6^{(k)}\) implies that \(\Sigma_{k+1}\) is shrunk toward \(\Sigma_0\) directly. By choosing an appropriate \(\Sigma_0\) (e.g., a well-conditioned target), one can improve the conditioning of \(\Sigma_{k+1}\) at each iteration.