Online EM Algorithm#

In this section we review the online EM algorithm for exponential families with hidden data [Cappe2009], and apply it to the Generalized Hyperbolic distribution.

Online EM for Exponential Families#

Consider the exponential family with density:

\[f(x, y | \theta) = h(x, y) \exp\!\left(\theta^\top t(x, y) - \psi(\theta)\right),\]

where \(x\) is observable, \(y\) is hidden, and

\[\psi(\theta) = \log \int h(x, y) \exp\!\left(\theta^\top t(x, y)\right) dx \, dy.\]

The expectation parameter is \(\eta = \nabla\psi(\theta)\), and the Legendre dual of \(\psi\) is \(\phi(\eta) = \eta^\top \theta - \psi(\theta)\), with \(\theta = \nabla\phi(\eta)\).

Regret Framework#

The parameter estimation process can be viewed as an online game. At time \(t-1\) we make a prediction \(\theta_{t-1}\). At time \(t\) the environment reveals an observation \(x_t\), and the loss is \(l(x_t, \theta_{t-1}) = -\log f(x_t | \theta_{t-1})\). The regret from \(t = 1\) to \(T\) is:

(1)#\[r_T(\theta_0, \ldots, \theta_{T-1}) = -\sum_{t=1}^T \log f(x_t | \theta_{t-1}) - \min_\theta \left(-\sum_{t=1}^T \log f(x_t | \theta) + \tau_0 \, D_\psi(\theta \| \theta_0)\right),\]

where \(D_\psi(\theta \| \theta_0) = \psi(\theta) - \psi(\theta_0) - \nabla\psi(\theta_0)^\top (\theta - \theta_0)\) is the Bregman divergence, which coincides with the KL divergence \(D_{KL}(\theta_0 \| \theta)\).

Update Rules#

The online EM algorithm updates the expectation parameters as follows. Starting from \(\eta_0 = \nabla\psi(\theta_0)\):

(2)#\[\begin{split}\eta_t &= \eta_{t-1} + \tau_t^{-1} \left(\bar{t}(x_t | \theta_{t-1}) - \eta_{t-1}\right), \\ \theta_t &= \nabla\phi(\eta_t),\end{split}\]

where

\[\bar{t}(x_t | \theta_{t-1}) := E[t(X, Y) | X = x_t, \theta_{t-1}]\]

is the conditional expectation of the sufficient statistics, and \(\tau_t = \tau_0 + t\) is the step size.

Regret Bound#

Theorem. With \(\tau_t = \tau_0 + t\), the regret (1) of the online EM algorithm (2) is:

(3)#\[r_T = \sum_{t=1}^T \tau_t \, D_\psi(\theta_{t-1} \| \theta_t) + \sum_{t=1}^T D_{KL}(x_t, \theta_{t-1} \| x_t, \theta_{ML}) - \tau_T \, D_\phi(\eta_T \| \eta_{ML}),\]

where \(\theta_{ML}\) is the penalized MLE:

\[\theta_{ML} = \arg\min_\theta \left(-\sum_{t=1}^T \log f(x_t | \theta) + \tau_0 \, D_\psi(\theta \| \theta_0)\right),\]

\(\eta_{ML} = \nabla\psi(\theta_{ML})\), and \(D_{KL}(x_t, \theta_{t-1} \| x_t, \theta_{ML})\) is the KL divergence between the conditional densities \(f(y | x_t, \theta_{t-1})\) and \(f(y | x_t, \theta_{ML})\).

Proof. The first part of the regret is:

\[\begin{split}-\sum_{t=1}^T \log f(x_t | \theta_{t-1}) &= -\sum_{t=1}^T E[\log f(X, Y | \theta_{t-1}) | X = x_t, \theta_{t-1}] \\ &\quad + \sum_{t=1}^T E[\log f(Y | X, \theta_{t-1}) | X = x_t, \theta_{t-1}].\end{split}\]

The conditional expectation of the joint log-density satisfies:

\[\begin{split}&\sum_{t=1}^T \left(\psi(\theta_{t-1}) - \theta_{t-1}^\top \bar{t}(x_t | \theta_{t-1})\right) \\ &= \sum_{t=1}^T \left(-\phi(\eta_{t-1}) - \tau_t \theta_{t-1}^\top (\eta_t - \eta_{t-1})\right) \\ &= \sum_{t=1}^T \tau_t D_\phi(\eta_t \| \eta_{t-1}) + \tau_0 \phi(\eta_0) - \tau_T \phi(\eta_T) \\ &= \sum_{t=1}^T \tau_t D_\psi(\theta_{t-1} \| \theta_t) + \tau_0 \phi(\eta_0) - \tau_T \phi(\eta_T),\end{split}\]

using \(\phi(\eta) = \theta^\top \eta - \psi(\theta)\) and \(D_\phi(\eta_t \| \eta_{t-1}) = D_\psi(\theta_{t-1} \| \theta_t)\).

The key identity for the accumulated conditional expectations is:

\[\eta_T = \tau_T^{-1} \left(\tau_0 \eta_0 + \sum_{t=1}^T \bar{t}(x_t | \theta_{t-1})\right).\]

Applying this to the second part of the regret:

\[\begin{split}&-\sum_{t=1}^T E[\log f(X, Y | \theta_{ML}) | X = x_t, \theta_{t-1}] + \tau_0 \, D_\psi(\theta_{ML} \| \theta_0) \\ &= \tau_0 \phi(\eta_0) - \tau_T \phi(\eta_T) + \tau_T D_\phi(\eta_T \| \eta_{ML}).\end{split}\]

Combining both parts yields (3). \(\square\)

Application to the GH Distribution#

For the joint GH distribution (2), which is an exponential family, the online EM algorithm updates the sufficient statistics as follows. Given a sequence of observations \(x_1, x_2, \ldots, x_T\):

\[\begin{split}s_1^{(t+1)} &= s_1^{(t)} + \tau_{t+1}^{-1} \left(E[Y^{-1} | X = x_{t+1}, \theta_t] - s_1^{(t)}\right), \\ s_2^{(t+1)} &= s_2^{(t)} + \tau_{t+1}^{-1} \left(E[Y | X = x_{t+1}, \theta_t] - s_2^{(t)}\right), \\ s_3^{(t+1)} &= s_3^{(t)} + \tau_{t+1}^{-1} \left(E[\log Y | X = x_{t+1}, \theta_t] - s_3^{(t)}\right), \\ s_4^{(t+1)} &= s_4^{(t)} + \tau_{t+1}^{-1} \left(x_{t+1} - s_4^{(t)}\right), \\ s_5^{(t+1)} &= s_5^{(t)} + \tau_{t+1}^{-1} \left(x_{t+1} \, E[Y^{-1} | X = x_{t+1}, \theta_t] - s_5^{(t)}\right), \\ s_6^{(t+1)} &= s_6^{(t)} + \tau_{t+1}^{-1} \left(x_{t+1} x_{t+1}^\top E[Y^{-1} | X = x_{t+1}, \theta_t] - s_6^{(t)}\right),\end{split}\]

where the conditional expectations are given by (1).

The parameters \(\theta_t = (\mu_t, \gamma_t, \Sigma_t, p_t, a_t, b_t)\) are then recovered from the sufficient statistics using the same formulas as the M-step (see (3)):

\[\begin{split}\mu_t &= \frac{s_4^{(t)} - s_2^{(t)} s_5^{(t)}} {1 - s_1^{(t)} s_2^{(t)}}, \\ \gamma_t &= \frac{s_5^{(t)} - s_1^{(t)} s_4^{(t)}} {1 - s_1^{(t)} s_2^{(t)}}, \\ \Sigma_t &= s_6^{(t)} - s_5^{(t)} \mu_t^\top - \mu_t (s_5^{(t)})^\top + s_1^{(t)} \mu_t \mu_t^\top - s_2^{(t)} \gamma_t \gamma_t^\top,\end{split}\]

and

\[(p_t, a_t, b_t) = \arg\max_{p, a, b} L_{GIG}(p, a, b \,|\, s_1^{(t)}, s_2^{(t)}, s_3^{(t)}).\]

The online EM algorithm has comparable convergence rate to the batch EM algorithm, but each step processes only a single observation, making it much faster per iteration. The step size \(\tau_t = \tau_0 + t\) corresponds to starting with an initial weight of \(\tau_0\) “pseudo-observations.”

Limitations for Curved Exponential Families#

The online EM algorithm of [Cappe2009] may not converge for curved exponential families of the form:

\[f(x, y | u) = h(x, y) \exp\!\left(\theta(u)^\top t(x, y) - \psi(\theta(u))\right),\]

where \(\theta(u)\) is a nonlinear mapping projecting the parameter \(u\) to a higher-dimensional space. The batch EM algorithm for curved exponential families solves:

\[\nabla\theta(u)^\top \left(\frac{1}{n} \sum_{j=1}^n E[t(X, Y) | x_j, \theta(u)] - \nabla\psi(\theta(u))\right) = 0,\]

while the online EM tends to solve:

\[\frac{1}{n} \sum_{j=1}^n E[t(X, Y) | x_j, \theta(u)] - \nabla\psi(\theta(u)) = 0,\]

which may not have a solution when \(\dim(u) < \dim(\theta)\). Therefore, the online EM algorithm cannot be directly applied to the factor analysis model described in Factor Analysis for Generalized Hyperbolic Distributions.

References#

[Cappe2009] (1,2)

Cappé, O. & Moulines, E. (2009). On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B, 71(3), 593-613.