6.1 The BFGS Method

The BFGS method is named for its discoverers Broyden, Fletcher, Goldfarb, and Shanno.

We begin with the quadratic model of the objective function at the current iterate \(x_k\):

\[m_k(p) = f_k + \nabla f_k^\top p + \frac{1}{2} p^\top B_k p\]

Here \(B_k\) is an \(n \times n\) symmetric positive definite matrix that will be updated at every iteration. The minimizer \(p_k\) of this model is given by

\[p_k = -B_k^{-1}\nabla f_k\]

and is used as the search direction. The new iterate is

\[x_{k+1} = x_k + \alpha_kp_k\]

where \(\alpha_k\) is chosen to satisfy the Wolfe conditions.

Davidon proposed to update \(B_k\) in a simple manner to account for the curvature measured during the most recent step. Given \(x_{k+1}\), we wish to construct a new quadratic model, of the form

\[m_{k+1}(p) = f_{k+1} + \nabla f_{k+1}^\top p + \frac{1}{2}p^\top B_{k+1}p\]

One resonable requirement of \(B_{k+1}\) is that the gradient of \(m_{k+1}\) should match the gradient of the objective function \(f\) at \(x_k\) and \(x_{k+1}\). Since \(\nabla m_{k+1}(0)\) is precisely \(\nabla f_{k+1}\), we only care about the first condition, written by

\[\nabla m_{k+1}(-\alpha_kp_k) = \nabla f_{k+1} - \alpha_kB_{k+1}p_k = \nabla f_k\]

Then we obtain

\[B_{k+1}\alpha_k p_k = \nabla f_{k+1} - \nabla f_k\]

To simplify the notation, we define the vectors

\[s_k = x_{k+1} - x_k = \alpha_kp_k, \;\;\; y_k = \nabla f_{k+1} - \nabla f_k\]

So we have

\[B_{k+1}s_k = y_k\]

We refer to this formula as the secant equation.

The secant equation requires \(B_k\) maps \(s_k\) into \(y_k\). This is possible only if \(s_k\) and \(y_k\) satisfy the curvature condition

\[s_k^\top y_k > 0\]

This does not always hold especially for nonconvex functions. However, it is guaranteed to hold if we impose the Wolfe or strong Wolfe conditions on the line search.

To determine \(B_k\) uniquely, we impose the additional condition that among all symmetric matrices satisfying the secant equation, :math:`B_k` is, in some sense, closet to the current matrix :math:`B_k`. In other words, we solve the problem

\[\begin{split}\min_B \; & \lVert B - B_k \rVert \\ \text{subject to } \; & B = B^\top, \; Bs_k = y_k\end{split}\]

where \(s_k\) and \(y_k\) satisfy the curvature condition and \(B_k\) is symmtric and positive definite. A matrix norm that allows easy solution of the minimization problem and gives rise to a scale-invariant optimization method is the weighted Frobenius norm

\[\lVert A \rVert_W \equiv \lVert W^{1/2}AW^{1/2} \rVert_F\]

where \(\lVert C \rVert_F = \sum_{i=1}^n \sum_{j=1}^n c_{ij}^2\). The weight matrix can be chosen as any matrix satisfying the relation \(Wy_k = s_k\). We assume that \(W = \bar{G}_k^{-1}\) where \(\bar{G}_k\) is the average Hessian defined by

\[\bar{G}_k = \left[ \int_0^1 \nabla^2 f(x_k + \tau\alpha_kp_k)d\tau \right]\]

The property

\[y_k = \bar{G}_k\alpha_kp_k = \bar{G}_ks_k\]

follows from Taylor’s theorem. With this weighting matrix and this norm, the unique solution is

\[\text{(DFP)} \;\;\; B_{k+1} = (I - \rho_ky_ks_k^\top)B_k(I - \rho_ks_ky_k^\top) + \rho_ky_ky_k^\top\]

with

\[\rho_k = \frac{1}{y_k^\top s_k}\]

This formula is called the DFP updating formula.

The inverse of \(B_k\), denoted by \(H_k = B_k^{-1}\) is useful in the implementation of the method. Using the Shereman-Morrison-Woodbury formula, we can derive the update of the inverse Hessian approximation \(H_k\)

\[\text{(DFP)} \;\;\; H_{k+1} = H_k - \frac{H_ky_ky_k^\top H_k}{y_k^\top H_ky_K} + \frac{s_ks_k^\top}{y_k^\top s_k}\]

The DFP updating formula was soon superseded by the BFGS formula. Instead of imposing conditions on \(B_k\), we impose similar conditions on their inverses \(H_k\). The secant condition is now written as

\[H_{k+1}y_k = s_k\]

The condition of closeness is given by

\[\begin{split}\min_H \; & \lVert H - H_k \rVert \\ \text{subject to } \; & H = H^\top, Hy_k = s_k\end{split}\]

The unique solution \(H_{k+1}\) is given by

\[\text{(BFGS)} \;\;\; H_{k+1} = (I - \rho_ks_ky_k^\top)H_k(I - \rho_ky_ks_k^\top) + \rho_ks_ks_k^\top\]

with

\[\rho_k = \frac{1}{y_k^\top s_k}\]
Algorithm 6.1 (BFGS Method).
Given starting point \(x_0\), convergence tolerance \(\epsilon > 0\), inverse Hessian approximation \(H_0\);
\(k \leftarrow 0\);
while \(\lVert \nabla f_k \rVert > \epsilon\);
Compute search direction \(p_k = -H_k\nabla f_k\);
Set \(x_{k+1} = x_k + \alpha_kp_k\) where \(\alpha_k\) is computed from a line search procedure to satisfy the Wolfe conditions;
Define \(s_k = x_{k+1} - x_k\) and \(y_k = \nabla f_{k+1} - \nabla f_k\);
Compute \(H_{k+1}\) with \(H_{k+1} = (I - \rho_ks_ky_k^\top)H_k(I - \rho_ky_ks_k^\top) + \rho_ks_ks_k^\top\);
\(k \leftarrow k + 1\);
end while

Each iteration can be performed at a cost of \(O(n^2)\) arithmetic operations and the rate of convergence is superlinear, which is fast enough for most practical purposes.

We can derive a version of the BFGS algorithm that works with the Hessian approximation \(B_k\) rather than \(H_k\). With Sherman-Morrison-Woodbury we have

\[\text{(BFGS)} \;\;\; B_{k+1} = B_k - \frac{B_ks_ks_k^\top B_k}{s_k^\top B_ks_k} + \frac{y_ky_k^\top}{y_k^\top s_k}\]

Properties of The BFGS Method

implementation