6.1 The BFGS Method¶
The BFGS method is named for its discoverers Broyden, Fletcher, Goldfarb, and Shanno.
We begin with the quadratic model of the objective function at the current iterate \(x_k\):
Here \(B_k\) is an \(n \times n\) symmetric positive definite matrix that will be updated at every iteration. The minimizer \(p_k\) of this model is given by
and is used as the search direction. The new iterate is
where \(\alpha_k\) is chosen to satisfy the Wolfe conditions.
Davidon proposed to update \(B_k\) in a simple manner to account for the curvature measured during the most recent step. Given \(x_{k+1}\), we wish to construct a new quadratic model, of the form
One resonable requirement of \(B_{k+1}\) is that the gradient of \(m_{k+1}\) should match the gradient of the objective function \(f\) at \(x_k\) and \(x_{k+1}\). Since \(\nabla m_{k+1}(0)\) is precisely \(\nabla f_{k+1}\), we only care about the first condition, written by
Then we obtain
To simplify the notation, we define the vectors
So we have
We refer to this formula as the secant equation.
The secant equation requires \(B_k\) maps \(s_k\) into \(y_k\). This is possible only if \(s_k\) and \(y_k\) satisfy the curvature condition
This does not always hold especially for nonconvex functions. However, it is guaranteed to hold if we impose the Wolfe or strong Wolfe conditions on the line search.
To determine \(B_k\) uniquely, we impose the additional condition that among all symmetric matrices satisfying the secant equation, :math:`B_k` is, in some sense, closet to the current matrix :math:`B_k`. In other words, we solve the problem
where \(s_k\) and \(y_k\) satisfy the curvature condition and \(B_k\) is symmtric and positive definite. A matrix norm that allows easy solution of the minimization problem and gives rise to a scale-invariant optimization method is the weighted Frobenius norm
where \(\lVert C \rVert_F = \sum_{i=1}^n \sum_{j=1}^n c_{ij}^2\). The weight matrix can be chosen as any matrix satisfying the relation \(Wy_k = s_k\). We assume that \(W = \bar{G}_k^{-1}\) where \(\bar{G}_k\) is the average Hessian defined by
The property
follows from Taylor’s theorem. With this weighting matrix and this norm, the unique solution is
with
This formula is called the DFP updating formula.
The inverse of \(B_k\), denoted by \(H_k = B_k^{-1}\) is useful in the implementation of the method. Using the Shereman-Morrison-Woodbury formula, we can derive the update of the inverse Hessian approximation \(H_k\)
The DFP updating formula was soon superseded by the BFGS formula. Instead of imposing conditions on \(B_k\), we impose similar conditions on their inverses \(H_k\). The secant condition is now written as
The condition of closeness is given by
The unique solution \(H_{k+1}\) is given by
with
Each iteration can be performed at a cost of \(O(n^2)\) arithmetic operations and the rate of convergence is superlinear, which is fast enough for most practical purposes.
We can derive a version of the BFGS algorithm that works with the Hessian approximation \(B_k\) rather than \(H_k\). With Sherman-Morrison-Woodbury we have