For $x \in \operatorname{dom} f$, the vector
$$\Delta x_{\mathrm{nt}} = -\nabla^2 f(x)^{-1} \nabla f(x)$$
is called the Newton step (for $f$, at $x$).
Minimizer of second-order approximation
The second-order Taylor approximation $\hat{f}$ of $f$ at $x$ is
$$\hat{f}(x+v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2} v^T \nabla^2 f(x)\, v,$$
which is a convex quadratic function of $v$, and is minimized when $v = \Delta x_{\mathrm{nt}}$. Thus, the Newton step $\Delta x_{\mathrm{nt}}$ is what should be added to the point $x$ to minimize the second-order approximation of $f$ at $x$.
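As a concrete illustration, here is a minimal numpy sketch. The objective $f(x) = \tfrac{1}{2}x^T A x - \sum_i \log x_i$, the matrix $A$, and the point $x$ are made-up choices for the example (not taken from the text); the Newton step is computed by solving the linear system $\nabla^2 f(x)\, v = -\nabla f(x)$ rather than forming the inverse. The same `f`, `grad`, and `hess` are reused in the later sketches.

```python
import numpy as np

# Illustrative objective (an assumption, not from the text):
#   f(x) = (1/2) x^T A x - sum_i log(x_i),  dom f = {x : x > 0}.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def f(x):
    return np.inf if np.any(x <= 0) else 0.5 * x @ A @ x - np.sum(np.log(x))

def grad(x):
    return A @ x - 1.0 / x

def hess(x):
    return A + np.diag(1.0 / x**2)

x = np.array([1.0, 2.0])

# Newton step: solve  hess(x) v = -grad(x)  instead of forming the inverse.
dx_nt = np.linalg.solve(hess(x), -grad(x))

# v = dx_nt minimizes the quadratic model fhat(x + v), since the minimizer
# of fhat satisfies hess(x) v = -grad(x); the residual below is ~ 0.
print(dx_nt, np.linalg.norm(hess(x) @ dx_nt + grad(x)))
```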
Steepest descent direction in Hessian norm
The Newton step is also the steepest descent direction at $x$, for the quadratic norm defined by the Hessian $\nabla^2 f(x)$, i.e.,
$$\|u\|_{\nabla^2 f(x)} = \left(u^T \nabla^2 f(x)\, u\right)^{1/2}.$$
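One way to see this: the (unnormalized) steepest descent step for a general quadratic norm $\|\cdot\|_P$, with $P$ symmetric positive definite, is $\Delta x_{\mathrm{sd}} = -P^{-1}\nabla f(x)$; taking $P = \nabla^2 f(x)$ gives
$$\Delta x_{\mathrm{sd}} = -\nabla^2 f(x)^{-1} \nabla f(x) = \Delta x_{\mathrm{nt}}.$$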
Solution of linearized optimality condition
If we linearize the optimality condition $\nabla f(x^\star) = 0$ near $x$ we obtain
$$\nabla f(x+v) \approx \nabla f(x) + \nabla^2 f(x)\, v = 0,$$
which is a linear equation in $v$, with solution $v = \Delta x_{\mathrm{nt}}$. So the Newton step $\Delta x_{\mathrm{nt}}$ is what must be added to $x$ so that the linearized optimality condition holds.
Affine invariance of the Newton step
An important feature of the Newton step is that it is independent of linear changes of coordinates. Suppose $T \in \mathbf{R}^{n \times n}$ is nonsingular, and define $\bar{f}(y) = f(Ty)$. By the chain rule we have
$$\nabla \bar{f}(y) = T^T \nabla f(Ty) = T^T \nabla f(x),$$
where $x = Ty$; likewise, $\nabla^2 \bar{f}(y) = T^T \nabla^2 f(x)\, T$. The Newton step for $\bar{f}$ at $y$ is therefore
$$\Delta y_{\mathrm{nt}} = -\left(T^T \nabla^2 f(x)\, T\right)^{-1} \left(T^T \nabla f(x)\right) = -T^{-1} \nabla^2 f(x)^{-1} \nabla f(x) = T^{-1} \Delta x_{\mathrm{nt}},$$
where $\Delta x_{\mathrm{nt}}$ is the Newton step for $f$ at $x$. Hence the Newton steps of $f$ and $\bar{f}$ are related by the same linear transformation, and
$$x + \Delta x_{\mathrm{nt}} = T(y + \Delta y_{\mathrm{nt}}).$$
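As a quick numerical check of this invariance, the following continues the hypothetical example above (`grad` and `hess` are assumed already defined); the matrix $T$ and point $y$ are arbitrary choices.

```python
import numpy as np

T = np.array([[1.0, 2.0],
              [0.0, 3.0]])       # any nonsingular T will do
y = np.array([0.5, 0.4])
x = T @ y                        # x = Ty (stays in dom f, i.e. x > 0)

g, H = grad(x), hess(x)
dx_nt = np.linalg.solve(H, -g)                    # Newton step for f at x
dy_nt = np.linalg.solve(T.T @ H @ T, -(T.T @ g))  # Newton step for fbar at y

# Affine invariance:  x + dx_nt == T (y + dy_nt)
print(np.allclose(x + dx_nt, T @ (y + dy_nt)))    # True
```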
The Newton decrement
The quantity
$$\lambda(x) = \left(\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\right)^{1/2}$$
is called the Newton decrement at $x$. We can relate the Newton decrement to the quantity $f(x) - \inf_y \hat{f}(y)$, where $\hat{f}$ is the second-order approximation of $f$ at $x$:
$$f(x) - \inf_y \hat{f}(y) = f(x) - \hat{f}(x + \Delta x_{\mathrm{nt}}) = \tfrac{1}{2}\lambda(x)^2.$$
We can also express the Newton decrement as
$$\lambda(x) = \left(\Delta x_{\mathrm{nt}}^T \nabla^2 f(x)\, \Delta x_{\mathrm{nt}}\right)^{1/2}.$$
This shows that $\lambda$ is the norm of the Newton step, in the quadratic norm defined by the Hessian.
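A small sketch, again reusing the hypothetical `f`, `grad`, and `hess` defined earlier, that computes $\lambda(x)$ both ways and confirms that $f(x) - \hat{f}(x + \Delta x_{\mathrm{nt}}) = \lambda(x)^2/2$:

```python
import numpy as np

x = np.array([1.0, 2.0])
g, H = grad(x), hess(x)
dx_nt = np.linalg.solve(H, -g)

lam1 = np.sqrt(g @ np.linalg.solve(H, g))   # (grad^T H^{-1} grad)^{1/2}
lam2 = np.sqrt(dx_nt @ H @ dx_nt)           # Hessian-norm of the Newton step

fhat_min = f(x) + g @ dx_nt + 0.5 * dx_nt @ H @ dx_nt   # fhat(x + dx_nt)
print(np.isclose(lam1, lam2),
      np.isclose(f(x) - fhat_min, 0.5 * lam1**2))       # True True
```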
Newton’s Method
Newton’s method.
given a starting point $x \in \operatorname{dom} f$, tolerance $\epsilon > 0$.
repeat
- Compute the Newton step $\Delta x_{\mathrm{nt}}$ and decrement $\lambda(x)^2$.
- Stopping criterion. Quit if $\lambda(x)^2/2 \le \epsilon$.
- Line search. Choose a step size $t > 0$ by backtracking line search.
- Update. $x := x + t\,\Delta x_{\mathrm{nt}}$.
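The following Python sketch implements this loop, using the hypothetical objective from the earlier sketches; the backtracking parameters $\alpha = 0.1$, $\beta = 0.7$ and the iteration cap are arbitrary but typical choices, not prescribed by the text.

```python
import numpy as np

def newton_method(f, grad, hess, x0, eps=1e-10, alpha=0.1, beta=0.7,
                  max_iter=100):
    """Damped Newton's method with backtracking line search (a sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = np.linalg.solve(H, -g)        # Newton step
        lam2 = -g @ dx                     # lambda(x)^2 = grad^T H^{-1} grad
        if lam2 / 2.0 <= eps:              # stopping criterion
            return x
        t = 1.0                            # backtracking line search:
        while f(x + t * dx) > f(x) - alpha * t * lam2:
            t *= beta                      # shrink until sufficient decrease
        x = x + t * dx                     # update
    return x

# Usage with the illustrative f, grad, hess defined earlier (an assumption):
x_star = newton_method(f, grad, hess, np.array([1.0, 2.0]))
print(x_star, np.linalg.norm(grad(x_star)))   # gradient norm ~ 0 at x_star
```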
Summary
Newton’s method has several very strong advantages over gradient and steepest descent methods:
- Convergence of Newton’s method is rapid in general, and quadratic near $x^\star$. Once the quadratic convergence phase is reached, at most six or so iterations are required to produce a solution of very high accuracy.
- Newton’s method is affine invariant. It is insensitive to the choice of coordinates, or the condition number of the sublevel sets of the objective.
- The good performance of Newton’s method is not dependent on the choice of algorithm parameters. In contrast, the choice of norm for steepest descent plays a critical role in its performance.
The main disadvantage of Newton’s method is the cost of forming and storing the Hessian, and the cost of computing the Newton step, which requires solving a set of linear equations.