# Notes on Convex Optimization (5): Newton's Method

For $x\in\mathbf{dom}\ f$, the vector

is called the Newton step (for $f$, at $x$).

#### Minimizer of second-order approximation

The second-order Taylor approximation $\hat f$ of $f$ at $x$ is

$$\hat f(x+v) = f(x) + \nabla f(x)^T v + \frac12 v^T \nabla^2 f(x) v. \tag{1} \label{eq:1}$$

which is a convex quadratic function of $v$, and is minimized when $v=\Delta x_{nt}$. Thus, the Newton step $\Delta x_{nt}$ is what should be added to the point $x$ to minimize the second-order approximation of $f$ at $x$.

# Notes on Convex Optimization (4): Gradient Descent Method

### Descent methods

• $f(x^{(k+1)}) < f(x^{(k)})$
• $\Delta x$ is the step or search direction; $t$ is the step size or step length
• from convexity, $\nabla f(x)^T \Delta x <0$

General descent method.
given a starting point $x \in \mathbf{dom} \enspace f$.
repeat
- Determine a descent direction $\Delta x$.
- Line search. Choose a step size $t > 0$.
- Update. $x := x+ t\Delta x$.

until stopping criterion is satisfied

# Notes on Convex Optimization (3): Unconstrained Minimization Problems

Unconstrained optimization problems are defined as follows:

$$\text{minimize}\quad f(x) \tag{1} \label{eq:1}$$

where $f: \mathbf{R}^n \rightarrow \mathbf{R}$ is convex and twice continously differentiable (which implies that $\mathbf{dom}\enspace f$ is open). We denote the optimal value $\inf_xf(x)=f(x^\ast)$, as $p^\ast$. Since $f$ is differentiable and convex, a necessary and sufficient condition for a point $x^\ast$ to be optimal is

$$\nabla f(x^\ast)=0. \tag{2} \label{eq:2}$$

Thus, solving the unconstrained minimization problem \eqref{eq:1} is the same as finding a solution of \eqref{eq:2}, which is a set of $n$ equations in the $n$ variables $x_1, \dots, x_n$. Usually, the problem must be solved by an iterative algorithm. By this we mean an algorithm that computes a sequence of points $x^{(0)}, x^{(1)}, \dots \in \mathbf{dom}\enspace f$ with $f(x^{(k)})\rightarrow p^\ast$ as $k\rightarrow\infty$. The algorithm is terminated when $f(x^{k}) - p^\ast \le \epsilon$, where $\epsilon>0$ is some specified tolerance.

# [Notes on Mathematics for ESL] Chapter 10: Boosting and Additive Trees

### 10.5 Why Exponential Loss?

#### Derivation of Equation (10.16)

Since $Y\in{-1,1}$, we can expand the expectation as follows:

In order to minimize the expectation, we equal derivatives w.r.t. $f(x)$ as zero:

which gives:

# [Notes on Mathematics for ESL] Chapter 6: Kernel Smoothing Methods

### 6.1 One-Dimensional Kernel Smoothers

#### Notes on Local Linear Regression

Locally weighted regression solves a separate weighted least squares problem at each target point $x_0$:

The estimate is $\hat f(x_0)=\hat\alpha(x_0)+\hat\beta(x_0)x_0$. Define the vector-value function $b(x)^T=(1,x)$. Let $\mathbf{B}$ be the $N \times 2$ regression matrix with $i$th row row $b(x_i)^T$, $\mathbf{W}(x_0)$ the $N\times N$ diagonal matrix with $i$th diagonal element $K_\lambda (x_0, x_i)$, and $\theta=(\alpha(x_0), \beta(x_0))^T$.

Then the above optimization problem can be rewritten as

Equal the derivative w.r.t $\theta$ as zero, we get

# [Notes on Mathematics for ESL] Chapter 5: Basis Expansions and Regularization

### 5.4 Smoothing Splines

#### Derivation of Equation (5.12)

Equal the derivative of Equation (5.11) as zero, we get

Put the terms related to $\theta$ on one side and the others on the other side, we get

Multiply the inverse of $N^TN+\lambda\Omega_N$ on both sides completes the derivation of Equation (5.12)

# Notes on Convex Optimization (2): Convex Functions

### 1. Basic Properties and Examples

#### 1.1 Definition

$f:\mathbb{R}^n \rightarrow \mathbb R$ is convex if $\mathbf{dom}\ f$ is a convex set and

for all $x,y\in \mathbf{dom}\ f, 0\le\theta\le1$

• $f$ is concave if $-f$ is convex
• $f$ is strictly convex if $\mathbf{dom}\ f$ is convex and $% $ for $x,y\in\mathbf{dom}\ f,x\ne y, 0<\theta<1$

$f:\mathbb{R}^n \rightarrow \mathbb R$ is convex if and only if the function $g: \mathbb{R} \rightarrow \mathbb{R}$,

is convex (in $t$) for any $x \in \mathbf{dom}\ f, v\in\mathbb R^n$

# Notes on Convex Optimization (1): Convex Sets

### 1. Affine and Convex Sets

Suppose $x_1\ne x_2$ are two points in $\mathbb{R}^n$.

#### 1.1 Affine sets

line through $x_1$, $x_2$: all points

affine set: contains the line through any two distinct points in the set

# [Notes on Mathematics for ESL] Chapter 4: Linear Methods for Classification

### 4.3 Linear Discriminant Analysis

#### Derivation of Equation (4.9)

For that each class’s density follows multivariate Gaussian

Take the logarithm of $f_k(x)$, we get

where $c = -\log [(2\pi)^{p/2}\lvert\Sigma\rvert^{1/2}]$ and $\mu_k^T\Sigma^{-1}x=x^T\Sigma^{-1}\mu_k$. Following the above formula, we can derive Equation (4.9) easily

# [Notes on Mathematics for ESL] Chapter 3: Linear Regression Models and Least Squares

### 3.2 Linear Regression Models and Least Squares

#### Derivation of Equation (3.8)

The least squares estimate of $\beta$ is given by the book’s Equation (3.6)

From the previous post, we know that $\mathrm{E}(\mathbf{y})=X\beta$. As a result, we obtain

Then, we get

The variance of $\hat \beta$ is computed as

If we assume that the entries of $\mathbf{y}$ are uncorrelated and all have the same variance of $\sigma^2$, then $\mathrm{Var}(\varepsilon)=\sigma^2I_N$ and the above equation becomes

This completes the derivation of Equation (3.8).