is called the *Newton step* (for $f$, at $x$).

The second-order Taylor approximation $\hat f$ of $f$ at $x$ is

\begin{equation} \hat f(x+v) = f(x) + \nabla f(x)^T v + \frac12 v^T \nabla^2 f(x) v, \tag{1} \label{eq:1} \end{equation}

which is a convex quadratic function of $v$, and is minimized when $v=\Delta x_{nt}$. Thus, the Newton step $\Delta x_{nt}$ is what should be added to the point $x$ to minimize the second-order approximation of $f$ at $x$.

The Newton step is also the steepest descent direction at $x$, for the quadratic norm defined by the Hessian $\nabla^2 f(x)$, *i.e.*,

$$ \lVert u \rVert_{\nabla^2 f(x)} = \left(u^T \nabla^2 f(x) u\right)^{1/2}. $$

If we linearize the optimality condition $\nabla f(x^*)=0$ near $x$ we obtain

$$ \nabla f(x+v) \approx \nabla f(x) + \nabla^2 f(x) v = 0, $$

which is a linear equation in $v$, with solution $v=\Delta x_{nt}$. So the Newton step $\Delta x_{nt}$ is what must be added to $x$ so that the linearized optimality condition holds.

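An important feature of the Newton step is that it is independent of linear changes of coordinates. Suppose $T\in \mathbf{R}^{n \times n}$ is nonsingular, and define $\bar f(y)=f(Ty)$. Then we have

$$ \nabla \bar f(y) = T^T \nabla f(x), $$

where $x=Ty$; likewise we have $\nabla^2 \bar f(y) = T^T\nabla^2f(x)T$. The Newton step for $\bar f$ at $y$ is therefore

$$ \Delta y_{nt} = -\left(T^T \nabla^2 f(x) T\right)^{-1} \left(T^T \nabla f(x)\right) = -T^{-1} \nabla^2 f(x)^{-1} \nabla f(x) = T^{-1} \Delta x_{nt}, $$

where $\Delta x_{nt}$ is the Newton step for $f$ at $x$. Hence the Newton steps of $f$ and $\bar f$ are related by the same linear transformation, and

$$ x + \Delta x_{nt} = T(y + \Delta y_{nt}). $$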

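The quantity

$$ \lambda(x) = \left(\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\right)^{1/2} $$

is called the *Newton decrement* at $x$. We can relate the Newton decrement to the quantity $f(x) - \inf_y \hat f(y)$, where $\hat f$ is the second-order approximation of $f$ at $x$:

$$ f(x) - \inf_y \hat f(y) = f(x) - \hat f(x + \Delta x_{nt}) = \frac12 \lambda(x)^2. $$

We can also express the Newton decrement as

$$ \lambda(x) = \left(\Delta x_{nt}^T \nabla^2 f(x) \Delta x_{nt}\right)^{1/2}. $$

This shows that $\lambda$ is the norm of the Newton step, in the quadratic norm defined by the Hessian.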

Newton’s method.

**given** a starting point $x \in \mathbf{dom} \enspace f$, tolerance $\epsilon > 0$.

**repeat**

- Compute the Newton step and decrement. $\Delta x_{nt} := -\nabla^2 f(x)^{-1} \nabla f(x)$; $\lambda^2 := \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)$.

- *Stopping criterion.* **quit** if $\lambda^2/2 \le \epsilon$.

- *Line search.* Choose a step size $t > 0$ by backtracking line search.

- *Update.* $x := x+ t\Delta x_{nt}$.
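The loop above can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `newton`, the default parameters, and the test objective (a sum of three exponentials of affine functions) are assumptions chosen for the example, not part of the algorithm statement.

```python
import numpy as np

def newton(f, grad, hess, x0, eps=1e-12, alpha=0.25, beta=0.5, max_iter=50):
    """Damped Newton iteration; alpha and beta are backtracking parameters."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)      # Newton step
        lam2 = -g @ dx                   # squared decrement g^T H^{-1} g
        if lam2 / 2 <= eps:              # stopping criterion
            break
        t = 1.0                          # backtracking line search
        # grad^T dx equals -lam2, so this is the usual sufficient-decrease test
        while not (f(x + t * dx) <= f(x) - alpha * t * lam2):
            t *= beta
        x = x + t * dx                   # update
    return x

# Hypothetical smooth convex test objective: f(x) = sum_i exp(a_i^T x + b_i)
A = np.array([[1.0, 3.0], [1.0, -3.0], [-1.0, 0.0]])
b = np.array([-0.1, -0.1, -0.1])
f = lambda x: np.sum(np.exp(A @ x + b))
grad = lambda x: A.T @ np.exp(A @ x + b)
hess = lambda x: A.T @ (np.exp(A @ x + b)[:, None] * A)

x_star = newton(f, grad, hess, [-1.0, 1.0])
print(x_star, f(x_star))
```

For this particular objective the minimizer can be found by hand ($x_2 = 0$ by symmetry and $e^{2x_1} = 1/2$), which gives a convenient check on the result.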

Newton’s method has several very strong advantages over gradient and steepest descent methods:

- Convergence of Newton’s method is rapid in general, and quadratic near $x^\ast$. Once the quadratic convergence phase is reached, at most six or so iterations are required to produce a solution of very high accuracy.
- Newton’s method is affine invariant. It is insensitive to the choice of coordinates, or the condition number of the sublevel sets of the objective.
- The good performance of Newton’s method is not dependent on the choice of algorithm parameters. In contrast, the choice of norm for steepest descent plays a critical role in its performance.

The main disadvantage of Newton’s method is the cost of forming and storing the Hessian, and the cost of computing the Newton step, which requires solving a set of linear equations.

- $f(x^{(k+1)}) < f(x^{(k)})$
- $\Delta x$ is the *step* or *search direction*; $t$ is the *step size* or *step length*
- from convexity, $f(x^{(k+1)}) < f(x^{(k)})$ implies $\nabla f(x)^T \Delta x <0$

General descent method.

**given** a starting point $x \in \mathbf{dom} \enspace f$.

**repeat**

- Determine a descent direction $\Delta x$.

- *Line search.* Choose a step size $t > 0$.

- *Update.* $x := x+ t\Delta x$.

**until** stopping criterion is satisfied

Backtracking line search.

**given** a descent direction $\Delta x$ for $f$ at $x \in \mathbf{dom} \enspace f$, $\alpha \in(0,0.5)$, $\beta\in(0,1)$.

**starting** at $t:=1$.

**while** $f(x+t\Delta x) > f(x) + \alpha t \nabla f(x)^T \Delta x$, $t:=\beta t$
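As a rough sketch, backtracking is a single loop. The function name `backtracking_line_search`, the default values of `alpha` and `beta`, and the quadratic test function below are assumptions made for illustration. The test is written as `not (... <= ...)` so that a `nan` or `inf` value (a trial point outside $\mathbf{dom}\enspace f$) also shrinks the step.

```python
import numpy as np

def backtracking_line_search(f, grad_fx, x, dx, alpha=0.3, beta=0.8):
    """Shrink t from 1 until f(x + t dx) <= f(x) + alpha * t * grad^T dx."""
    t = 1.0
    fx = f(x)
    slope = grad_fx @ dx                 # negative for a descent direction
    while not (f(x + t * dx) <= fx + alpha * t * slope):
        t *= beta
    return t

# Hypothetical example: an ill-scaled quadratic and the negative gradient direction
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: np.array([x[0], 10.0 * x[1]])

x = np.array([1.0, 1.0])
dx = -grad(x)
t = backtracking_line_search(f, grad(x), x, dx)
print(t, f(x), f(x + t * dx))
```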

A natural choice for the search direction is the negative gradient $\Delta x = - \nabla f(x)$.

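We must have $f(x^{(k)}) - p^\ast \le \epsilon$ after at most

$$ \frac{\log\left((f(x^{(0)}) - p^\ast)/\epsilon\right)}{\log(1/c)} $$

iterations of the gradient method with exact line search, where $c=1-m/M<1$.

The result for backtracking line search is similar to that for exact line search, except that $c=1 - \min\{2m\alpha, 2\beta\alpha m/M\} < 1$.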

- The gradient method often exhibits approximately linear convergence, *i.e.*, the error $f(x^{(k)}) - p^\ast$ converges to zero approximately as a geometric series.
- The choice of backtracking parameters $\alpha, \beta$ has a noticeable but not dramatic effect on the convergence. An exact line search sometimes improves the convergence of the gradient method, but the effect is not large.
- The convergence rate depends greatly on the condition number of the Hessian, or the sublevel sets. Convergence can be very slow, even for problems that are moderately well conditioned. When the condition number is larger, the gradient method is so slow that it is useless in practice.
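The dependence on the condition number is easy to see on a two-dimensional quadratic. The sketch below is an illustrative assumption, not taken from the text: it uses $f(x) = \frac12(x_1^2 + \gamma x_2^2)$, whose Hessian has condition number $\gamma$, the starting point $(\gamma, 1)$, and an exact line search, which has a closed form for quadratics. The helper name `gradient_method_iterations` is hypothetical.

```python
import numpy as np

def gradient_method_iterations(gamma, tol=1e-6, max_iter=100000):
    """Iterations needed to reduce f by the factor tol on
    f(x) = 0.5 * (x1^2 + gamma * x2^2) with exact line search."""
    H = np.diag([1.0, gamma])
    x = np.array([gamma, 1.0])
    f = lambda z: 0.5 * z @ H @ z
    f0 = f(x)
    k = 0
    while f(x) > tol * f0 and k < max_iter:
        g = H @ x                        # gradient of the quadratic
        t = (g @ g) / (g @ H @ g)        # exact minimizer of f(x - t g) over t
        x = x - t * g
        k += 1
    return k

for gamma in (10.0, 100.0):
    print(gamma, gradient_method_iterations(gamma))
```

The iteration count grows roughly in proportion to the condition number, which matches the last bullet above.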

The first-order Taylor approximation of $f(x+v)$ around $x$ is

$$ f(x+v) \approx \hat f(x+v) = f(x) + \nabla f(x)^T v. $$

The second term on the righthand side, $\nabla f(x)^T v$, is the *directional derivative* of $f$ at $x$ in the direction $v$. It gives the approximate change in $f$ for a small step $v$. The step $v$ is a descent direction if the directional derivative is negative.

Let $\lVert \cdot \rVert$ be any norm on $\mathbf{R}^n$. We define a *normalized steepest descent direction* as

\begin{equation} \Delta x_{nsd} = \arg\min\{\nabla f(x)^T v\ \vert\ \lVert v \rVert = 1\}. \tag{1}\label{eq:1} \end{equation}

It is also convenient to consider a steepest descent step $\Delta x_{sd}$ that is *unnormalized*, by scaling the normalized steepest descent direction in a particular way:

\begin{equation} \Delta x_{sd} = \lVert \nabla f(x) \rVert_\ast \Delta x_{nsd}, \tag{2}\label{eq:2} \end{equation}

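where $\lVert \cdot \rVert_\ast$ denotes the dual norm. Note that for the steepest descent step, we have

$$ \nabla f(x)^T \Delta x_{sd} = \lVert \nabla f(x) \rVert_\ast \nabla f(x)^T \Delta x_{nsd} = -\lVert \nabla f(x) \rVert_\ast^2. $$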

To simplify the notation, we can look at the problem $\min_v\{u^Tv\ \vert\ \lVert v \rVert \le 1\}$ with $u=\nabla f(x)$, which is equivalent to finding the normalized steepest descent direction.

For the Euclidean norm, the Cauchy-Schwarz inequality gives $\lvert u^Tv\rvert \le \lVert u \rVert \lVert v \rVert$, hence it is easy to see that the minimum is $\min_v\{u^Tv\ \vert\ \lVert v \rVert \le 1\}=-\lVert u \rVert$, and the minimizer is $v=-u/\lVert u \rVert$. As a result, the steepest descent direction is simply the negative gradient, *i.e.*, $\Delta x_{sd} = - \nabla f(x)$.

We consider the quadratic norm

$$ \lVert z \rVert_P = (z^TPz)^{1/2} = \lVert P^{1/2}z \rVert_2, $$

where $P \in \mathbf{S}_{++}^n$. The problem is now $\min_v\{u^Tv\ \vert\ \lVert P^{1/2}v\rVert\le1\}=\min_v\{u^Tv\ \vert\ \lVert\delta\rVert\le1, v=P^{-1/2}\delta\}$. This is equivalent to $\min_\delta\{((P^{-1/2})^Tu)^T\delta\ \vert\ \lVert\delta\rVert\le1\}$. The Euclidean case above shows that the minimum is $-\lVert (P^{-1/2})^Tu\rVert$, while the maximum $\lVert (P^{-1/2})^Tu\rVert$ is the dual norm $\lVert u \rVert_\ast$ by definition. The minimizer is $v=P^{-1/2}\delta=-P^{-1}u/\lVert (P^{-1/2})^Tu\rVert$, so the steepest descent step is given by

$$ \Delta x_{sd} = \lVert \nabla f(x) \rVert_\ast \Delta x_{nsd} = -P^{-1}\nabla f(x). $$

In addition, the steepest descent method in the quadratic norm $\lVert \cdot \rVert_P$ can be thought of as the gradient method applied to the problem after the change of coordinates $\bar x=P^{1/2}x$.

Let $i$ be any index for which $\lVert \nabla f(x) \rVert_\infty = \lvert (\nabla f(x))_i \rvert$. Then a normalized steepest descent direction $\Delta x_{nsd}$ for the $l_1$-norm is given by

$$ \Delta x_{nsd} = -\mathbf{sign}\left(\frac{\partial f(x)}{\partial x_i}\right) e_i, $$

where $e_i$ is the $i$th standard basis vector. An unnormalized steepest descent step is then

$$ \Delta x_{sd} = \Delta x_{nsd} \lVert \nabla f(x) \rVert_\infty = -\frac{\partial f(x)}{\partial x_i} e_i. $$

The steepest descent algorithm in the $l_1$-norm has a very natural interpretation: at each iteration we select a component of $\nabla f(x)$ with maximum absolute value, and then decrease or increase the corresponding component of $x$, according to the sign of $(\nabla f(x))_i$. The algorithm is sometimes called a *coordinate-descent* algorithm, since only one component of the variable $x$ is updated at each iteration.
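A small sketch of the $l_1$-norm step makes the coordinate-descent reading concrete. The helper name `l1_steepest_descent_step` and the sample gradient are assumptions for illustration; ties in the maximum are broken by taking the first index.

```python
import numpy as np

def l1_steepest_descent_step(g):
    """Unnormalized steepest descent step for the l1-norm:
    move only along the coordinate with the largest |gradient| entry."""
    i = int(np.argmax(np.abs(g)))        # index attaining the infinity norm
    step = np.zeros_like(g, dtype=float)
    step[i] = -g[i]
    return step

g = np.array([1.0, -3.0, 2.0])           # hypothetical gradient at the current point
step = l1_steepest_descent_step(g)
print(step)                               # only the second coordinate moves
print(g @ step, -np.max(np.abs(g)) ** 2)  # both equal minus the squared dual norm
```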

\begin{equation} \text{minimize}\quad f(x) \tag{1} \label{eq:1} \end{equation}

where $f: \mathbf{R}^n \rightarrow \mathbf{R}$ is convex and twice continuously differentiable (which implies that $\mathbf{dom}\enspace f$ is open). We denote the optimal value, $\inf_xf(x)=f(x^\ast)$, by $p^\ast$. Since $f$ is differentiable and convex, a necessary and sufficient condition for a point $x^\ast$ to be optimal is

\begin{equation} \nabla f(x^\ast)=0. \tag{2} \label{eq:2} \end{equation}

Thus, solving the unconstrained minimization problem \eqref{eq:1} is the same as finding a solution of \eqref{eq:2}, which is a set of $n$ equations in the $n$ variables $x_1, \dots, x_n$. Usually, the problem must be solved by an iterative algorithm. By this we mean an algorithm that computes a sequence of points $x^{(0)}, x^{(1)}, \dots \in \mathbf{dom}\enspace f$ with $f(x^{(k)})\rightarrow p^\ast$ as $k\rightarrow\infty$. The algorithm is terminated when $f(x^{(k)}) - p^\ast \le \epsilon$, where $\epsilon>0$ is some specified tolerance.

The starting point $x^{(0)}$ must lie in $\mathbf{dom}\enspace f$, and in addition the sublevel set

$$ S = \{x\in\mathbf{dom}\enspace f \mid f(x) \le f(x^{(0)})\} $$

must be closed. This condition is satisfied for all $x^{(0)}\in\mathbf{dom}\enspace f$ if the function $f$ is closed.

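Note: 1) Continuous functions with $\mathbf{dom}\enspace f=\mathbf{R}^n$ are closed; 2) another important class of closed functions consists of continuous functions with open domains for which $f(x)$ tends to infinity as $x$ approaches the boundary of $\mathbf{dom}\enspace f$.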

The general convex quadratic minimization problem has the form

$$ \text{minimize}\quad \frac12 x^TPx + q^Tx + r, $$

where $P\in\mathbf{S}_+^n$, $q\in\mathbf{R}^n$, and $r\in\mathbf{R}$. This problem can be solved via the optimality conditions, $Px+q=0$, which is a set of linear equations.

One special case of the quadratic minimization problem that arises very frequently is the least-squares problem

$$ \text{minimize}\quad \lVert Ax-b \rVert_2^2 = x^T(A^TA)x - 2(A^Tb)^Tx + b^Tb. $$

The optimality conditions

$$ A^TAx^\ast = A^Tb $$

are called the *normal equations* of the least-squares problem.
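As a quick numerical illustration, the normal equations $A^TAx = A^Tb$ can be solved directly and compared with a library least-squares routine. The matrix `A` and vector `b` below are made-up data; in practice a QR- or SVD-based solver such as `np.linalg.lstsq` is usually preferred because forming $A^TA$ squares the condition number.

```python
import numpy as np

# Hypothetical data: fit a line through three points
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Solve the normal equations A^T A x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from NumPy's least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal, x_lstsq)
```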

where the domain of $f$ is the open set

where $F: \mathbf{R}^n\rightarrow\mathbf{S}^p$ is affine. Here the domain of $f$ is

The objective function is *strongly convex* on $S$, which means that there exists an $m>0$ such that

\begin{equation} \nabla^2f(x) \succeq mI \tag{3} \label{eq:3} \end{equation}

for all $x\in S$. For $x, y \in S$ we have

$$ f(y) = f(x) + \nabla f(x)^T(y-x) + \frac12 (y-x)^T \nabla^2 f(z) (y-x) $$

for some $z$ on the line segment $[x, y]$. By the strong convexity assumption \eqref{eq:3}, the last term on the righthand side is at least $(m/2)\lVert y-x\rVert^2_2$, so we have the inequality

\begin{equation} f(y) \ge f(x) + \nabla f(x)^T(y-x) + \frac{m}2\lVert y-x \rVert_2^2 \tag{4} \label{eq:4} \end{equation}

for all $x$ and $y$ in $S$.

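Setting the gradient of the righthand side of \eqref{eq:4} with respect to $y$ equal to zero, we find that $\tilde y = x-(1/m)\nabla f(x)$ minimizes the righthand side. Therefore we have

$$ \begin{aligned} f(y) &\ge f(x) + \nabla f(x)^T(y-x) + \frac{m}2\lVert y-x \rVert_2^2 \\ &\ge f(x) + \nabla f(x)^T(\tilde y-x) + \frac{m}2\lVert \tilde y-x \rVert_2^2 \\ &= f(x) - \frac1{2m}\lVert \nabla f(x)\rVert_2^2. \end{aligned} $$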

Since this holds for any $y\in S$, we have

\begin{equation} p^* \ge f(x) - \frac1{2m}\lVert \nabla f(x)\rVert_2^2 \tag{5} \label{eq:5} \end{equation}

This inequality shows that if the gradient is small at a point, then the point is nearly optimal.

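Apply \eqref{eq:4} with $y=x^\ast$ to obtain

$$ \begin{aligned} p^\ast = f(x^\ast) &\ge f(x) + \nabla f(x)^T(x^\ast-x) + \frac{m}2\lVert x^\ast-x \rVert_2^2 \\ &\ge f(x) - \lVert \nabla f(x) \rVert_2 \lVert x^\ast-x \rVert_2 + \frac{m}2\lVert x^\ast-x \rVert_2^2, \end{aligned} $$

where we use the Cauchy-Schwarz inequality in the second inequality. Since $p^\ast \le f(x)$, we must have

$$ -\lVert \nabla f(x) \rVert_2 \lVert x^\ast-x \rVert_2 + \frac{m}2\lVert x^\ast-x \rVert_2^2 \le 0. $$

Therefore, we have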

\begin{equation} \lVert x - x^\ast \rVert_2 \le \frac2m\lVert \nabla f(x) \rVert_2. \tag{6}\label{eq:6} \end{equation}

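If there are two optimal points $x^\ast_1, x^\ast_2$, then according to \eqref{eq:6},

$$ \lVert x_1^\ast - x_2^\ast \rVert_2 \le \frac2m \lVert \nabla f(x_1^\ast) \rVert_2 = 0. $$

Hence $x_1^\ast = x_2^\ast$, *i.e.*, the optimal point $x^\ast$ is unique.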

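There exists a constant $M$ such that

$$ \nabla^2 f(x) \preceq MI $$

for all $x \in S$. This upper bound on the Hessian implies for any $x, y \in S$,

$$ f(y) \le f(x) + \nabla f(x)^T(y-x) + \frac{M}2\lVert y-x \rVert_2^2; $$

minimizing each side over $y$ yields

$$ p^\ast \le f(x) - \frac1{2M}\lVert \nabla f(x)\rVert_2^2. $$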

The ratio $\kappa=M/m$ is an upper bound on the condition number of the matrix $\nabla^2 f(x)$, *i.e.*, the ratio of its largest eigenvalue to its smallest eigenvalue.

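We define the *width* of a convex set $C \subseteq \mathbf{R}^n$, in the direction $q$, where $\lVert q \rVert_2 = 1$, as

$$ W(C, q) = \sup_{z\in C} q^Tz - \inf_{z \in C} q^Tz. $$

The *minimum width* and *maximum width* of $C$ are given by

$$ W_{\min} = \inf_{\lVert q \rVert_2 = 1} W(C, q), \qquad W_{\max} = \sup_{\lVert q \rVert_2 = 1} W(C, q). $$

The *condition number* of the convex set $C$ is defined as

$$ \mathbf{cond}(C) = \frac{W_{\max}^2}{W_{\min}^2}. $$

Suppose $f$ satisfies $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x\in S$. The condition number of the $\alpha$-sublevel set $C_\alpha=\{x \mid f(x) \le \alpha\}$, where $p^\ast < \alpha \le f(x^{(0)})$, is bounded by

$$ \mathbf{cond}(C_\alpha) \le \frac{M}{m}. $$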

It must be kept in mind that the constants $m$ and $M$ are known only in rare cases, so they cannot be used in a practical stopping criterion.

Since $Y\in\{-1,1\}$, we can expand the expectation as follows:

In order to minimize the expectation, we set the derivative with respect to $f(x)$ to zero:

which gives:

If $Y=1$, then $Y’=1$, which gives

Likewise, if $Y=-1$, then $Y’=0$, which gives

As a result, the *binomial log-likelihood loss* is equivalent to the *deviance*. In the language of neural networks, the *cross-entropy* is equivalent to the *softplus*. The only difference is that $0$ is used to indicate negative examples in *cross-entropy*; while $-1$ is used in *softplus*.
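The label-coding remark can be checked numerically. The sketch below is written in terms of a generic logit `z`, which is an assumption of this example (with a parameterization $p = 1/(1+e^{-2f})$, `z` would play the role of $2f(x)$). It compares cross-entropy with $0/1$ labels against the softplus form with $\pm1$ labels.

```python
import numpy as np

def softplus(z):
    """log(1 + exp(z)), computed stably."""
    return np.logaddexp(0.0, z)

def cross_entropy(y01, z):
    """Negative Bernoulli log-likelihood with labels in {0, 1} and logit z."""
    p = 1.0 / (1.0 + np.exp(-z))
    return -(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

z = np.array([-2.0, -0.5, 0.0, 1.5])     # hypothetical logits
y_pm = np.array([1, -1, 1, -1])          # labels coded as -1 / +1
y01 = (y_pm + 1) // 2                    # the same labels coded as 0 / 1

print(cross_entropy(y01, z))
print(softplus(-y_pm * z))               # identical values
```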

This section explains the choice of loss functions for both classification and regression. It gives a very direct explanation of why squared loss is undesirable for classification. Highly recommended!

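Locally weighted regression solves a separate weighted least squares problem at each target point $x_0$:

$$ \min_{\alpha(x_0), \beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\left[y_i - \alpha(x_0) - \beta(x_0)x_i\right]^2. $$

The estimate is $\hat f(x_0)=\hat\alpha(x_0)+\hat\beta(x_0)x_0$. Define the vector-valued function $b(x)^T=(1,x)$. Let $\mathbf{B}$ be the $N \times 2$ regression matrix with $i$th row $b(x_i)^T$, $\mathbf{W}(x_0)$ the $N\times N$ diagonal matrix with $i$th diagonal element $K_\lambda (x_0, x_i)$, and $\theta=(\alpha(x_0), \beta(x_0))^T$.

Then the above optimization problem can be rewritten as

$$ \min_\theta\ (\mathbf{y} - \mathbf{B}\theta)^T \mathbf{W}(x_0) (\mathbf{y} - \mathbf{B}\theta). $$

Setting the derivative with respect to $\theta$ to zero, we get

$$ \hat\theta = \left(\mathbf{B}^T\mathbf{W}(x_0)\mathbf{B}\right)^{-1}\mathbf{B}^T\mathbf{W}(x_0)\mathbf{y}. $$

Then

$$ \hat f(x_0) = b(x_0)^T\hat\theta = b(x_0)^T\left(\mathbf{B}^T\mathbf{W}(x_0)\mathbf{B}\right)^{-1}\mathbf{B}^T\mathbf{W}(x_0)\mathbf{y} = \sum_{i=1}^N l_i(x_0)y_i. $$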

The book claims that $\sum_{i=1}^Nl_i(x_0)=1$ and $\sum_{i=1}^N(x_i-x_0)l_i(x_0)=0$, so that the bias $\text{E}(\hat f(x_0))-f(x_0)$ depends only on quadratic and higher-order terms in the expansion of $f$. However, the proof is not given. Here I will give the detailed derivations of these two equations.

First, define the following terms:

Then, we can represent the estimate as

When $y=\mathbf{1}$, $m_0=S_0$ and $m_1=S_1$, we get

When $y=\mathbf{x}-x_0$,

More generally, if a local polynomial of degree $d$ is fitted instead of a local line, the same argument shows that $\sum_{i=1}^N(x_i-x_0)^pl_i(x_0)=0$ for $p=1,\dots,d$.
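The two identities for the local linear fit can also be checked numerically. The sketch below assumes a Gaussian kernel for $K_\lambda$, an evenly spaced grid of inputs, and the hypothetical helper name `equivalent_kernel`; none of these choices come from the derivation, which holds for any positive weights.

```python
import numpy as np

def equivalent_kernel(x, x0, lam):
    """Weights l_i(x0) of local linear regression:
    l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W, with b(x)^T = (1, x)."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)       # Gaussian kernel weights (assumed)
    B = np.column_stack([np.ones_like(x), x])      # N x 2 regression matrix
    W = np.diag(w)
    b0 = np.array([1.0, x0])
    return b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)

x = np.linspace(0.0, 1.0, 11)                      # hypothetical inputs
x0 = 0.37
l = equivalent_kernel(x, x0, lam=0.2)

print(l.sum())                       # zeroth moment
print(((x - x0) * l).sum())          # first moment
print(((x - x0) ** 2 * l).sum())     # second moment, generally nonzero for a local line
```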

We only prove the case where the input $x$ is one-dimensional. A similar strategy works for higher-dimensional inputs, though the algebra is a bit more involved; try it if you’re interested. Have fun!

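Setting the derivative of **Equation (5.11)** with respect to $\theta$ to zero, we get

$$ -2N^T(\mathbf{y} - N\theta) + 2\lambda\Omega_N\theta = 0. $$

Putting the terms related to $\theta$ on one side and the others on the other side, we get

$$ (N^TN + \lambda\Omega_N)\theta = N^T\mathbf{y}. $$

Multiplying both sides by the inverse of $N^TN+\lambda\Omega_N$ completes the derivation of **Equation (5.12)**:

$$ \hat\theta = (N^TN + \lambda\Omega_N)^{-1}N^T\mathbf{y}. $$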

It’s a little confusing to get **Equation (5.18)** directly from **Equation (5.17)** and its original form **Equation (5.11)**. In order to give a clear explanation, here we give the proof of the equation,

which are the different terms between **Equation (5.11)** and **Equation (5.18)**.

We know that

following **Equation (5.14)** in the book.

From **Equation (5.17)**, we can get

Plugging the above two equations into the right-hand side of the equation to be proved, we get

which completes the proof.

$f:\mathbb{R}^n \rightarrow \mathbb R$ is convex if $\mathbf{dom}\ f$ is a convex set and

$$ f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y) $$

for all $x,y\in \mathbf{dom}\ f, 0\le\theta\le1$

- $f$ is concave if $-f$ is convex
- $f$ is strictly convex if $\mathbf{dom}\ f$ is convex and $f(\theta x + (1-\theta)y) < \theta f(x) + (1-\theta)f(y)$ for $x,y\in\mathbf{dom}\ f,x\ne y, 0<\theta<1$

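$f:\mathbb{R}^n \rightarrow \mathbb R$ is convex if and only if the function $g: \mathbb{R} \rightarrow \mathbb{R}$,

$$ g(t) = f(x+tv), \qquad \mathbf{dom}\ g = \{t \mid x+tv \in \mathbf{dom}\ f\}, $$

is convex (in $t$) for any $x \in \mathbf{dom}\ f, v\in\mathbb R^n$

the extended-value extension $\tilde f$ of $f$ is

$$ \tilde f(x) = f(x),\quad x\in\mathbf{dom}\ f; \qquad \tilde f(x) = \infty,\quad x\notin\mathbf{dom}\ f $$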

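**1st-order condition**: differentiable $f$ with convex domain is convex iff

$$ f(y) \ge f(x) + \nabla f(x)^T(y-x) \quad \text{for all } x,y\in\mathbf{dom}\ f $$

**2nd-order conditions**: for twice differentiable $f$ with convex domain
- $f$ is convex if and only if $\nabla^2f(x)\succeq0$ for all $x\in\mathbf{dom}\ f$
- if $\nabla^2f(x)\succ0$ for all $x\in\mathbf{dom}\ f$, then $f$ is strictly convex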
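The second-order condition lends itself to a numerical spot check. The sketch below uses log-sum-exp, $f(x)=\log\sum_i e^{x_i}$, as an assumed example; its Hessian is $\mathbf{diag}(p) - pp^T$ with $p$ the softmax of $x$. Evaluating eigenvalues at a few points is a sanity check, not a proof of convexity.

```python
import numpy as np

def logsumexp_hessian(x):
    """Hessian of f(x) = log(sum(exp(x))): diag(p) - p p^T, p = softmax(x)."""
    z = np.exp(x - np.max(x))            # shift for numerical stability
    p = z / z.sum()
    return np.diag(p) - np.outer(p, p)

# Hypothetical test points
points = [np.array([0.0, 0.0, 0.0]),
          np.array([1.0, -2.0, 0.5]),
          np.array([10.0, -10.0, 3.0])]

min_eigs = [np.linalg.eigvalsh(logsumexp_hessian(x)).min() for x in points]
print(min_eigs)   # nonnegative up to rounding: the Hessian is positive semidefinite
```

The smallest eigenvalue is zero (along the all-ones direction), so this function is convex but the strict condition $\nabla^2 f(x)\succ0$ does not hold.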

$\alpha$**-sublevel set** of $f: \mathbb R^n \rightarrow \mathbb R$:

$$ C_\alpha = \{x\in\mathbf{dom}\ f\mid f(x)\le\alpha\} $$

sublevel sets of convex functions are convex (the converse is false)

If $f$ is concave, then its $\alpha$**-superlevel set**, given by $\{x\in\mathbf{dom}\ f\mid f(x)\ge\alpha\}$, is a convex set

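**epigraph** of $f:\mathbb R^n \rightarrow \mathbb R$:

$$ \mathbf{epi}\ f = \{(x,t)\in\mathbb R^{n+1} \mid x\in\mathbf{dom}\ f, f(x)\le t\} $$

$f$ is convex if and only if $\mathbf{epi}\ f$ is a convex set

$f$ is concave if and only if its **hypograph**, defined as

$$ \mathbf{hypo}\ f = \{(x,t) \mid t\le f(x)\}, $$

is a convex set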

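**Jensen’s Inequality**: if $f$ is convex, then for $0\le\theta\le1$,

$$ f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y) $$

**extension**: if $f$ is convex, then

$$ f(\mathbf{E}\,z) \le \mathbf{E}\,f(z) $$

for any random variable $z$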

**nonnegative multiple**: $\alpha f$ is convex if $f$ is convex, $\alpha \ge 0$

**sum**: $f_1+f_2$ convex if $f_1,f_2$ convex (extends to infinite sums, integrals)

**composition with affine function**: $f(Ax+b)$ is convex if $f$ is convex

**pointwise maximum**: if $f_1,\dots,f_m$ are convex, then $f(x)=\max\{f_1(x),\dots,f_m(x)\}$ is convex

**pointwise supremum**: if $f(x,y)$ is convex in $x$ for each $y\in\mathcal{A}$, then

$$ g(x) = \sup_{y\in\mathcal{A}} f(x,y) $$

is convex

similarly, the **pointwise infimum** of a set of concave functions is a concave function

**composition with scalar functions**: composition of $g: \mathbb R^n \rightarrow \mathbb R$ and $h: \mathbb R\rightarrow \mathbb R$:

$$ f(x) = h(g(x)) $$

$f$ is convex if $g$ convex, $h$ convex, $\tilde h$ nondecreasing; or $g$ concave, $h$ convex, $\tilde h$ nonincreasing

Note: monotonicity must hold for extended-value extension $\tilde h$

**vector composition**: composition of $g:\mathbb R^n \rightarrow \mathbb R^k$ and $h:\mathbb R^k \rightarrow \mathbb R$:

$$ f(x) = h(g(x)) = h(g_1(x),\dots,g_k(x)) $$

$f$ is convex if $g_i$ convex, $h$ convex, $\tilde h$ nondecreasing in each argument; or $g_i$ concave, $h$ convex, $\tilde h$ nonincreasing in each argument

**minimization**: if $f(x,y)$ is convex in $(x,y)$ and $C$ is a convex set then

$$ g(x) = \inf_{y\in C} f(x,y) $$

is convex

**perspective**: the **perspective** of a function $f:\mathbb R^n \rightarrow \mathbb R$ is the function $g:\mathbb R^n \times \mathbb R \rightarrow \mathbb R$,

$$ g(x,t) = tf(x/t), \qquad \mathbf{dom}\ g = \{(x,t) \mid x/t\in\mathbf{dom}\ f, t>0\} $$

$g$ is convex if $f$ is convex

the **conjugate** of a function $f$ is

$$ f^*(y) = \sup_{x\in\mathbf{dom}\ f}\left(y^Tx - f(x)\right) $$

- $f^*$ is convex whether or not $f$ is convex

**conjugate of the conjugate**: if $f$ is convex and closed, then $f^{**}=f$

**differentiable functions**: The conjugate of a differentiable function $f$ is also called the *Legendre transform* of $f$. Suppose $f$ is convex and differentiable with $\mathbf{dom}\ f=\mathbb{R}^n$. Let $z\in\mathbb{R}^n$ be arbitrary and define $y=\nabla f(z)$, then we have

$$ f^*(y) = z^T\nabla f(z) - f(z) $$

**scaling and composition with affine transformation**:
For $a>0$ and $b\in\mathbb{R}$, the conjugate of $g(x)=af(x)+b$ is $g^*(y)=af^*(y/a)-b$.

Suppose $A\in\mathbb{R}^{n\times n}$ is nonsingular and $b\in\mathbb{R}^n$. Then the conjugate of $g(x)=f(Ax+b)$ is

$$ g^*(y) = f^*(A^{-T}y) - b^TA^{-T}y $$

with $\mathbf{dom}\ g^*=A^T\mathbf{dom}\ f^*$

**sums of independent functions**: if $f(u,v)=f_1(u)+f_2(v)$, where $f_1$ and $f_2$ are convex functions with conjugates $f_1^*$ and $f_2^*$, respectively, then

$$ f^*(w,z) = f_1^*(w) + f_2^*(z) $$

$f:\mathbb R^n\rightarrow \mathbb R$ is quasiconvex if $\mathbf{dom}\ f$ is convex and the sublevel sets

$$ S_\alpha = \{x\in\mathbf{dom}\ f\mid f(x)\le\alpha\} $$

are convex for all $\alpha$

- $f$ is quasiconcave if $-f$ is quasiconvex
- $f$ is quasilinear if it is quasiconvex and quasiconcave
- convex functions are quasiconvex, but the converse is not true

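**modified Jensen inequality**: for quasiconvex $f$

$$ 0\le\theta\le1 \implies f(\theta x+(1-\theta)y) \le \max\{f(x),f(y)\} $$

**first-order condition**: differentiable $f$ with convex domain is quasiconvex iff

$$ f(y)\le f(x) \implies \nabla f(x)^T(y-x)\le0 $$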

**operations that preserve quasiconvexity**:

- nonnegative weighted maximum
- composition
- minimization

**sums** of quasiconvex functions are not necessarily quasiconvex

a positive function $f$ is log-concave if $\log f$ is concave:

$$ f(\theta x+(1-\theta)y) \ge f(x)^\theta f(y)^{1-\theta} \quad \text{for } 0\le\theta\le1 $$

$f$ is log-convex if $\log f$ is convex

**properties of log-concave functions**:

- twice differentiable $f$ with convex domain is log-concave iff

$$ f(x)\nabla^2f(x) \preceq \nabla f(x)\nabla f(x)^T $$

for all $x\in\mathbf{dom}\ f$

- product of log-concave functions is log-concave
- sum of log-concave functions is not always log-concave; however, log-convexity is preserved under sums
- integration: if $f:\mathbb R^n\times\mathbb R^m \rightarrow \mathbb R$ is log-concave, then

$$ g(x) = \int f(x,y)\,dy $$

is log-concave

**consequences of integration property**:

- convolution $f*g$ of log-concave functions $f,g$ is log-concave

- if $C\subseteq \mathbb R^n$ convex and $y$ is a random variable with log-concave pdf then

$$ f(x) = \mathbf{prob}(x+y\in C) $$

is log-concave

$f:\mathbb{R}^n\rightarrow\mathbb{R}^m$ is $K$-convex if $\mathbf{dom}\ f$ is convex and

$$ f(\theta x+(1-\theta)y) \preceq_K \theta f(x)+(1-\theta)f(y) $$

for $x,y\in\mathbf{dom}\ f,0\le\theta\le1$

Suppose $x_1\ne x_2$ are two points in $\mathbb{R}^n$.

**line** through $x_1$, $x_2$: all points

$$ x = \theta x_1 + (1-\theta)x_2 \quad (\theta\in\mathbb{R}) $$

**affine set**: contains the line through any two distinct points in the set

**line segment** between $x_1$ and $x_2$: all points

$$ x = \theta x_1 + (1-\theta)x_2 $$

with $0\leq\theta\leq1$

**convex set**: contains line segment between any two points in the set

**convex combination** of $x_1,\dots,x_k$: any point $x$ of the form

$$ x = \theta_1x_1 + \theta_2x_2 + \dots + \theta_kx_k $$

with $\theta_1+\dots+\theta_k=1,\theta_i \geq 0$

**convex hull** of a set $C$, denoted $\mathbf{conv}\ C$: set of all convex combinations of points in $C$

**conic combination** of $x_1$ and $x_2$: any point of the form

$$ x = \theta_1x_1 + \theta_2x_2 $$

with $\theta_1 \geq 0, \theta_2 \geq 0$

**convex cone**: set that contains all conic combinations of points in the set

**hyperplane**: set of the form $\{x\mid a^Tx=b\}$ $(a\ne0)$

**halfspace**: set of the form $\{x\mid a^Tx\leq b\}$ $(a\ne0)$

- $a$ is the normal vector
- hyperplanes are affine and convex; halfspaces are convex

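**(Euclidean) ball** with center $x_c$ and radius $r$:

$$ B(x_c,r) = \{x \mid \lVert x-x_c\rVert_2\le r\} = \{x_c+ru \mid \lVert u\rVert_2\le1\} $$

**ellipsoid**: set of the form

$$ \{x \mid (x-x_c)^TP^{-1}(x-x_c)\le1\} $$

with $P\in \mathbf{S}^n_{++}$ (*i.e.*, $P$ symmetric positive definite)

another representation: $\{x_c+Au\mid \lVert u\rVert_2\le1\}$ with $A$ square and nonsingular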

- Euclidean balls and ellipsoids are all convex.

**norm**: a function $\lVert \cdot \rVert$ that satisfies

- $\lVert x \rVert \geq 0$; $\lVert x \rVert=0$ if and only if $x=0$
- $\lVert tx \rVert = \lvert t \rvert \lVert x \rVert$ for $t\in \mathbb{R}$
- $\lVert x+y\rVert \leq \lVert x \rVert+\lVert y \rVert$

**norm ball** with center $x_c$ and radius $r$: $\{x \mid \lVert x-x_c\rVert\le r\}$

**norm cone**: $\{(x,t) \mid \lVert x\rVert\le t\}$

- norm balls and cones are convex
- norm cones (as the name suggests) are convex cones

**polyhedra**: solution set of finitely many linear inequalities and equalities

$$ Ax\preceq b, \qquad Cx=d $$

($A\in \mathbb{R}^{m\times n}$, $C\in\mathbb{R}^{p\times n}$, $\preceq$ is componentwise inequality)

- polyhedron is intersection of finite number of halfspaces and hyperplanes

**positive semidefinite cone**:

- $\mathbf{S}^n$ is set of symmetric $n\times n$ matrices
- $\mathbf{S}^n_+=\{X\in\mathbf{S}^n\mid X\succeq0\}$: positive semidefinite $n\times n$ matrices; $\mathbf{S}^n_+$ is a convex cone
- $\mathbf{S}^n_{++}=\{X\in\mathbf{S}^n\mid X\succ0\}$: positive definite $n\times n$ matrices

**intersection**: the intersection of (any number of) convex sets is convex

**affine function**: suppose $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ is affine ($f(x)=Ax+b$ with $A\in\mathbb{R}^{m\times n}, b\in\mathbb{R}^m$)

- the image of a convex set under $f$ is convex

- the inverse image $f^{-1}(C)$ of a convex set under $f$ is convex

**perspective function** $P: \mathbb{R}^{n+1} \rightarrow \mathbb{R}^n$:

$$ P(x,t) = x/t, \qquad \mathbf{dom}\ P = \{(x,t)\mid t>0\} $$

images and inverse images of convex sets under perspective are convex

**linear-fractional function** $f:\mathbb{R}^n \rightarrow \mathbb{R}^m$:

$$ f(x) = \frac{Ax+b}{c^Tx+d}, \qquad \mathbf{dom}\ f = \{x\mid c^Tx+d>0\} $$

images and inverse images of convex sets under linear-fractional functions are convex

a convex cone $K\subseteq\mathbb{R}^n$ is a **proper cone** if

- $K$ is closed (contains its boundary)
- $K$ is solid (has nonempty interior)
- $K$ is pointed (contains no line)

**generalized inequality** defined by a proper cone $K$:

$$ x\preceq_K y \iff y-x\in K, \qquad x\prec_K y \iff y-x\in\mathbf{int}\ K $$

$x\in S$ is **the minimum element** of $S$ with respect to $\preceq_K$ if

$$ y\in S \implies x\preceq_K y $$

$x\in S$ is **a minimal element** of $S$ with respect to $\preceq_K$ if

$$ y\in S,\ y\preceq_K x \implies y=x $$

**separating hyperplane theorem**: if $C$ and $D$ are disjoint convex sets, then there exists $a\ne0$, $b$ such that

$$ a^Tx\le b \ \text{ for } x\in C, \qquad a^Tx\ge b \ \text{ for } x\in D $$

**supporting hyperplane** to set $C$ at boundary point $x_0$:

$$ \{x\mid a^Tx=a^Tx_0\} $$

where $a\ne0$ and $a^Tx\le a^Tx_0$ for all $x\in C$

**supporting hyperplane theorem**: if $C$ is convex, then there exists a supporting hyperplane at every boundary point of $C$

**dual cone** of a cone $K$:

$$ K^* = \{y\mid y^Tx\ge0 \text{ for all } x\in K\} $$

Dual cones satisfy several properties, such as:

- $K^*$ is closed and convex
- $K_1 \subseteq K_2$ implies $K_2^* \subseteq K_1^*$
- $K^{**}$ is the closure of the convex hull of $K$ (Hence if $K$ is convex and closed, $K^{**}=K$)

These properties show that if $K$ is a proper cone, then so is its dual $K^{*}$, and moreover, that $K^{**}=K$

dual cones of proper cones are proper, hence define generalized inequalities:

$$ y\succeq_{K^*}0 \iff y^Tx\ge0 \text{ for all } x\succeq_K0 $$

Some important properties relating a generalized inequality and its dual are:

- $x\preceq_K y$ iff $\lambda^Tx \le \lambda^Ty$ for all $\lambda \succeq_{K^{*}} 0$
- $x\prec_K y$ iff $\lambda^Tx < \lambda^Ty$ for all $\lambda \succeq_{K^{*}} 0, \lambda\ne0$

Since $K=K^{**}$, the dual generalized inequality associated with $\preceq_{K^{*}}$ is $\preceq_K$, so these properties hold if the generalized inequality and its dual are swapped

**dual characterization of minimum element** w.r.t. $\preceq_K$: $x$ is the minimum element of $S$ iff for all $\lambda \succ_{K^*}0$, $x$ is the unique minimizer of $\lambda^Tz$ over $z\in S$

**dual characterization of minimal element** w.r.t. $\preceq_K$:

- if $x$ minimizes $\lambda^Tz$ over $S$ for some $\lambda \succ_{K^*}0$, then $x$ is minimal
- if $x$ is a minimal element of a convex set $S$, then there exists a nonzero $\lambda \succeq_{K^*}0$ such that $x$ minimizes $\lambda^Tz$ over $z \in S$

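Suppose that each class density is multivariate Gaussian with a common covariance matrix $\Sigma$:

$$ f_k(x) = \frac{1}{(2\pi)^{p/2}\lvert\Sigma\rvert^{1/2}}\exp\left(-\frac12(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\right). $$

Taking the logarithm of $f_k(x)$, we get

$$ \log f_k(x) = c - \frac12(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) = c - \frac12\mu_k^T\Sigma^{-1}\mu_k + x^T\Sigma^{-1}\mu_k - \frac12x^T\Sigma^{-1}x, $$

where $c = -\log [(2\pi)^{p/2}\lvert\Sigma\rvert^{1/2}]$ and we used $\mu_k^T\Sigma^{-1}x=x^T\Sigma^{-1}\mu_k$. Following the above formula, we can derive **Equation (4.9)** easily:

$$ \log\frac{\Pr(G=k\mid X=x)}{\Pr(G=l\mid X=x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} - \frac12(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) + x^T\Sigma^{-1}(\mu_k-\mu_l). $$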

It’s stated in the book that the LDA classifier can be implemented by the following pair of steps:

- Sphere the data with respect to the common covariance estimate $\hat \Sigma$: $X^*\leftarrow D^{-1/2}U^TX$, where $\hat \Sigma=UDU^T$. The common covariance estimate of $X^*$ will now be the identity.
- Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities $\pi_k$.

However, a detailed explanation is not given in the book. Here, I give some skipped mathematical steps which may help the understanding.

$$ \mathrm{Cov}(X^*) = D^{-1/2}U^T\hat\Sigma UD^{-1/2} = D^{-1/2}U^TUDU^TUD^{-1/2} = I, $$

which shows that the covariance estimate of $X^*$ is the identity.

Note that the classification for LDA is based on the linear discriminant functions

$$ \delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac12\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k, $$

which is the **Equation (4.10)** in the book. Since the input $x$ is the same for each class, we can add back the term $-\frac12x^T\Sigma^{-1}x$, which was cancelled in the previous derivation, without changing the classification. The functions then become

$$ \delta_k(x) = -\frac12(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log\pi_k. $$

We know that $\Sigma=I$ in the transformed space, so $\delta_k(x)=-\frac12\lVert x-\mu_k\rVert_2^2+\log\pi_k$, where $\mu_k$ is the centroid of the $k$th class. This proves the claimed classification rule.
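The two steps can be sketched numerically. The covariance matrix, centroids, priors, and query point below are made-up values, and the sphering map is written for a single column vector $x$ as $x^* = D^{-1/2}U^Tx$. The sketch checks that the sphered covariance is the identity and that the nearest-centroid rule (adjusted by $\log\pi_k$) picks the same class as the linear discriminant functions.

```python
import numpy as np

# Hypothetical common covariance, class centroids, priors, and query point
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
mus = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 2.0]])
priors = np.array([0.5, 0.3, 0.2])
x = np.array([1.0, 0.8])

# Sphering transform from the eigendecomposition Sigma = U D U^T
D, U = np.linalg.eigh(Sigma)
T = np.diag(D ** -0.5) @ U.T
cov_sphered = T @ Sigma @ T.T                    # should be the identity

# Linear discriminant functions in the original space
Sinv = np.linalg.inv(Sigma)
delta = np.array([x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(p)
                  for m, p in zip(mus, priors)])

# Nearest centroid in the sphered space, modulo the priors
xs, ms = T @ x, mus @ T.T
delta_sphered = -0.5 * np.sum((xs - ms) ** 2, axis=1) + np.log(priors)

print(np.argmax(delta), np.argmax(delta_sphered))
```

The two score vectors differ only by the class-independent constant $-\frac12x^T\Sigma^{-1}x$, so the argmax is the same.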

In the two-class case, $p_1(x;\beta)=p(x;\beta)$ and $p_2(x;\beta) = 1-p(x;\beta)$ where

$$ p(x;\beta) = \frac{\exp(\beta^Tx)}{1+\exp(\beta^Tx)}. $$

The **Equation (4.21)** can be derived easily as follows,

Note that

Plug it into **Equation (4.21)**, we get

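The least squares estimate of $\beta$ is given by the book’s **Equation (3.6)**

$$ \hat\beta = (X^TX)^{-1}X^T\mathbf{y}. $$

From the previous post, we know that $\mathrm{E}(\mathbf{y})=X\beta$. As a result, we obtain

$$ \mathrm{E}(\hat\beta) = (X^TX)^{-1}X^TX\beta = \beta. $$

Then, with $\varepsilon = \mathbf{y} - X\beta$, we get

$$ \hat\beta - \mathrm{E}(\hat\beta) = (X^TX)^{-1}X^T(\mathbf{y} - X\beta) = (X^TX)^{-1}X^T\varepsilon. $$

The variance of $\hat \beta$ is computed as

$$ \mathrm{Var}(\hat\beta) = \mathrm{E}\left[(\hat\beta - \mathrm{E}(\hat\beta))(\hat\beta - \mathrm{E}(\hat\beta))^T\right] = (X^TX)^{-1}X^T\mathrm{Var}(\varepsilon)X(X^TX)^{-1}. $$

If we assume that the entries of $\mathbf{y}$ are uncorrelated and all have the same variance of $\sigma^2$, then $\mathrm{Var}(\varepsilon)=\sigma^2I_N$ and the above equation becomes

$$ \mathrm{Var}(\hat\beta) = (X^TX)^{-1}\sigma^2. $$

This completes the derivation of **Equation (3.8)**.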

There are a lot of statistical concepts in this part. It’s better to go through Chapter 6 and Chapter 10 in *All of Statistics* to get a taste of hypothesis tests and confidence intervals.

From my own viewpoint, the Z-score and F-statistic measure whether the corresponding features are useful or not. They can be used within some feature selection methods. However, they’re not very useful in practice. The preferred feature selection methods are discussed in **Section 3.3** in the book.

which completes the derivation of **Equation (3.20)**.

**Equation (3.22)** shows that the expected quadratic error can be broken down into two parts as

$$ \mathrm{E}(Y_0 - \tilde f(x_0))^2 = \sigma^2 + \mathrm{E}(x_0^T\tilde\beta - f(x_0))^2 = \sigma^2 + \text{MSE}(\tilde f(x_0)). $$

The first error component $\sigma^2$ is unrelated to the model used to describe our data. It cannot be reduced, because it is part of the true data generation process. The second source of error, corresponding to the term $\text{MSE}(\tilde f(x_0))$, represents the error in the model and is under our control. By **Equation (3.20)**, the mean squared error can be broken down into two terms: a model variance term and a squared model bias term. How to make these two terms as small as possible while considering the trade-offs between them is the central topic in the book.

The first thing that comes to my mind when reading this section is why we need it when we already have the ordinary least squares (OLS) estimate of $\beta$:

$$ \hat\beta = (X^TX)^{-1}X^T\mathbf{y}. $$

It’s because we want to study how to obtain orthogonal inputs instead of correlated inputs, since orthogonal inputs have some nice properties.

Following Algorithm 3.1, we can transform the correlated inputs $\mathbf{x}$ to the orthogonal inputs $\mathbf{z}$. Another view is that we form an orthogonal basis by performing the Gram-Schmidt orthogonalization procedure on $X$’s column vectors and obtain an orthogonal basis $\{\mathbf{z}_i\}_{i=1}^p$. With this basis, linear regression can be done simply as in the univariate case as shown in **Equation (3.28)**:

$$ \hat\beta_p = \frac{\langle \mathbf{z}_p, \mathbf{y} \rangle}{\langle \mathbf{z}_p, \mathbf{z}_p \rangle}. $$

Following this equation, we can derive **Equation (3.29)**:

$$ \mathrm{Var}(\hat\beta_p) = \frac{\mathbf{z}_p^T\mathrm{Var}(\mathbf{y})\mathbf{z}_p}{\langle \mathbf{z}_p, \mathbf{z}_p \rangle^2} = \frac{\sigma^2}{\lVert \mathbf{z}_p \rVert^2}. $$

We can write the Gram-Schmidt result in matrix form using the QR decomposition as

$$ X = QR. $$

In this decomposition $Q$ is an $N\times(p+1)$ matrix with orthonormal columns and $R$ is a $(p+1)\times(p+1)$ upper triangular matrix. In this representation, the OLS estimate for $\beta$ can be written as

$$ \hat\beta = (R^TQ^TQR)^{-1}R^TQ^T\mathbf{y} = R^{-1}Q^T\mathbf{y}, $$

which is **Equation (3.32)** in the book. Following this equation, the fitted value $\mathbf{\hat y}$ can be written as

$$ \mathbf{\hat y} = X\hat\beta = QQ^T\mathbf{y}, $$

which is **Equation (3.33)** in the book.
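A short NumPy sketch of the QR route, compared against the normal-equations solution. The design matrix and response are made-up numbers; `np.linalg.qr` returns the reduced factorization by default, which matches the shapes described above.

```python
import numpy as np

# Hypothetical design matrix (intercept column plus one input) and response
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

Q, R = np.linalg.qr(X)                    # Q has orthonormal columns, R is upper triangular
beta_qr = np.linalg.solve(R, Q.T @ y)     # beta = R^{-1} Q^T y
y_hat = Q @ (Q.T @ y)                     # fitted values Q Q^T y

beta_ne = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations, for comparison
print(beta_qr, beta_ne)
```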

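If we compute the singular value decomposition (SVD) of the $N\times p$ centered data matrix $X$ as

$$ X = UDV^T, $$

where $U$ is an $N \times p$ matrix with orthonormal columns that span the column space of $X$, $V$ is a $p \times p$ orthogonal matrix, and $D$ is a $p \times p$ diagonal matrix with elements $d_j$ ordered such that $d_1\ge d_2 \ge \dots \ge d_p \ge 0$. From this representation of $X$ we can derive a simple expression for $X^TX$:

$$ X^TX = VDU^TUDV^T = VD^2V^T, $$

which is the **Equation (3.48)** in the book. Using this expression, we can compute the least squares fitted values as

$$ X\hat\beta^{\text{ls}} = X(X^TX)^{-1}X^T\mathbf{y} = UDV^T(VD^2V^T)^{-1}VDU^T\mathbf{y} = UU^T\mathbf{y}, $$

which is the **Equation (3.46)** in the book. Similarly, we can find solutions for ridge regression as

$$ X\hat\beta^{\text{ridge}} = X(X^TX+\lambda I)^{-1}X^T\mathbf{y} = UD(D^2+\lambda I)^{-1}DU^T\mathbf{y} = \sum_{j=1}^p \mathbf{u}_j\frac{d_j^2}{d_j^2+\lambda}\mathbf{u}_j^T\mathbf{y}, $$

which is the **Equation (3.47)** in the book. Since we can estimate the sample covariance by $X^TX/N$, the variance of $\mathbf{z}_1 = Xv_1$ can be derived as follows:

$$ \mathrm{Var}(\mathbf{z}_1) = \mathrm{Var}(Xv_1) = v_1^T\frac{X^TX}{N}v_1 = \frac{v_1^TVD^2V^Tv_1}{N} = \frac{d_1^2}{N}, $$

which is the **Equation (3.49)** in the book. Note that $v_1$ is the first column of $V$ and $V$ is orthogonal, so that $V^Tv_1$ is $[1,0, \dots, 0]^T$.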
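The SVD expressions can be checked against the direct formulas. The small centered matrix `X`, response `y`, and penalty `lam` below are made-up values for illustration.

```python
import numpy as np

# Hypothetical centered data (each column sums to zero) and response
X = np.array([[ 1.0,  2.0],
              [-1.0,  0.5],
              [ 0.5, -1.5],
              [-0.5, -1.0]])
y = np.array([1.0, -0.5, 0.2, -0.7])
lam = 0.7

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U diag(d) V^T

# Least squares fit: U U^T y
y_ls_svd = U @ (U.T @ y)
y_ls_direct = X @ np.linalg.solve(X.T @ X, X.T @ y)

# Ridge fit: each coordinate u_j^T y is shrunk by d_j^2 / (d_j^2 + lam)
shrink = d ** 2 / (d ** 2 + lam)
y_ridge_svd = U @ (shrink * (U.T @ y))
y_ridge_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(shrink)
```

Directions with small $d_j$ are shrunk the most, which is the usual reading of the ridge formula.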

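The degrees-of-freedom of the fitted vector $\mathbf{\hat y}=(\hat y_1, \dots, \hat y_N)$ is defined as

$$ \text{df}(\mathbf{\hat y}) = \frac{1}{\sigma^2}\sum_{i=1}^N \text{Cov}(\hat y_i, y_i) $$

in the book. Also, it’s claimed without proof that $\text{df}(\mathbf{\hat y})$ is $k$ for ordinary least squares regression and $\text{tr}(\mathbf{S}_{\lambda})$ for ridge regression. Here, we’ll derive these two expressions. First, we define $e_i$ as an $N$-element vector of all zeros with a one in the $i$th spot. It’s easy to see that $\hat y_i=e_i^T\mathbf{\hat y}$ and $y_i=e_i^T\mathbf{y}$, so that

$$ \text{Cov}(\hat y_i, y_i) = \text{Cov}(e_i^T\mathbf{\hat y}, e_i^T\mathbf{y}) = e_i^T\text{Cov}(\mathbf{\hat y}, \mathbf{y})e_i. $$

For OLS regression, we have $\mathbf{\hat y}=X(X^TX)^{-1}X^T\mathbf{y}$, so the above expression for $\text{Cov}(\mathbf{\hat y}, \mathbf{y})$ becomes

$$ \text{Cov}(\mathbf{\hat y}, \mathbf{y}) = X(X^TX)^{-1}X^T\text{Cov}(\mathbf{y}, \mathbf{y}) = \sigma^2X(X^TX)^{-1}X^T. $$

Thus,

$$ \text{Cov}(\hat y_i, y_i) = \sigma^2e_i^TX(X^TX)^{-1}X^Te_i = \sigma^2x_i^T(X^TX)^{-1}x_i, $$

where $x_i=X^Te_i$ is the $i$th row of $X$, or the $i$th sample’s feature vector. According to the given formula, we get

$$ \text{df}(\mathbf{\hat y}) = \sum_{i=1}^N x_i^T(X^TX)^{-1}x_i = \sum_{i=1}^N \text{tr}\left(x_ix_i^T(X^TX)^{-1}\right) = \text{tr}\left(\Big(\sum_{i=1}^N x_ix_i^T\Big)(X^TX)^{-1}\right). $$

If you’re not familiar with the basic properties of the trace, you can refer to this page. Note that

$$ X^TX = \sum_{i=1}^N x_ix_i^T. $$

Thus, when there are $k$ predictors we get

$$ \text{df}(\mathbf{\hat y}) = \text{tr}\left(X^TX(X^TX)^{-1}\right) = \text{tr}(I_k) = k, $$

the claimed result for OLS in the book. Similarly for ridge regression,

$$ \text{df}(\lambda) = \text{tr}\left(X(X^TX+\lambda I)^{-1}X^T\right) = \text{tr}(\mathbf{S}_\lambda) = \sum_{j=1}^p\frac{d_j^2}{d_j^2+\lambda}, $$

which is the **Equation (3.50)** in the book.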
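Both trace results are easy to confirm numerically. The data matrix and the value of `lam` below are illustrative assumptions.

```python
import numpy as np

# Hypothetical data matrix with k = 2 predictors
X = np.array([[ 1.0,  2.0],
              [-1.0,  0.5],
              [ 0.5, -1.5],
              [-0.5, -1.0]])
lam = 0.7
k = X.shape[1]

# OLS: trace of the hat matrix X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
df_ols = np.trace(H)

# Ridge: trace of S_lambda = X (X^T X + lam I)^{-1} X^T
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(k), X.T)
df_ridge = np.trace(S)

# The same quantity from the singular values of X
d = np.linalg.svd(X, compute_uv=False)
df_ridge_svd = np.sum(d ** 2 / (d ** 2 + lam))

print(df_ols, df_ridge, df_ridge_svd)
```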

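The expected prediction error (EPE) under the squared error loss:

$$ \text{EPE}(\beta) = \int (y - x^T\beta)^2\Pr(dx, dy). $$

Taking derivatives with respect to $\beta$:

$$ \frac{\partial\text{EPE}}{\partial\beta} = \int 2(y - x^T\beta)(-1)x\Pr(dx, dy) = -2\int (xy - xx^T\beta)\Pr(dx, dy). $$

In order to minimize the EPE, we set the derivative to zero, which gives **Equation (2.16)**:

$$ \mathrm{E}(XY) - \mathrm{E}(XX^T)\beta = 0 \implies \beta = [\mathrm{E}(XX^T)]^{-1}\mathrm{E}(XY). $$

*Note: $x^T\beta$ is a scalar, and $\beta$ is a constant.*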

There are $N$ $p$-dimensional data point $x_1,\dots, x_N$, that is, $N\times p$ dimensions in total. Let $r_i=\Vert x_i \Vert$. Without loss of generality, we assume that $A < r_1 < \dots < r_n < 1$. Let $U(A)$ be the region of all possible sampled data which meet the assumptation:

The goal is to find $A$ such that $U(A)=\frac12U(0)$. It turns out to be a integration problem on a $N \times p$ dimensional space.

With some mathematical techniques (which I find overwhelming), we get $U(A)=(1-A^p)^N$, so $U(0)=1$. Solving $(1-A^p)^N=1/2$, we obtain **Equation (2.24)**:
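Solving $(1-A^p)^N=1/2$ for $A$ gives $A=\left(1-(1/2)^{1/N}\right)^{1/p}$. The sketch below simply evaluates this closed form. The function name `median_nearest_distance` is made up for illustration, and $p=10$, $N=500$ are example values.

```python
def median_nearest_distance(p: int, N: int) -> float:
    """Solve (1 - A**p)**N = 1/2 for A, i.e. A = (1 - (1/2)**(1/N))**(1/p)."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)


d = median_nearest_distance(p=10, N=500)
print(round(d, 2))  # 0.52
```

Even with 500 points in 10 dimensions, the median distance from the origin to the closest point is more than halfway to the boundary.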

The variation is over all training sets $\mathcal{T}$, and over all values of $y_0$, while keeping $x_0$ fixed. Note that $x_0$ and $y_0$ are chosen independently of $\mathcal{T}$ and so the expectations commute: $\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}=\mathrm{E}_{\mathcal{T}}\mathrm{E}_{y_0 \vert x_0}$. Also $\mathrm{E}_\mathcal{T}=\mathrm{E}_\mathcal{X}\mathrm{E}_{\mathcal{Y \vert X}}$.

To make the derivation easier to follow, here are some definitions:

$y_0-\hat y_0$ can be written as the sum of three terms:
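One decomposition consistent with the statements that follow is given here. The exact definitions of $U_1$, $U_2$, $U_3$ are an assumption, taking $y_0=x_0^T\beta+\varepsilon$:

\begin{equation} y_0-\hat y_0 = \underbrace{(y_0-x_0^T\beta)}_{U_1} - \underbrace{(\hat y_0-\mathrm{E}_\mathcal{T}\hat y_0)}_{U_2} - \underbrace{(\mathrm{E}_\mathcal{T}\hat y_0-x_0^T\beta)}_{U_3}. \end{equation}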

Following the above definitions, we have $U_1=\varepsilon$ and $U_3=0$. In addition, clearly $\mathrm{E}_\mathcal{T}U_2=0$. When squaring $U_1-U_2-U_3$, we can eliminate all three cross terms and one squared term, $U_3^2$.

Following the definition of variance, we have: $\mathrm{E}_{y_0\vert x_0}\mathrm{E}_\mathcal{T}U_1^2=\mathrm{Var}(\varepsilon)=\sigma^2$ and $\mathrm{E}_\mathcal{T}(\hat y_0 - \mathrm{E}_\mathcal{T}\hat y_0)^2=\mathrm{Var}_\mathcal{T}(\hat y_0)$.

Since $U_2=\sum_{i=1}^Nl_i(x_0)\varepsilon_i$, we have $\mathrm{Var}_\mathcal{T}(\hat y_0)=\mathrm{E}_\mathcal{T}U_2^2$ as

Since $\mathrm{E}_\mathcal{T}\varepsilon\varepsilon^T=\sigma^2I_N$, this is equal to $\mathrm{E}_\mathcal{T}x_0^T(X^TX)^{-1}x_0\sigma^2$. This completes the derivation of **Equation (2.27)**.

Under the conditions stated by the authors, $X^TX/N$ is then approximately equal to $\mathrm{Cov}(X)=\mathrm{Cov}(x_0)$. Applying $\mathrm{E}_{x_0}$ to $\mathrm{E}_\mathcal{T}x_0^T(X^TX)^{-1}x_0\sigma^2$, we obtain (approximately)

This completes the derivation of **Equation (2.28)**.
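The variance term behind Equation (2.28) is $\sigma^2\,\mathrm{E}\left[x_0^T(X^TX)^{-1}x_0\right]\approx\sigma^2 p/N$. A minimal Monte Carlo sketch can check the approximation, under the assumption of standard normal inputs with $\mathrm{E}(x)=0$. The sizes `N`, `p` and the number of trials are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, trials = 200, 5, 2000   # hypothetical sizes

vals = []
for _ in range(trials):
    X = rng.normal(size=(N, p))      # training inputs with E(x) = 0
    x0 = rng.normal(size=p)          # test point from the same distribution
    vals.append(x0 @ np.linalg.solve(X.T @ X, x0))

estimate = float(np.mean(vals))
print(estimate, p / N)   # the two numbers should be close
```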

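To understand the two theorems above more precisely, we need convergence in the probabilistic sense rather than convergence as in calculus (a sequence of real numbers $x_n$ converges to a limit $x$ if, for every $\epsilon>0$, $\lvert x_n -x \rvert < \epsilon$ for all sufficiently large $n$).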

In statistics, there are two main types of convergence:

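Let $X_1,X_2,\dots$ be a sequence of random variables and let $X$ be another random variable. Let $F_n$ denote the cumulative distribution function (CDF) of $X_n$ and let $F$ denote the CDF of $X$.

1) $X_n$ converges to $X$ in probability, written $X_n\xrightarrow[]{P} X$, if for every $\epsilon>0$

\begin{equation} \mathbb{P}(\lvert X_n-X\rvert>\epsilon)\rightarrow 0 \end{equation}

as $n\rightarrow\infty$.

2) $X_n$ converges to $X$ in distribution, written $X_n\rightsquigarrow X$, if for every $t$ at which $F$ is continuous,

\begin{equation} \lim_{n\rightarrow\infty}F_n(t)=F(t). \end{equation}

In addition, $X_n\xrightarrow[]{P}X$ implies $X_n\rightsquigarrow X$.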

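P.S. There are actually two other types of convergence, and the relationships among them are more complicated. Since the focus here is on the two theorems, this part is kept brief; see a relevant textbook for a deeper treatment.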

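Let $X_1,X_2,\dots$ be independent and identically distributed (IID) samples, and let $\mu=\mathbb{E}(X_1)$ and $\sigma^2=\mathbb{V}(X_1)$. The sample mean is $\bar X_n = n^{-1}\sum_{i=1}^nX_i$, with $\mathbb{E}(\bar X_n)=\mu$ and $\mathbb{V}(\bar X_n)=\sigma^2/n$.

Weak law of large numbers (WLLN): if $X_1,\dots, X_n$ are IID, then $\bar X_n \xrightarrow[]{P} \mu$.

Interpretation: as $n$ grows, the distribution of $\bar X_n$ becomes more and more concentrated around $\mu$.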

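Central limit theorem (CLT): if $X_1,\dots,X_n$ are IID, then

\begin{equation} Z_n \equiv \frac{\bar X_n-\mu}{\sqrt{\mathbb{V}(\bar X_n)}} = \frac{\sqrt n(\bar X_n-\mu)}{\sigma} \rightsquigarrow Z, \end{equation}

where $Z \sim N(0,1)$, i.e., the standard normal distribution.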

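Interpretation: probability statements about $\bar X_n$ can be approximated using a normal distribution. Note that this does not mean the random variables themselves are approximately normal.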
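To make this concrete, here is a small simulation sketch. It assumes Exponential(1) samples, which are clearly non-normal and have $\mu=\sigma=1$; the sample size and number of repetitions are arbitrary. If the standardized mean is close to $N(0,1)$, about $95\%$ of its values should fall within $\pm1.96$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 5000                 # hypothetical sample size and repetitions
mu = sigma = 1.0                      # mean and std of an Exponential(1) variable

samples = rng.exponential(scale=1.0, size=(trials, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma   # standardized sample means

coverage = float(np.mean(np.abs(z) <= 1.96))
print(coverage)   # should be close to 0.95
```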

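Most of the time, however, we do not know $\sigma$. In practice we can replace $\sigma$ with the sample standard deviation $S_n$, where $S_n^2=\frac1{n-1}\sum_{i=1}^n(X_i-\bar X_n)^2$ is the sample variance.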

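Theorem: under the same conditions as the CLT,

\begin{equation} \frac{\sqrt n(\bar X_n-\mu)}{S_n}\rightsquigarrow N(0,1). \end{equation}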

To apply the central limit theorem more widely and effectively, it is well worth mastering the delta method.

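Delta method: suppose that

\begin{equation} \frac{\sqrt n(Y_n-\mu)}{\sigma}\rightsquigarrow N(0,1). \end{equation}

Then

\begin{equation} \frac{\sqrt n\,(g(Y_n)-g(\mu))}{\lvert g'(\mu)\rvert\,\sigma}\rightsquigarrow N(0,1), \end{equation}

where $g$ is a differentiable function such that $g'(\mu)\ne0$.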

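P.S. The strong law of large numbers, the multivariate central limit theorem, and the multivariate delta method are omitted because they are more involved. Also, out of laziness, I have not written up any examples that would aid understanding; treat this purely as a record of my learning process.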

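Suppose we train a classifier on the MNIST dataset with a neural network and get an error rate on the test set of, say, $0.05$. Does this mean we can guarantee that our network achieves $95\%$ accuracy? Clearly, a result obtained from a single run is not reliable. So how confident (in terms of probability) can we be in this observed error rate?

This is where some statistical language is needed. Suppose we have $n$ test samples, and whether each one is classified correctly is a random variable $X_1,\dots,X_n$, with $X_i=1$ if the sample is misclassified and $X_i=0$ otherwise. Clearly, $\bar X_n=n^{-1}\sum_{i=1}^nX_i$ is the observed error rate. We can treat each $X_i$ as a Bernoulli random variable with mean $p$, so that $p$ is the true (but never exactly knowable) error rate. We would like $\bar X_n$ to be close to $p$. So how likely is it that the gap between $\bar X_n$ and $p$ exceeds a fixed value $\epsilon$? This probability is $\mathbb{P}(\lvert\bar X_n -p\rvert > \epsilon)$. It is usually hard to compute directly, so we use inequalities to bound it.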

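Our first inequality is Markov’s inequality:

Let $X$ be a non-negative random variable and suppose that $\mathbb{E}(X)$ exists. For any $t>0$,

\begin{equation} \mathbb{P}(X>t)\le\frac{\mathbb{E}(X)}{t}. \end{equation}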

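At first glance, it seems we cannot apply Markov’s inequality directly to bound $\mathbb{P}(\lvert\bar X_n -p\rvert > \epsilon)$. With a small transformation, however, we get another inequality that can be used directly, namely Chebyshev’s inequality:

Let $\mu = \mathbb{E}(X)$ and $\sigma^2=\mathbb{V}(X)$. Then

\begin{equation} \mathbb{P}(\lvert X-\mu\rvert\ge t)\le\frac{\sigma^2}{t^2}. \end{equation}

This inequality follows directly from Markov’s inequality:

\begin{equation} \mathbb{P}(\lvert X-\mu\rvert\ge t)=\mathbb{P}(\lvert X-\mu\rvert^2\ge t^2)\le\frac{\mathbb{E}(X-\mu)^2}{t^2}=\frac{\sigma^2}{t^2}. \end{equation}

Since each $X_i$ is Bernoulli, $\mathbb{V}(\bar X_n)=\mathbb{V}(X_i)/n=p(1-p)/n$, and therefore

\begin{equation} \mathbb{P}(\lvert\bar X_n-p\rvert>\epsilon)\le\frac{\mathbb{V}(\bar X_n)}{\epsilon^2}=\frac{p(1-p)}{n\epsilon^2}\le\frac{1}{4n\epsilon^2}. \end{equation}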

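Note that $p(1-p)\le\frac14$ for any $0<p<1$. If we want the probability that the true error rate and the observed error rate differ by more than $\epsilon=0.05$ to be at most $0.05$, a simple calculation shows that we need about $n=2000$ test samples. See, inequalities are starting to look useful, aren’t they?

Chebyshev’s inequality is only a relatively crude bound; there are many sharper inequalities, which naturally come with various additional conditions. Here is one sharper bound, Hoeffding’s inequality:

Let $X_1,\dots, X_n \sim \mathrm{Bernoulli}(p)$. For any $\epsilon>0$,

\begin{equation} \mathbb{P}(\lvert\bar X_n-p\rvert>\epsilon)\le 2e^{-2n\epsilon^2}, \end{equation}

where $\bar X_n = n^{-1}\sum_{i=1}^nX_i$.

Using this inequality, a calculation shows that $738$ test samples are actually enough.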
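Both sample sizes can be reproduced with a few lines of Python. This is a minimal sketch assuming the bounds $1/(4n\epsilon^2)$ (Chebyshev, Bernoulli case) and $2e^{-2n\epsilon^2}$ (Hoeffding), each required to be at most $\delta=0.05$ with $\epsilon=0.05$. The function names are made up for illustration.

```python
import math
from fractions import Fraction


def chebyshev_n(eps, delta):
    """Smallest n with 1 / (4 n eps^2) <= delta."""
    return math.ceil(1 / (4 * eps**2 * delta))


def hoeffding_n(eps, delta):
    """Smallest n with 2 exp(-2 n eps^2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))


eps = delta = Fraction(1, 20)     # 0.05, kept exact to avoid rounding at the boundary
print(chebyshev_n(eps, delta))    # 2000
print(hoeffding_n(eps, delta))    # 738
```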

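P.S. The version of Hoeffding’s inequality above is the special case for Bernoulli variables; for the general inequality, see Wikipedia or a relevant textbook.

This small example gives an intuitive sense of what inequalities are good for. Their real home, though, is in derivations and proofs. I automatically skip papers that are wall-to-wall formulas bounding one thing after another, and what I work on now leans toward applications. But who knows whether I will need them often in future research? At the very least, I have now learned how to estimate a reliable test set size.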

Year | Interviewed | Hired | Unqualified hires | Unqualified rate |
---|---|---|---|---|
2014 | 1000 | 350 | 30 | 8.57% |
2015 | 1000 | 650 | 10 | 1.54% |
2016 | 1000 | 200 | 10 | 5.00% |

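So, is this company’s hiring strategy effective?

Clearly, the company’s hiring standard judges whether a candidate is qualified by the number of problems solved. In statistical language, whether each candidate is hired is a random variable $X_i$ whose sample space is $\{0, 1\}$, where $0$ means unqualified and $1$ means qualified. These random variables all follow a probability distribution $p(x;\theta)$ determined by the minimum number of problems solved, $\theta$. In mathematical notation, $X_1,\dots,X_n \sim p(x;\theta)$. The distribution assumed by the company’s strategy is idealized (unrealistic): for a given $\theta=\theta_0$, the distribution $p(x;\theta_0)$ is given by the following table: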

$p(x;\theta)$ | $x=0$ | $x=1$ |
---|---|---|
$\theta<\theta_0$ | 1 | 0 |
$\theta\ge\theta_0$ | 0 | 1 |

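To judge the effectiveness of the hiring strategy (the assumed distribution $p$), we need to form hypotheses about the distribution’s parameter, the minimum number of problems solved $\theta$. First, we state a null hypothesis $H_0: \theta=\theta_0$ and an alternative hypothesis $H_1: \theta \ne \theta_0$. The question hypothesis testing asks is not whether $H_0$ is true or false, but whether we have enough evidence to show that $H_0$ is false.

So where do we look for evidence? In the observed data, of course. Whenever we observe a set of data, we need some statistic to support our judgment. For hiring, the most intuitive one is the unqualified rate $1-\bar X$. We then need to set a criterion for when to reject $H_0$, i.e., our minimum expectation; say, reject $H_0$ if the unqualified rate exceeds $c$.

The power function $\beta(\theta)$ is the probability of rejecting $H_0$ for a given $\theta$. For the hiring problem, $\beta(\theta)=P_\theta(1-\bar X >c)$. The significance level $\alpha$ is the largest value of $\beta(\theta)$ allowed when $H_0$ holds. A bit convoluted, isn’t it? To sum up, $\alpha$ is “the maximum probability, under the current hiring strategy, that the unqualified rate exceeds $c$.” Since under $H_0$ in the hiring problem $\theta$ takes only the single value $\theta_0$, we have $\alpha=\beta(\theta_0)$. Note that $\alpha$ is determined by $c$, i.e., it is set by hand in advance.

The p-value $p$ is the smallest $\alpha$ at which we would reject $H_0$; in other words, we reject $H_0$ whenever $\alpha > p$. For the hiring problem, if we want the unqualified rate to be no higher than $5\%$, then $c=0.05$ and $\alpha=\beta(\theta_0)=P_{\theta_0}(1-\bar X > 0.05)$. When the probability that the unqualified rate exceeds $5\%$ is higher than $p$, we consider the current hiring strategy ineffective. So the smaller $p$ is, the stronger the evidence against $H_0$. This is where the academic rule of thumb “a p-value of $0.05$ is enough to treat a result as statistically significant” comes from. Of course, do not mistake this for $p=P(H_0)$, i.e., the probability that the hiring strategy is effective ($H_0$ is true).

The p-value also has a rather unreliable property: when $H_0$ is actually true, $p$ follows a $\mathrm{Uniform}(0,1)$ distribution. That is, even a very small p-value does not necessarily mean that $H_0$ is false; it may simply have occurred by chance. That really is unreliable, so no wonder it draws so much criticism.

Finally, back to whether the hiring strategy is effective. Estimating $P_{\theta_0}(1-\bar X > 0.05)$ from the data gives $\alpha=0.333$, but since there are only three observations, the p-value is not of much reference value for this problem. The same is true for many other problems. In short, although the p-value is widely used, in most cases it is not of much use.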

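In addition, creating intelligent machines would bring a series of practical problems in areas such as ethics, law, and employment: for example, assigning liability for accidents involving driverless cars, or mass unemployment caused by machines replacing workers.

In short, rather than creating “intelligent machines” as a new species equivalent to humans, we should think about how to use artificial intelligence to better assist humans and to make humans stronger in a directed way. I believe this will be a trend in the field of artificial intelligence for the foreseeable future.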

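Although the book was written in 1994, many of its ideas coincide with how today’s society and technology have developed. After the first few chapters, what impressed me most was a set of related concepts: the hive mind, distributed systems, decentralization, and so on. In short, it describes bottom-up systems, as opposed to traditional top-down systems. “System” here is a very broad term: it can be a machine, a piece of software, an animal, or even human society, a political system, the World Wide Web, and so on. Perhaps I am simply ignorant, but before reading this book I subconsciously believed that the vast majority of systems should have a center that holds absolute authority and issues commands, such as a PC’s CPU, an ancient emperor, or the human brain. This book, however, presents a completely different kind of system. In short, there is no absolute center: the behavior of each individual determines the action of the whole (this can be loosely understood as the minority deferring to the majority, though reality is usually more complex), and more complex behaviors (operations) are built up modularly, layer by layer, from simple ones. In addition, its distributed nature gives it stronger fault tolerance and the stability to keep running when parts of it fail. Modern science has found such systems in social animals (bee swarms, ant colonies), and advances in brain science also show that the brain does not control everything in the body the way we once imagined. The joint action of countless neurons makes up the brain, and the relationship between the brain and the body’s senses does not seem to be one of simple subordination either.

The advantages of such systems have long been confirmed in industry. Here I also record the various related (or unrelated?) thoughts that came to mind while reading:

- The motivation behind ensemble learning seems to coincide with the hive mind.
- Neural networks also reflect the characteristics of such systems well, for example in the successful use of techniques such as mixture of experts and dropout.
- An algorithm’s generalization ability = a system’s fault tolerance?
- MXNet’s development emphasizes decentralization and modularity; how it turns out remains to be seen. For now, though, I’m sticking with TensorFlow for ease of use.
- In the age of social networks, each of us is one of the mob, and personalized recommendation aggravates information asymmetry.
- Extreme democracy is not conducive to the development of civilization: the price of strong fault tolerance is aimless wandering and tug-of-war, just as stronger generalization is bought with a large amount of repeated computation. The US presidential election may be an example.

On Weibo I saw someone criticize the author as a big fraud. I won’t comment on the author’s current speaking tours, money-making, or preaching. But a book from the last century that still offers practical insight today is, I think, worth reading.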

Both TD and Monte Carlo methods use experience to solve the prediction problem. Given some experience following a policy $\pi$, both methods update their estimate $V$ of $v_\pi$ for the nonterminal states $S_t$ occurring in that experience. Whereas Monte Carlo methods must wait until the end of the episode to determine the increment to $V(S_t)$ (only then is $G_t$ known), TD methods need wait only until the next time step. The simplest TD method, known as TD(0), is

\begin{equation} V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]. \end{equation}

TD methods combine the sampling of Monte Carlo with the bootstrapping of DP. As we shall see, with care and imagination this can take us a long way toward obtaining the advantages of both Monte Carlo and DP methods.

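Note that the quantity in brackets in the TD(0) update is a sort of error, measuring the difference between the estimated value of $S_t$ and the better estimate $R_{t+1}+\gamma V(S_{t+1})$. This quantity, called the TD error, arises in various forms throughout reinforcement learning:

\begin{equation} \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t). \end{equation}

Also note that, if $V$ does not change during the episode, the Monte Carlo error can be written as a sum of TD errors:

\begin{equation} G_t - V(S_t) = \sum_{k=t}^{T-1}\gamma^{k-t}\delta_k. \end{equation}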

This fact and its generalizations play important roles in the theory of TD learning.
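The TD(0) update and the identity between the Monte Carlo error and the discounted sum of TD errors can be checked on a toy episode. The sketch below is only an illustration: the states, rewards, and value table are made up, the terminal state `"T"` is assigned value zero, and `V` is held fixed while the TD errors are computed.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) step: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])


def td_errors(states, rewards, V, gamma):
    """delta_t for one episode; rewards[t] plays the role of R_{t+1}."""
    return [rewards[t] + gamma * V[states[t + 1]] - V[states[t]]
            for t in range(len(rewards))]


def returns(rewards, gamma):
    """G_t = R_{t+1} + gamma * G_{t+1}, computed backwards from the end."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]


# Hypothetical episode A -> B -> A -> terminal "T", with V("T") = 0.
V = {"A": 0.5, "B": -0.2, "T": 0.0}
states, rewards, gamma = ["A", "B", "A", "T"], [1.0, 0.0, 2.0], 0.9

deltas = td_errors(states, rewards, V, gamma)
G = returns(rewards, gamma)
mc_errors = [G[t] - V[states[t]] for t in range(len(rewards))]
td_sums = [sum(gamma ** (k - t) * deltas[k] for k in range(t, len(rewards)))
           for t in range(len(rewards))]
print(mc_errors, td_sums)   # the two lists agree up to rounding

V_after = dict(V)
td0_update(V_after, "A", 1.0, "B", alpha=0.1, gamma=gamma)
```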

Suppose there is available only a finite amount of experience, say 10 episodes or 100 time steps. In this case, a common approach with incremental learning methods is to present the experience repeatedly until the method converges upon an answer. Updates are made only after processing each complete batch of training data. We call this batch updating.

Batch Monte Carlo methods always find the estimates that minimize mean-squared error on the training set, whereas batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process. In this case, the maximum-likelihood estimate is the model of the Markov process formed in the obvious way from the observed episodes: the estimated transition probability from $i$ to $j$ is the fraction of observed transitions from $i$ that went to $j$, and the associated expected reward is the average of the rewards observed on those transitions. Given this model, we can compute the estimate of the value function that would be exactly correct if the model were exactly correct. This is called the certainty-equivalence estimate because it is equivalent to assuming that the estimate of the underlying process was known with certainty rather than being approximated. In general, batch TD(0) converges to the certainty-equivalence estimate.

As usual, we follow the pattern of generalized policy iteration (GPI), only this time using TD methods for the evaluation or prediction part.

In the previous section we considered transitions from state to state and learned the values of states. Now we consider transitions from state–action pair to state–action pair, and learn the values of state–action pairs. The theorems assuring the convergence of state values under TD(0) also apply to the corresponding algorithm for action values, known as Sarsa:

\begin{equation} Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\right]. \end{equation}

One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning, defined by:

\begin{equation} Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)\right]. \end{equation}

In this case, the learned action-value function, $Q$, directly approximates $q_*$, the optimal action-value function, independent of the policy being followed.

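Consider the algorithm with the update rule

\begin{equation} Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma \sum_a \pi(a \vert S_{t+1})Q(S_{t+1},a) - Q(S_t,A_t)\right], \end{equation}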

but that otherwise follows the schema of Q-learning. Given the next state $S_{t+1}$, this algorithm moves deterministically in the same direction as Sarsa moves in expectation, and accordingly it is called expected Sarsa.
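The three control algorithms differ only in the target they bootstrap from. As a minimal sketch, with a made-up two-state `Q` table and a made-up stochastic `policy` (none of these names come from the text), the targets can be compared side by side:

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """Uses the action actually taken in the next state."""
    return r + gamma * Q[s_next][a_next]


def q_learning_target(Q, r, s_next, gamma):
    """Uses the greedy (maximum) action value in the next state."""
    return r + gamma * max(Q[s_next].values())


def expected_sarsa_target(Q, r, s_next, policy, gamma):
    """Uses the expectation of the next action value under the policy."""
    return r + gamma * sum(policy[s_next][a] * q for a, q in Q[s_next].items())


def td_control_update(Q, s, a, target, alpha):
    """Shared update: Q(s, a) <- Q(s, a) + alpha * (target - Q(s, a))."""
    Q[s][a] += alpha * (target - Q[s][a])


# Hypothetical values for illustration.
Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 1.0, "right": 3.0}}
policy = {"s1": {"left": 0.25, "right": 0.75}}
r, gamma = 1.0, 0.5

t_sarsa = sarsa_target(Q, r, "s1", "left", gamma)               # 1.5
t_q = q_learning_target(Q, r, "s1", gamma)                      # 2.5
t_expected = expected_sarsa_target(Q, r, "s1", policy, gamma)   # 2.25
td_control_update(Q, "s0", "right", t_q, alpha=0.1)
```

The expected Sarsa target averages over what Sarsa would sample, which removes the variance due to the random choice of $A_{t+1}$.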

All the control algorithms that we have discussed so far involve maximization in the construction of their target policies. In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a significant positive bias. We call this maximization bias.

One way to view the problem is that it is due to using the same samples (plays) both to determine the maximizing action and to estimate its value. Suppose we divided the plays in two sets and used them to learn two independent estimates, call them $Q_1(a)$ and $Q_2(a)$, each an estimate of the true value $q(a)$, for all $a\in\mathcal{A}$. We could then use one estimate, say $Q_1$, to determine the maximizing action $A^*=\mathrm{argmax}_aQ_1(a)$, and the other, $Q_2$, to provide the estimate of its value, $Q_2(A^*)=Q_2(\mathrm{argmax}_aQ_1(a))$. This estimate will then be unbiased in the sense that $\mathbb{E}[Q_2(A^*)]=q(A^*)$. We can also repeat the process with the role of the two estimates reversed to yield a second unbiased estimate $Q_1(\mathrm{argmax}_aQ_2(a))$. This is the idea of double learning.

An obvious way to estimate the state-value function from experience (recall that the value of a state is the expected return starting from that state) is to average the returns observed after visits to that state. This idea underlies all Monte Carlo methods.

In particular, suppose we wish to estimate $v_\pi(s)$, the value of a state $s$ under policy $\pi$, given a set of episodes obtained by following $\pi$ and passing through $s$. Each occurrence of state $s$ in an episode is called a visit to $s$. The first-visit MC method estimates $v_\pi(s)$ as the average of the returns following first visits to $s$, whereas the every-visit MC method averages the returns following all visits to $s$.

An important fact about Monte Carlo methods is that the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state, as is the case in DP.

If a model is not available, then it is particularly useful to estimate action values rather than state values. The Monte Carlo methods for this are essentially the same as just presented for state values, except now we talk about visits to a state–action pair rather than to a state. A state–action pair $s, a$ is said to be visited in an episode if ever the state $s$ is visited and action $a$ is taken in it.

The only complication is that many state–action pairs may never be visited. This is the general problem of maintaining exploration.

The overall idea of how Monte Carlo estimation can be used in control is to proceed according to the idea of generalized policy iteration (GPI). In GPI one maintains both an approximate policy and an approximate value function. The value function is repeatedly altered to more closely approximate the value function for the current policy, and the policy is repeatedly improved with respect to the current value function.

We made two unlikely assumptions in order to easily obtain this guarantee of convergence for the Monte Carlo method. One was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes.

The assumption that policy evaluation operates on an infinite number of episodes is relatively easy to remove. One approach is to forgo trying to complete policy evaluation before returning to policy improvement. For Monte Carlo policy evaluation it is natural to alternate between evaluation and improvement on an episode-by-episode basis.

The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them. There are two approaches to ensuring this, resulting in what we call on-policy methods and off-policy methods. On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.

In on-policy control methods the policy is generally soft, meaning that $\pi(a \vert s)>0$ for all $s\in\mathcal{S}$ and all $a\in\mathcal{A}(s)$, but gradually shifted closer and closer to a deterministic optimal policy.

All learning control methods face a dilemma: They seek to learn action values conditional on subsequent optimal behavior, but they need to behave non-optimally in order to explore all actions (to find the optimal actions). How can they learn about the optimal policy while behaving according to an exploratory policy? The on-policy approach in the preceding section is actually a compromise. It learns action values not for the optimal policy, but for a near-optimal policy that still explores. A more straightforward approach is to use two policies, one that is learned about and that becomes the optimal policy, and one that is more exploratory and is used to generate behavior. The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.

Suppose we wish to estimate $v_\pi$ or $q_\pi$, but all we have are episodes following another policy $\mu$, where $\mu \ne \pi$. In this case, $\pi$ is the target policy, $\mu$ is the behavior policy, and both policies are considered fixed and given. We require that $\pi(a \vert s)>0$ implies $\mu(a \vert s)>0$. This is called the assumption of coverage.

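Almost all off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another. Given a starting state $S_t$, the probability of the subsequent state-action trajectory, $A_t,S_{t+1},A_{t+1},\dots,S_T$, occurring under any policy $\pi$ is

\begin{equation} \prod_{k=t}^{T-1}\pi(A_k \vert S_k)\,p(S_{k+1} \vert S_k, A_k), \end{equation}

where $p$ is the state-transition probability function.

Thus, the relative probability of the trajectory under the target and behavior policies (the importance-sampling ratio) is

\begin{equation} \rho_t^T = \frac{\prod_{k=t}^{T-1}\pi(A_k \vert S_k)\,p(S_{k+1} \vert S_k, A_k)}{\prod_{k=t}^{T-1}\mu(A_k \vert S_k)\,p(S_{k+1} \vert S_k, A_k)} = \prod_{k=t}^{T-1}\frac{\pi(A_k \vert S_k)}{\mu(A_k \vert S_k)}. \end{equation}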

We can define the set of all time steps in which state $s$ is visited, denoted $\mathcal{J}(s)$. This is for an every-visit method; for a first-visit method, $\mathcal{J}(s)$ would only include time steps that were first visits to $s$ within their episodes. Also, let $T(t)$ denote the first time of termination following time $t$, and $G_t$ denote the return after $t$ up through $T(t)$. Then $\{G_t\}_{t\in\mathcal{J}(s)}$ are the returns that pertain to state $s$, and $\{\rho_t^{T(t)}\}_{t\in\mathcal{J}(s)}$ are the corresponding importance-sampling ratios. To estimate $v_{\pi}(s)$, we simply scale the returns by the ratios and average the results:

\begin{equation} V(s) = \frac{\sum_{t\in\mathcal{J}(s)}\rho_t^{T(t)}G_t}{\lvert\mathcal{J}(s)\rvert}. \end{equation}

When importance sampling is done as a simple average in this way it is called ordinary importance sampling.

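An important alternative is weighted importance sampling, which uses a weighted average, defined as

\begin{equation} V(s) = \frac{\sum_{t\in\mathcal{J}(s)}\rho_t^{T(t)}G_t}{\sum_{t\in\mathcal{J}(s)}\rho_t^{T(t)}}, \end{equation}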

or zero if the denominator is zero.

The difference between the two kinds of importance sampling is expressed in their biases and variances. The ordinary importance-sampling estimator is unbiased whereas the weighted importance-sampling estimator is biased. On the other hand, the variance of the ordinary importance-sampling estimator is in general unbounded because the variance of the ratios can be unbounded, whereas in the weighted estimator the largest weight on any single return is one. In fact, assuming bounded returns, the variance of the weighted importance-sampling estimator converges to zero even if the variance of the ratios themselves is infinite. In practice, the weighted estimator usually has dramatically lower variance and is strongly preferred.
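To see the two estimators side by side, here is a minimal sketch that takes precomputed importance-sampling ratios and returns for the visits to one state. The function names and the numbers are made up for illustration.

```python
def ordinary_importance_sampling(ratios, returns):
    """Simple average of the ratio-scaled returns."""
    if not returns:
        return 0.0
    return sum(rho * g for rho, g in zip(ratios, returns)) / len(returns)


def weighted_importance_sampling(ratios, returns):
    """Ratio-weighted average of the returns, or zero if the denominator is zero."""
    total = sum(ratios)
    if total == 0:
        return 0.0
    return sum(rho * g for rho, g in zip(ratios, returns)) / total


# Hypothetical ratios and returns for three visits to the same state.
ratios = [0.0, 2.0, 4.0]
returns = [1.0, 1.0, 0.5]

v_ordinary = ordinary_importance_sampling(ratios, returns)   # (0 + 2 + 2) / 3
v_weighted = weighted_importance_sampling(ratios, returns)   # (0 + 2 + 2) / 6
print(v_ordinary, v_weighted)
```

The weighted estimate is a convex combination of observed returns, so it always lies between the smallest and largest return. The ordinary estimate can fall outside that range, as it does here.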

Suppose we have a sequence of returns $G_1, G_2, \dots, G_{n-1}$, all starting in the same state and each with a corresponding random weight $W_i$ (e.g., $W_i=\rho_t^{T(t)}$).
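A weighted average of this kind does not need to be recomputed from scratch each time a new return arrives. One common incremental form keeps a running sum of weights $C$ and updates $V \leftarrow V + \frac{W}{C}(G - V)$. The sketch below assumes that form (the function name is made up) and gives the same value as the batch weighted average.

```python
def incremental_weighted_average(returns, weights):
    """Running weighted average: C accumulates weights, V moves toward each G by W / C."""
    V, C = 0.0, 0.0
    for G, W in zip(returns, weights):
        if W == 0:
            continue              # zero-weight returns leave the estimate unchanged
        C += W
        V += (W / C) * (G - V)
    return V


returns = [1.0, 1.0, 0.5]         # hypothetical returns
weights = [0.0, 2.0, 4.0]         # hypothetical weights W_i

v_incremental = incremental_weighted_average(returns, weights)
v_batch = sum(w * g for w, g in zip(weights, returns)) / sum(weights)
print(v_incremental, v_batch)
```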

Consider a sequence of approximate value functions $v_0, v_1, v_2, \dots,$ each mapping $\mathcal{S}^+$ to $\mathbb{R}$. The initial approximation, $v_0$, is chosen arbitrarily, and each successive approximation is obtained by using the Bellman equation for $v_\pi$ as an update rule:

\begin{equation} v_{k+1}(s) = \sum_a\pi(a \vert s)\sum_{s',r}p(s',r \vert s,a)\left[r+\gamma v_k(s')\right], \end{equation}

for all $s\in\mathcal{S}$.
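The update rule translates almost line for line into code. The following is a minimal sketch on a made-up MDP. The dictionary layout `P[s][a] = [(prob, next_state, reward), ...]`, the state names, and the stopping tolerance are all assumptions for illustration. States that do not appear as keys of `P` are treated as terminal with value zero.

```python
def policy_evaluation(P, policy, gamma, tol=1e-12):
    """Iterate v_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v_k(s')]."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {}
        for s in P:
            new_V[s] = sum(
                policy[s][a] * prob * (r + gamma * V.get(s2, 0.0))
                for a in P[s]
                for prob, s2, r in P[s][a]
            )
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V


# Hypothetical MDP: from "a" one can stay or go to "b"; from "b" one reaches terminal "T".
P = {
    "a": {"stay": [(1.0, "a", 0.0)], "go": [(1.0, "b", 0.0)]},
    "b": {"go": [(1.0, "T", 1.0)]},
}
policy = {"a": {"stay": 0.5, "go": 0.5}, "b": {"go": 1.0}}

V = policy_evaluation(P, policy, gamma=0.9)
print(V)
```

For this example the fixed point can be checked by hand: $v_\pi(b)=1$ and $v_\pi(a)=0.45\,v_\pi(a)+0.45$, so $v_\pi(a)=0.45/0.55$.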

Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s\in\mathcal{S}$,

\begin{equation} q_\pi(s,\pi'(s)) \ge v_\pi(s). \end{equation}

Then the policy $\pi'$ must be as good as, or better than, $\pi$. That is, it must obtain greater or equal expected return from all states $s\in\mathcal{S}$:

\begin{equation} v_{\pi'}(s) \ge v_\pi(s). \end{equation}

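This result is called the policy improvement theorem.

Consider the new greedy policy, $\pi'$, given by

\begin{equation} \pi'(s) = \mathrm{argmax}_a\, q_\pi(s,a) = \mathrm{argmax}_a\sum_{s',r}p(s',r \vert s,a)\left[r+\gamma v_\pi(s')\right]. \end{equation}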

The greedy policy takes the action that looks best in the short term according to $v_\pi$. By construction, the greedy policy meets the conditions of the policy improvement theorem. The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement.
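Policy improvement is a one-step lookahead followed by an argmax. The sketch below uses a made-up MDP in the assumed layout `P[s][a] = [(prob, next_state, reward), ...]` and a made-up value table `V` standing in for $v_\pi$. States missing from `V` are treated as terminal with value zero.

```python
def greedy_policy(P, V, gamma):
    """pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    def q(s, a):
        return sum(prob * (r + gamma * V.get(s2, 0.0)) for prob, s2, r in P[s][a])

    return {s: max(P[s], key=lambda a: q(s, a)) for s in P}


# Hypothetical MDP and value function for illustration.
P = {
    "a": {"stay": [(1.0, "a", 0.0)], "go": [(1.0, "b", 0.0)]},
    "b": {"go": [(1.0, "T", 1.0)]},
}
V = {"a": 0.45 / 0.55, "b": 1.0}

improved = greedy_policy(P, V, gamma=0.9)
print(improved)   # {'a': 'go', 'b': 'go'}
```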

In addition, policy improvement must give us a strictly better policy except when the original policy is already optimal.

We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.

- The agent and environment interact at each of a sequence of discrete time steps, $t=0,1,2,3,\dots$
- At each time step $t$, the agent receives some representation of the environment’s state, $S_t\in\mathcal{S}$, where $\mathcal{S}$ is the set of possible states.
- On that basis, the agent selects an action, $A_t \in \mathcal{A}(S_t)$, where $\mathcal{A}(S_t)$ is the set of actions available in state $S_t$.
- One time step later, in part as a consequence of its action, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$.

At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent’s policy and is denoted $\pi_t$, where $\pi_t(a \vert s)$ is the probability that $A_t=a$ if $S_t=s$. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. The agent’s goal, roughly speaking, is to maximize the total amount of reward it receives over the long run.

The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment.

At each time step, the reward is a simple number, $R_t \in \mathbb{R}$. Informally, the agent’s goal is to maximize the total amount of reward it receives. If the sequence of rewards received after time step $t$ is denoted $R_{t+1}, R_{t+2}, R_{t+3}, \dots$, we seek to maximize the expected return, where the return $G_t$ is defined as some specific function of the reward sequence.

Here we define the return as:

\begin{equation} G_t = \sum_{k=0}^{T-t-1}\gamma^kR_{t+k+1}, \end{equation}

where $T$ can be $\infty$ and $0<\gamma\le1$ is the discount rate.
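For a finite episode the discounted return can be accumulated backwards using $G_t = R_{t+1} + \gamma G_{t+1}$. A minimal sketch, with a made-up reward sequence and function name:

```python
def discounted_return(rewards, gamma):
    """G_t for the reward sequence R_{t+1}, R_{t+2}, ..., computed from the end."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G


g = discounted_return([1.0, 2.0, 3.0], gamma=0.5)
print(g)  # 1 + 0.5 * 2 + 0.25 * 3 = 2.75
```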

A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP).

Given any state and action $s$ and $a$, the probability of each possible pair of next state and reward, $s'$, $r$, is denoted

\begin{equation} p(s',r \vert s,a) = \Pr\{S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a\}. \end{equation}

These quantities completely specify the dynamics of a finite MDP.

The value of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $v_\pi(s)$ formally as

\begin{equation} v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t=s\right], \end{equation}

where $\mathbb{E}_\pi[\cdot]$ denotes the expected value of a random variable given that the agent follows policy $\pi$, and $t$ is any time step. We call the function $v_\pi$ the state-value function for policy $\pi$.

Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s,a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:

\begin{equation} q_\pi(s,a) = \mathbb{E}_\pi\left[G_t \mid S_t=s, A_t=a\right]. \end{equation}

We call $q_\pi$ the action-value function for policy $\pi$.

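A fundamental property of value functions used throughout reinforcement learning and dynamic programming is that they satisfy particular recursive relationships. For any policy $\pi$ and any state $s$, the following consistency condition holds between the value of $s$ and the value of its possible successor states:

\begin{equation} v_\pi(s) = \sum_a\pi(a \vert s)\sum_{s',r}p(s',r \vert s,a)\left[r+\gamma v_\pi(s')\right]. \end{equation}

This equation is the Bellman equation for $v_\pi$. Likewise, the Bellman equation for action values $q_\pi$ is as follows:

\begin{equation} q_\pi(s,a) = \sum_{s',r}p(s',r \vert s,a)\left[r+\gamma\sum_{a'}\pi(a' \vert s')q_\pi(s',a')\right]. \end{equation}

According to the Bellman equations, we can derive the relationship between $v_\pi$ and $q_\pi$:

\begin{equation} v_\pi(s) = \sum_a\pi(a \vert s)\,q_\pi(s,a), \qquad q_\pi(s,a) = \sum_{s',r}p(s',r \vert s,a)\left[r+\gamma v_\pi(s')\right]. \end{equation}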

A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi\ge\pi'$ if and only if $v_\pi(s)\ge v_{\pi'}(s)$ for all $s\in\mathcal{S}$. There is always at least one policy that is better than or equal to all other policies. This is an optimal policy $\pi_*$.

The optimal state-value function, denoted $v_*$, is defined as

\begin{equation} v_*(s) = \max_\pi v_\pi(s) \end{equation}

for all $s\in\mathcal{S}$.

The optimal action-value function, denoted $q_*$, is defined as

\begin{equation} q_*(s,a) = \max_\pi q_\pi(s,a) \end{equation}

for all $s\in\mathcal{S}$ and $a\in\mathcal{A}(s)$.

The Bellman optimality equations for $v_*$ and $q_*$ are
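Written out in the notation above, with $p(s',r \vert s,a)$ the dynamics of the finite MDP and $\gamma$ the discount rate, the standard form of these equations is:

\begin{equation} \begin{aligned} v_*(s) &= \max_a\sum_{s',r}p(s',r \vert s,a)\left[r+\gamma v_*(s')\right], \\ q_*(s,a) &= \sum_{s',r}p(s',r \vert s,a)\left[r+\gamma\max_{a'}q_*(s',a')\right]. \end{aligned} \end{equation}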