Billy Ian's Short Leisure-time Wander

into language, learning, intelligence and beyond

[Notes on Mathematics for ESL] Chapter 6: Kernel Smoothing Methods

6.1 One-Dimensional Kernel Smoothers

Notes on Local Linear Regression

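Locally weighted regression solves a separate weighted least squares problem at each target point $x_0$:

$$\min_{\alpha(x_0),\beta(x_0)}\sum_{i=1}^N K_\lambda(x_0,x_i)\left[y_i-\alpha(x_0)-\beta(x_0)x_i\right]^2.$$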

The estimate is $\hat f(x_0)=\hat\alpha(x_0)+\hat\beta(x_0)x_0$. Define the vector-valued function $b(x)^T=(1,x)$. Let $\mathbf{B}$ be the $N \times 2$ regression matrix with $i$th row $b(x_i)^T$, $\mathbf{W}(x_0)$ the $N\times N$ diagonal matrix with $i$th diagonal element $K_\lambda (x_0, x_i)$, and $\theta=(\alpha(x_0), \beta(x_0))^T$.

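Then the above optimization problem can be rewritten as

$$\min_\theta\,(y-\mathbf{B}\theta)^T\mathbf{W}(x_0)(y-\mathbf{B}\theta),$$

where $y=(y_1,\dots,y_N)^T$ is the response vector.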

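Setting the derivative with respect to $\theta$ to zero, we get

$$\hat\theta=\left(\mathbf{B}^T\mathbf{W}(x_0)\mathbf{B}\right)^{-1}\mathbf{B}^T\mathbf{W}(x_0)y,$$

so that the estimate is linear in the responses:

$$\hat f(x_0)=b(x_0)^T\hat\theta=b(x_0)^T\left(\mathbf{B}^T\mathbf{W}(x_0)\mathbf{B}\right)^{-1}\mathbf{B}^T\mathbf{W}(x_0)y=\sum_{i=1}^N l_i(x_0)y_i.$$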
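
To make the weighted least squares solution concrete, here is a minimal NumPy sketch. The Epanechnikov kernel, the bandwidth, the toy data and the function names (`epanechnikov`, `local_linear_fit`) are my own assumptions for illustration; the notes above do not fix any of them.

```python
import numpy as np

def epanechnikov(x0, x, lam):
    """Epanechnikov kernel K_lambda(x0, x) with bandwidth lam (an assumed choice)."""
    t = np.abs(x - x0) / lam
    return np.where(t <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def local_linear_fit(x0, x, y, lam):
    """Return f_hat(x0) = b(x0)^T (B^T W B)^{-1} B^T W y."""
    B = np.column_stack([np.ones_like(x), x])   # N x 2 regression matrix
    w = epanechnikov(x0, x, lam)                # diagonal of W(x0)
    BtW = B.T * w                               # B^T W without forming the N x N matrix
    theta = np.linalg.solve(BtW @ B, BtW @ y)   # (alpha_hat(x0), beta_hat(x0))
    return np.array([1.0, x0]) @ theta

# Toy data, made up for illustration: a noiseless straight line.
x = np.linspace(0.0, 1.0, 21)
y = 2.0 + 3.0 * x
fit = local_linear_fit(0.3, x, y, lam=0.25)
print(fit)
```

Since the toy responses lie exactly on a line, the local fit should reproduce that line at $x_0$ up to rounding error, which is a handy sanity check before trying noisy data.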


The book claims that $\sum_{i=1}^Nl_i(x_0)=1$ and $\sum_{i=1}^N(x_i-x_0)l_i(x_0)=0$, so that the bias $\text{E}(\hat f(x_0))-f(x_0)$ depends only on quadratic and higher-order terms in the expansion of $f$. However, no proof is given. Here I give detailed derivations of these two equations.

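First, define the following terms, writing $K_i=K_\lambda(x_0,x_i)$:

$$S_j=\sum_{i=1}^N K_i x_i^j\quad(j=0,1,2),\qquad m_j=\sum_{i=1}^N K_i x_i^j y_i\quad(j=0,1).$$

With these, $\mathbf{B}^T\mathbf{W}(x_0)\mathbf{B}$ is the $2\times 2$ matrix with rows $(S_0,S_1)$ and $(S_1,S_2)$, and $\mathbf{B}^T\mathbf{W}(x_0)y=(m_0,m_1)^T$. Then, we can represent the estimate as

$$\hat f(x_0)=\frac{1}{S_0S_2-S_1^2}\begin{pmatrix}1&x_0\end{pmatrix}\begin{pmatrix}S_2&-S_1\\-S_1&S_0\end{pmatrix}\begin{pmatrix}m_0\\m_1\end{pmatrix}=\frac{(S_2-x_0S_1)m_0+(x_0S_0-S_1)m_1}{S_0S_2-S_1^2}.$$

When $y=\mathbf{1}$, $m_0=S_0$ and $m_1=S_1$, and since $\hat f(x_0)=\sum_{i=1}^N l_i(x_0)y_i$ we get

$$\sum_{i=1}^N l_i(x_0)=\frac{(S_2-x_0S_1)S_0+(x_0S_0-S_1)S_1}{S_0S_2-S_1^2}=\frac{S_0S_2-S_1^2}{S_0S_2-S_1^2}=1.$$

When $y=\mathbf{x}-x_0$ (that is, $y_i=x_i-x_0$), $m_0=S_1-x_0S_0$ and $m_1=S_2-x_0S_1$, so

$$\sum_{i=1}^N (x_i-x_0)l_i(x_0)=\frac{(S_2-x_0S_1)(S_1-x_0S_0)+(x_0S_0-S_1)(S_2-x_0S_1)}{S_0S_2-S_1^2}=0,$$

because the two products in the numerator share the factor $S_2-x_0S_1$ and their remaining factors are negatives of each other.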
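
The two special cases can also be checked numerically. The sketch below assumes the kernel moments $S_j=\sum_i K_\lambda(x_0,x_i)x_i^j$ and $m_j=\sum_i K_\lambda(x_0,x_i)x_i^jy_i$ and writes out the $2\times 2$ inverse by hand; the Gaussian kernel, the design points and the name `fit_closed_form` are illustrative assumptions of mine.

```python
import numpy as np

def fit_closed_form(x0, x, y, k):
    """f_hat(x0) from the kernel moments S_j and m_j, with the 2 x 2 inverse written out."""
    S0, S1, S2 = k.sum(), (k * x).sum(), (k * x**2).sum()
    m0, m1 = (k * y).sum(), (k * x * y).sum()
    return ((S2 - x0 * S1) * m0 + (x0 * S0 - S1) * m1) / (S0 * S2 - S1**2)

x = np.array([0.0, 0.2, 0.5, 0.6, 0.9, 1.0])    # made-up design points
x0 = 0.4
k = np.exp(-0.5 * ((x - x0) / 0.3) ** 2)         # Gaussian kernel weights (an assumption)

# Feeding y = 1 returns sum_i l_i(x0); feeding y = x - x0 returns the first moment.
sum_l = fit_closed_form(x0, x, np.ones_like(x), k)
first_moment = fit_closed_form(x0, x, x - x0, k)
print(sum_l, first_moment)
```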

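More generally, for a local polynomial fit of degree $d$ one can show in the same way that $\sum_{i=1}^N(x_i-x_0)^pl_i(x_0)=0$ for $1\le p\le d$. For the local linear fit ($d=1$) this covers only $p=1$: the moments with $p\ge 2$ are generally nonzero, which is exactly why the bias still involves quadratic and higher-order terms.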
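
A quick numerical look at these moments, as a sketch rather than a proof: the code below fits a local polynomial of a chosen degree and evaluates $\sum_i (x_i-x_0)^p l_i(x_0)$ for several $p$. The Gaussian kernel, the grid, the target point and the helper name `moments` are assumptions for illustration. In this toy setup the first moment vanishes for the linear fit while the second does not; raising the degree to 2 makes the second moment vanish as well.

```python
import numpy as np

def moments(x0, x, lam, degree, max_p):
    """Return [sum_i (x_i - x0)^p * l_i(x0) for p = 0..max_p] for a local polynomial fit."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)                 # Gaussian kernel (an assumption)
    B = np.vander(x, degree + 1, increasing=True)            # columns 1, x, ..., x^degree
    b0 = np.vander(np.array([x0]), degree + 1, increasing=True)[0]
    BtW = B.T * w                                            # B^T W(x0)
    l = b0 @ np.linalg.solve(BtW @ B, BtW)                   # equivalent-kernel weights l_i(x0)
    return [float(((x - x0) ** p * l).sum()) for p in range(max_p + 1)]

x = np.linspace(0.0, 1.0, 41)
linear = moments(0.3, x, lam=0.2, degree=1, max_p=3)         # local linear fit
quadratic = moments(0.3, x, lam=0.2, degree=2, max_p=3)      # local quadratic fit
print(linear)
print(quadratic)
```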

This proof covers only one-dimensional input $x$. A similar strategy works for higher-dimensional input, though the algebra gets a bit more involved; give it a try if you're interested. Have fun!