10.5 Why Exponential Loss?
Derivation of Equation (10.16)
Since $Y\in\{-1,1\}$, we can expand the expectation as follows:

$$E\bigl(e^{-Yf(x)}\mid x\bigr) = \Pr(Y=1\mid x)\,e^{-f(x)} + \Pr(Y=-1\mid x)\,e^{f(x)}.$$

To minimize this expectation, we set its derivative with respect to $f(x)$ to zero:

$$-\Pr(Y=1\mid x)\,e^{-f(x)} + \Pr(Y=-1\mid x)\,e^{f(x)} = 0,$$

which gives:

$$f(x) = \frac{1}{2}\log\frac{\Pr(Y=1\mid x)}{\Pr(Y=-1\mid x)}.$$
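As a quick numerical sanity check of this result, here is a minimal sketch (assuming NumPy is available; the probability values are arbitrary) that minimizes the expected exponential loss over a grid of candidate $f$ and compares the minimizer with half the log-odds:

```python
# Check Equation (10.16): for a fixed p = Pr(Y = 1 | x), the f minimizing the
# expected exponential loss should equal 0.5 * log(p / (1 - p)).
# The probability values below are arbitrary, for illustration only.
import numpy as np

for p in [0.1, 0.3, 0.5, 0.8, 0.95]:
    f_grid = np.linspace(-5, 5, 100001)
    # Expected exponential loss: p * exp(-f) + (1 - p) * exp(f)
    expected_loss = p * np.exp(-f_grid) + (1 - p) * np.exp(f_grid)
    f_star = f_grid[np.argmin(expected_loss)]
    half_log_odds = 0.5 * np.log(p / (1 - p))
    print(f"p={p:.2f}  argmin f={f_star:+.4f}  0.5*log-odds={half_log_odds:+.4f}")
```

The grid minimizer and the closed-form half log-odds agree up to the grid spacing.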
Notes on Equation (10.18)
If $Y=1$, then $Y'=1$, which gives

$$-l\bigl(Y, f(x)\bigr) = -\log p(x) = \log\bigl(1+e^{-2f(x)}\bigr) = \log\bigl(1+e^{-2Yf(x)}\bigr).$$

Likewise, if $Y=-1$, then $Y'=0$, which gives

$$-l\bigl(Y, f(x)\bigr) = -\log\bigl(1-p(x)\bigr) = \log\bigl(1+e^{2f(x)}\bigr) = \log\bigl(1+e^{-2Yf(x)}\bigr).$$
As a result, maximizing the binomial log-likelihood is equivalent to minimizing the deviance $\log\bigl(1+e^{-2Yf(x)}\bigr)$. In the language of neural networks, the binary cross-entropy loss is equivalent to the softplus form; the only difference is that cross-entropy labels negative examples with $0$, while the softplus form labels them with $-1$.
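To make the equivalence concrete, here is a small sketch (assuming NumPy; the margin values $f(x)$ are made up) checking that binary cross-entropy with labels $Y'\in\{0,1\}$ and $p(x)=1/(1+e^{-2f(x)})$ matches the softplus form $\log\bigl(1+e^{-2Yf(x)}\bigr)$ with labels $Y\in\{-1,+1\}$:

```python
# Cross-entropy with Y' in {0, 1} vs. softplus form with Y in {-1, +1}.
import numpy as np

f = np.array([-2.0, -0.5, 0.0, 0.7, 3.0])   # arbitrary margins f(x)
p = 1.0 / (1.0 + np.exp(-2.0 * f))          # Pr(Y = 1 | x) as in (10.17)

for y in (-1, 1):
    y_prime = (y + 1) // 2                  # map {-1, +1} -> {0, 1}
    cross_entropy = -(y_prime * np.log(p) + (1 - y_prime) * np.log(1 - p))
    softplus = np.log1p(np.exp(-2.0 * y * f))
    assert np.allclose(cross_entropy, softplus)
print("cross-entropy and softplus losses agree")
```

The two losses coincide for both label conventions, which is exactly the point made above.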
10.6 Loss Functions and Robustness
This section explains the choice of loss functions for both classification and regression. It gives a very direct explanation of why squared-error loss is undesirable for classification. Highly recommended!