As the already gigantic modern language models grow ever larger by the day, people seem to ignore many existing works discussing their limitations. In this blog post, I try to connect the results and observations from several such works to form the “shortcut learning hypothesis” for those language models. The hypothesis implies that modern language models are fundamentally ill-suited to capturing long-range dependencies or complicated structures in the data, as long as we stick with the current training objective. Hopefully, this post can help people take a step back and ponder a bit before going all in on the current scale competition among language models.
A 137B parameter language model fails to answer a simple question with prompting. (The example comes from this paper) |
A language model (LM) is an estimate $Q(x_{1:T})$ of the true underlying probability distribution $P(x_{1:T})$ over sequences of text $x_{1:T}=(x_1,\dots,x_T)$, consisting of tokens $x_t$ from a fixed vocabulary. Prevailing neural language models estimate the joint distribution $Q(x_{1:T})$ autoregressively, so that it is implicitly defined by the conditional distributions $Q(x_t \vert x_{1:t-1})$. Such an autoregressive factorization possesses several benefits:
It is standard for such a language model to be trained to minimize the cross entropy objective:
With such a formulation, the problem becomes one of the most basic learning tasks: predicting the next observation $x_t$ given a sequence of past observations $x_1,x_2,\dots,x_{t-1}$. The objective is basically to minimize the “average error (uncertainty / entropy)” on the token level, which is commonly reported in the literature as “perplexity” (the exponential of the average cross entropy). Masked token prediction, a popular variant of the standard objective first proposed in the BERT paper, breaks the autoregressive factorization, but its “average error” character remains unchanged.
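To make the “average error” concrete, here is a minimal Python sketch that computes the token-level average cross entropy and the corresponding perplexity. The values in `probs` are made up for illustration; they stand in for the probability a model assigned to each observed token given its past, and do not come from any real model.

```python
import math

def average_cross_entropy(token_probs):
    """Mean negative log-probability (in nats) assigned to the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity is the exponential of the average cross entropy."""
    return math.exp(average_cross_entropy(token_probs))

# Hypothetical per-token probabilities for a five-token sequence.
probs = [0.5, 0.25, 0.5, 0.125, 0.5]

print("average cross entropy (nats):", average_cross_entropy(probs))
print("perplexity:", perplexity(probs))
```

Note that every token contributes equally to the mean, so a single badly predicted token (here the one with probability 0.125) is diluted by the well-predicted ones around it.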
For the natural languages, the true distributions $P$ exhibit complex interactions between distant observations, i.e., long-range dependencies. Modern LMs based on the Transformer architecture have achieved tremendous successes in NLP by pushing the “average error” low enough. However, these large models still struggle to effectively capture long-range dependencies, for example, generating long and coherent stories, answering questions depending on long context or robustly performing numerical reasoning.
One limitation of Transformer-based LMs is the fixed number of tokens they can encode at once, and increasing this number linearly incurs a quadratic computational cost. Recently, a lot of effort has been dedicated to making the Transformer architecture more efficient at encoding longer context, in the hope of helping it capture long-range dependencies. However, one recent EMNLP paper showed that it remains unclear whether encoding longer context actually helps capture long-range dependencies. It seems that the LMs still largely rely on the most recent observations to make their predictions, even though they have direct access to the distant past.
The STOC paper “Prediction with a Short Memory” presents an interesting theoretical result. The paper is quite technical, but the key result is quite intuitive. Here is the most important proposition (to me) from the paper:
Let $\mathcal{M}$ be any distribution over sequences with mutual information $I(\mathcal{M})$ between past observations $\dots,x_{t-2},x_{t-1}$ and future observations $x_t, x_{t+1},\dots$. The best $l$-th order Markov model, which makes predictions based only on the most recent $l$ observations, predicts the distribution of the next observation with average KL error $I(\mathcal{M})/l$, with respect to the actual conditional distribution of $x_t$ given all past observations.
Essentially, it shows that a Markov model – a model that cannot capture long-range dependencies or structure of the data – can predict accurately on any data-generating distribution, provided the order of the Markov model scales with the complexity of the distribution, as parameterized by the mutual information between the past and future. Strikingly, this parameterization is indifferent to whether the dependencies in the sequence are relatively short-range or very long-range. Independent of the nature of these dependencies, provided the mutual information is small, accurate prediction is possible based only on the most recent few observations.
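As a rough illustration of how this guarantee scales, the sketch below simply evaluates the $I(\mathcal{M})/l$ expression quoted in the proposition. The function names and the numbers are hypothetical, and the mutual information of a real text corpus is not known in closed form, so this only shows the shape of the trade-off.

```python
import math

def kl_error_bound(mutual_information, order):
    """Average KL error I(M)/l quoted for the best order-l Markov model."""
    return mutual_information / order

def order_for_target_error(mutual_information, target_kl):
    """Smallest integer order l such that I(M)/l <= target_kl."""
    return math.ceil(mutual_information / target_kl)

# Hypothetical distribution with 2 nats of past/future mutual information.
I_M = 2.0
for order in (1, 4, 16, 64):
    print(order, kl_error_bound(I_M, order))

print("order needed for 0.125 nats:", order_for_target_error(I_M, 0.125))
```

Nothing in these two functions depends on how far apart the dependent tokens are, which is the point of the proposition: only the total amount of shared information enters the bound.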
Intuitively, it means that the “average error” can be pushed low enough without capturing long-range dependencies at all, by only doing well on the time steps when prediction relies little on long-range dependencies. There is one condition for this argument to be valid: the amount of long-range dependencies must be small. To get a sense of whether this condition is met, let’s take a partition of the sequence $x_{1:T}$ into $A = x_{1:t}$ and $B=x_{t+1:T}$. Then, the cross entropy objective is equivalent to:
As $T$ increases, the configuration space of $B$ grows exponentially, $\lvert\mathcal{B}\rvert \sim d^{T-t}$, where $d$ is the vocabulary size and $\mathcal{B}$ is the set of all possible instances of $B$. However, with one specific instance of $A$ fixed, the number of possible dependencies with $\mathcal{B}$ remains relatively small. As a result, the dependencies between $A$ and $B$ for large $T$ are very rare. In short, long-range dependencies are likely very sparse on average in the data.
The above result and intuition imply that the “average error”, though ubiquitously used in practice, is not a good metric to train and evaluate the LMs, if we are interested in capturing long-range dependencies. As long as the number of dependencies is not too large (usually valid), models with no capability to capture long-range dependencies can still perform well under the “average error”.
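A back-of-the-envelope sketch of this argument, under loudly stated assumptions: suppose only 1% of time steps truly need long-range context, a short-memory model can do no better than a uniform guess on those steps, and both models reach the same loss everywhere else. All of the constants below are invented for illustration.

```python
import math

def average_loss(easy_loss, hard_loss, hard_fraction):
    """Token-level average when a fraction of the steps are 'hard' (long-range)."""
    return (1.0 - hard_fraction) * easy_loss + hard_fraction * hard_loss

VOCAB_SIZE = 50_000                   # assumed vocabulary size
EASY_LOSS = 2.0                       # assumed loss (nats) on short-range steps
HARD_FRACTION = 0.01                  # assumed share of long-range steps
UNIFORM_GUESS = math.log(VOCAB_SIZE)  # loss of a uniform guess over the vocabulary

# Short-memory model: fails completely on the hard steps.
short_memory = average_loss(EASY_LOSS, UNIFORM_GUESS, HARD_FRACTION)
# Idealized model: handles the hard steps as well as the easy ones.
long_range = average_loss(EASY_LOSS, EASY_LOSS, HARD_FRACTION)

relative_gap = (short_memory - long_range) / long_range
print(f"short-memory: {short_memory:.3f} nats, long-range: {long_range:.3f} nats")
print(f"relative gap: {relative_gap:.1%}")
```

Under these assumptions the model that ignores every long-range dependency pays only a few percent more in “average error”, a gap that is easy to lose among ordinary modelling improvements.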
One may argue that such a result only reveals that models with a short memory can perform well measured by the “average error”. But it is not direct evidence that modern LMs trained with the “average error” are fundamentally flawed to capture long-range dependencies. Though without rigorous proof, I hypothesize that shortcut learning exists in modern LMs as a direct outcome of optimizing the “average error”. As put in the paper “Shortcut Learning in Deep Neural Networks”, “shortcut learning typically reveals itself by a strong discrepancy between intended and actual learning strategy, causing an unexpected failure”. The figure below shows a toy example of shortcut learning.
Toy example of shortcut learning. When trained on a simple dataset of stars and moons (top row), a standard neural network can easily categorise novel similar exemplars (middle row). However, testing it on a slightly different dataset (bottom row) reveals a shortcut strategy: The network has learned to associate object location with a category. |
In the case of modern LMs, we hope that they will capture long-range dependencies naturally by scaling up with more and more parameters and data. However, with the “average error” as the main (usually, the only) objective, they learn unintended shortcuts, leveraging only recent observations while ignoring most of the long-range dependencies. And as suggested by the above-mentioned theoretical result, such shortcuts indeed exist. (Such a phenomenon can also be interpreted as learning of spurious correlations in the data, encouraged by the inductive bias introduced by the “average error” objective.)
Actually, many empirical observations support the “shortcut learning hypothesis”. The most obvious evidence to me is the current large LMs’ inability to effectively capture long-range dependencies and understand complicated structures of the data, manifested in many complicated real-world tasks:
From my point of view, the models are learning in an unintended way that differs from what we, as humans, expect. Despite performing well under the “average error”, they try to generalize in places where memorization is actually needed (e.g., factual information), and memorize in places where generalization is actually needed (e.g., logic and reasoning). This is the typical characteristic of shortcut learning!
More clues are also observed in different quantitative ways. Our previous AISTATS paper found that there is a large discrepancy between $I(P)$ and $I(Q)$. Recall that $I(\mathcal{M})$ is the mutual information between past and future observations of any distribution $\mathcal{M}$, $P$ is the true distribution, and $Q$ is the model distribution. Basically, it shows that the learnt model distribution $Q$ exhibits much fewer long-range dependencies than the true distribution $P$. It may be the reason why the current language models are unable to generate long coherent documents. Long-range dependencies are largely lost when conditioning on the models’ own outputs.
In a similar vein, this ICML paper observed that the entropy rate of the model distribution $Q$
diverges quickly from the cross entropy objective
as the length of the generated sequences $T$ increases. Ideally, for an accurate language model, we expect that $CE(P \Vert Q) \approx EntRate(Q)$. Such a divergence means that the language models become increasingly uncertain conditioned on their own outputs, even though they are able to push the “average error” to a very low level with respect to the true distribution. It also highlights that the learnt model distribution $Q$ ignores some crucial properties of the true distribution $P$. Why? Likely these properties, albeit important, do not offer much direct help to decrease the “average error”.
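To see what the two quantities measure, here is a deliberately tiny Python sketch. It assumes the usual per-token definitions (cross entropy averages $-\log Q$ over text drawn from $P$, entropy rate averages $-\log Q$ over text drawn from $Q$ itself) and uses made-up i.i.d. token distributions. Because the tokens are i.i.d., it cannot reproduce the drift with $T$ reported in the paper; it only shows that the two numbers agree when $Q=P$ and can differ otherwise.

```python
import math

def cross_entropy(p, q):
    """Per-token CE(P || Q) = -sum_x P(x) log Q(x), for i.i.d. tokens."""
    return -sum(p[tok] * math.log(q[tok]) for tok in p if p[tok] > 0)

def entropy_rate(q):
    """Per-token entropy of text sampled from Q itself: -sum_x Q(x) log Q(x)."""
    return -sum(pr * math.log(pr) for pr in q.values() if pr > 0)

# Hypothetical three-token vocabulary.
P = {"a": 0.5, "b": 0.25, "c": 0.25}   # stand-in for the true distribution
Q = {"a": 0.25, "b": 0.25, "c": 0.5}   # stand-in for an imperfect model

print("CE(P||P) vs EntRate(P):", cross_entropy(P, P), entropy_rate(P))
print("CE(P||Q) vs EntRate(Q):", cross_entropy(P, Q), entropy_rate(Q))
```

The first quantity is what training pushes down, while the second describes how the model behaves when it runs on its own samples, so a low value of one does not by itself pin down the other.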
Both of these two observations provide additional empirical support that shortcut learning indeed exists for modern LMs, as a direct outcome of only optimizing the “average error”.
The shortcut learning hypothesis clearly suggests that a better metric/objective may be essential for modern LMs to better capture long-range dependencies or complicated structure of the data. One possibility suggested in the “Prediction with a Short Memory” paper is to only train and evaluate the models at a chosen set of (hard) time steps instead of all time steps. Hence the models can no longer do well with those unintended shortcuts.
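A minimal sketch of what evaluating only at chosen time steps could look like. How the “hard” steps are selected is the real research question and is not specified here; the `hard_mask` below is simply assumed to be given, and the loss values are invented.

```python
def average_over_all_steps(losses):
    """The usual token-level 'average error'."""
    return sum(losses) / len(losses)

def average_over_hard_steps(losses, hard_mask):
    """Average the loss only over the time steps flagged as hard."""
    selected = [loss for loss, is_hard in zip(losses, hard_mask) if is_hard]
    if not selected:
        raise ValueError("hard_mask selects no time steps")
    return sum(selected) / len(selected)

# Hypothetical per-token losses (nats) and a hypothetical choice of hard steps.
losses = [0.5, 0.25, 4.0, 0.25, 3.0, 0.5]
hard_mask = [False, False, True, False, True, False]

print("all steps :", average_over_all_steps(losses))
print("hard steps:", average_over_hard_steps(losses, hard_mask))
```

On the masked metric, a model can no longer hide poor predictions at the hard steps behind many easy ones.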
Actually, a similar idea is already widely adopted in the practice of natural language processing. When GPT-3 and BERT first came out, they seemed to possess amazing “zero-shot” or “few-shot” transferability. However, people soon realized that such capabilities were largely overestimated, especially on harder tasks, e.g., question answering. Instead, “fine-tuning” the model weights with a downstream objective usually achieves much better performance. As revealed by the latest InstructGPT from OpenAI, a smart fine-tuning strategy can make small models outperform much larger models. The wide success of fine-tuning again validates the “shortcut learning hypothesis” and implies that only optimizing the “average error” is not enough in the end.
However, such a strategy is imperfect in many ways:
Many attempts to resolve these issues, like the multi-task pre-training and fine-tuning reported in the T5 paper from Google, did not achieve much success. This is unsurprising to me, since the annotated data, even combined across tasks, is tiny compared to the raw data used for unsupervised pre-training. As a result, fine-tuning is not good enough and we need a better unsupervised objective than the “average error”.
While the “average error” is still the dominant unsupervised objective for training LMs, the original BERT actually introduced two promising directions to improve upon the “average error” metric:
Generally, I think both new ways to factorize the joint distribution $P(x_{1:T})$, and new unsupervised objectives beyond the “average error” are worth exploring further down the road. Another interesting direction to me is to leverage the massive information available on the web to serve as LM’s external memory, that is, retrieval-based LMs. Interestingly, three prestigious industrial AI labs all released their efforts in this direction recently. However, they all still rely either on the “average error” (RETRO from DeepMind) or fine-tuning on a human-annotated dataset (LaMDA from Google and WebGPT from OpenAI). This recent paper from Brain points out an intriguing direction to break down long complicated dependencies into short simple dependencies, though it’s completed through prompting. As suggested by this paper, prompting is unlikely to work on task semantics not close enough to the LM pre-training objective. But these initial efforts are definitely meaningful and provide us guidance towards a better LM paradigm which may circumvent the “shortcut learning hypothesis” to better capture long-range dependencies and understand complicated structures of the data.
Benefitting from the success of scaling, modern LMs are becoming general-purpose models able to handle all kinds of tasks (through fine-tuning and prompting). As a result, their impact and implications are also becoming ever larger. In addition, the discussions of modern LMs are becoming more and more controversial on social media. As the stakes are so high right now, frank discussions about their limitations, like this latest paper from Anthropic, appear all the more indispensable. Hopefully, this post can also contribute to that purpose.
Thanks for the valuable feedback from Vatsal Sharan (USC/Stanford), Robert Geirhos (University of Tübingen), Yu Hou (USC), Guy Gur-Ari (X, Blueshift), Denny Zhou (Google Brain), Jeff Dean (Google Cloud ML / Google Research) and many other Google colleagues of mine.
Figure 1: The AI Job Spectrum |
While the aforementioned posts both focus on the research end of the spectrum, this post covers the whole range of the spectrum from my own experience. Out of the five virtual onsites I attended, I received three offers. One of them was a Machine Learning Software Engineer / Research Engineer role from Google. The other two were both Machine Learning Scientist roles, from a large tech company and a large financial company respectively. For the offer from Google, the team match process was quite lengthy; I talked with eight different teams inside Google. All those different positions cover the whole spectrum from heavily research focused to purely product driven, as shown in the figure below:
Figure 2: My options and their rough positions on the spectrum |
It’s a long grind from submitting resumes to making the final decision. During this process, I’ve had more than 20 rounds of discussion with different hiring managers, directors and other more senior people from these three companies. In addition, I’ve consulted with many friends of mine in the industry to get advice.
At the beginning of my job search, my preference was leaning towards the research end of the spectrum. However, as things developed, my perception of the AI industry also changed gradually. Eventually I chose a team at Google leaning towards the applied end of the spectrum. It’s a team in Cloud AI working on dialogue systems; it depends heavily on the latest NLP technology (my expertise), already generates a lot of value and has great growth potential on tap. At the same time, there are also plenty of opportunities to collaborate with the research teams inside Google.
In this post, I’d like to share my whole journey, accompanied by some high-level suggestions and my thought process towards the final decision. Hopefully, this retrospective can provide some useful information for those who are passionate about AI and hope to find a matching position in the industry.
(In this post, I didn’t provide that much detail about my preparations, my interviews, my timelines, the companies and how I negotiated compensation. For more comprehensive guidance on ML interviews, I highly recommend the free Introduction to Machine Learning Interviews Book by Chip Huyen.)
Modern machine learning has produced many landmark achievements (AlexNet, AlphaGo, GPT-3, to name a few) in a very short period of time. As a result, AI became the darling of both academia and industry a few years ago. At the start, all seemed to be rosy: studying AI promised a bright future; working in AI brought money and fame; and as astronomical amounts of money were poured into the AI industry, ambitious goals like autonomous driving and AGI looked to be only inches from our grasp. No wonder more and more talented people around the globe are attracted to study and work on it.
Unfortunately, reality comes knocking eventually. A few years into the party, people suddenly realized that AI was still struggling to recover the invested capital. Even where it can bring utility and value, it also brings unexpected consequences, like issues of fairness and privacy. At the same time, the competition in both academia and industry grows fiercer, as more people are competing for fewer opportunities. Against this backdrop, I started my job-hunting journey.
At the time I started, I was a Machine Learning Researcher working at Borealis AI, an AI lab supported by RBC, Royal Bank of Canada. I am eternally grateful for my time at Borealis and the opportunities it provided. Since I joined, I was mentored and guided by great people and was privileged to do exciting ML research which I am passionate about. I was able to publish a few first-authored papers at top ML/NLP venues and grew quickly as an ML researcher and engineer. For my detailed background, you can check my homepage.
So with confidence that my resume was in good shape, I sent it out to around 20 companies. But things did not turn out as well as I expected. Only 7 companies responded positively and only 5 invited me for virtual onsites. As I had been told by many and can now attest, getting interviews may well be the hardest part of getting a job. As mentioned earlier, the AI job market has become increasingly competitive and the expectations for candidates have also risen dramatically. If you want to do research in the current AI industry, a PhD degree is usually required explicitly or implicitly, and I do not possess one. Another important thing I do not possess is the legal status to work in the US. There is no doubt that the US is the place with the most opportunities if you want to do interesting ML work. My legal status certainly introduced obstacles for US companies considering my candidacy. But if I restricted my target companies to Canada, I really did not have many choices to do what I wanted. With these two main disadvantages, outright rejections and lack of response from companies were really not much of a surprise.
In retrospect, I feel that I might have gotten more interview opportunities if I had been more proactive in asking for referrals on social media, like Twitter and LinkedIn. Don’t be shy, and don’t be scared off by the occasional bad experience. In addition, keep an open mind and do not set too many restrictions on the positions that you apply for. It never hurts to have multiple choices, and you can always decide afterwards. I could definitely have done better in these two aspects, and I encourage you to do so if you are in a similar situation.
There are always things out of our control. We have no choice but to learn to ignore them, stop complaining, and dedicate our attention to the things within our control. So I proceeded to prepare for the interview opportunities that remained. My interview preparations were six-fold:
These six elements basically cover all aspects you can expect in interviews for most ML roles. Preparing well for all six parts is not an easy task. At first, you may feel a bit overwhelmed or even question the necessity of preparing for something that you think you may never use in practice. But from my experience, it may be good to stimulate your mind with something different occasionally.
Instead of treating these preparations as a tedious task, you can use the opportunity to learn something new, refresh some rusty knowledge, summarize your past successes and learn from your past failures. In my own case, I went through the classic CSAPP and learnt the C programming language from scratch again. It was actually fun to use some newly learned low-level tricks to significantly accelerate my LeetCode solutions. In fact, engineering is (almost) an essential part of doing impactful work in AI nowadays. People in the AI industry are keen to make their deployed models more efficient. Sometimes the knowledge of those low-level tricks can give you extra credit when the interviewers ask questions on how to improve model efficiency. And I’ve encountered those questions multiple times!
The same applies to product sense, that is, a clear understanding of the gap between research and real-world applications. The complete life cycle of deploying and maintaining an ML model in real-world scenarios goes way beyond the modelling part, as shown in the figure below. And there are always a lot more objectives to balance besides the single objective your model is optimizing for. Even though you may not work on these parts directly, learning how models are deployed in practice can keep your ego in check, and make you more appreciative of other branches of computer science besides AI, and even other sciences beyond CS like human psychology. I’m pretty sure that there are many hidden gems in these fields yet to be discovered which can also help advance AI.
Figure 3: Other components to deploy AI into practice |
In addition, it is a great opportunity to review the work you’ve done in the past, since those topics will certainly be mentioned again and again in your interviews. So be prepared to talk about them, and be ready to dig into some technical details if the interviewer is also an expert in the areas you’ve worked on. Also, don’t forget to spend some time thinking about the big picture of your field or AI as a whole. What’s the next big research problem you want to solve? Any potential solutions? And what is your plan to use it to generate real-world impact?
In the end, the real goal of interview preparation is to build up your confidence and to be comfortable talking no matter what situations you encounter in the interview process. If I can only give one suggestion for interviews, it is to treat the interviewers as your equals and communicate with them like colleagues. By adopting this attitude, the interview experience will be less stressful, and you may even enjoy the process and learn something new from it!
When I look back, the whole process seems to be natural. But during the process, be prepared for the (almost) inevitable rejections as well as the accompanying pain and frustration. Do not take the rejections personally and move on quickly!
Take myself as an example: out of my five virtual onsites, two companies did not give me offers. Though I thought I was doing pretty well in all the technical rounds, the final decisions may not have depended on the interview performances. One position I got rejected from was a machine learning engineer role purely focused on recommendation systems in production. After a frank chat with the hiring manager, they decided not to proceed right away, presumably due to unaligned interests. Another position was a research engineer role at a prestigious research lab, but the role expectation seemed to be mainly engineering support for the researchers. And no offer was made, potentially due to my not-that-strong engineering background.
Though the sample size is small, you can get a sense that the reasons for rejection vary. Most of the time it just means that you and the position are not a good match. There is nothing wrong from your side, and usually you can do little about it (unfortunately). So shake those unpleasant feelings off quickly and focus on the next interview you can control!
To form our decisions, we need first to understand what we really want to do. This is a key to staying motivated. I would assume that most AI people like myself want to do impactful AI work. We want to advance our scientific understanding of intelligence and at the same time we want to have real-world impact and create value for other people and our society. However, long-term impact is almost impossible to predict. One example is that “the best paper award” is never a good indicator of the future impact to the field. Instead, people use metrics, such as citations, number of publications in top venues and ranks on the leaderboards, as proxies to measure impact. As pointed out by many, those metrics are far from perfect. Worse, only caring to optimize those metrics in the short term actually creates many issues in the current AI academia and industry.
So why do we still use them even if we know that they are not that good? I think it’s because they are fairly straightforward to optimize, just like supervised learning. We have cheap labels and the gradients (the directions of effort) can be conveniently calculated. Of course, some great advancements in AI have been achieved, incentivized by optimizing these metrics (e.g., by pushing the scores higher on ImageNet and GLUE). But recently the outcomes from optimizing these metrics have started to diverge more and more from our real goals. With exponential growth of citations and papers, do they really help advance our understanding of intelligence faster? With numbers on all the leaderboards being pushed higher and higher, do they lead to the creation of real value for people, increases in productivity or the mitigation of inequalities? I am really not that optimistic about the answers to these questions.
However, don’t be pessimistic prematurely. Let’s take a step back to look at the history between science and productivity. Starting from the Industrial Revolution, the advancement of science led to major breakthroughs like the steam engine, electricity and the Internet. Those technological breakthroughs were widely applied in human lives and created massive value, including the profits generated through commercialization. A portion of those profits was returned to research and development activities, and subsequently contributed to the advancement of science. The real force under the hood is always people’s desire to improve general productivity, so that people’s time and energy can be saved for better purposes and society can prosper as a consequence. “History never repeats itself but it rhymes.” In my view, AI definitely has the potential to revolutionize the landscape of human lives just as those major breakthroughs did. However, its current impact on reality is still nowhere near its full potential.
Unfortunately, from research to real-world impact is a long feedback loop, somewhat like reinforcement learning. The rewards are sparse and the gradients (the directions of effort) are hard to estimate along with inherently high variance. There are many possible trajectories, but most of them only lead to failure. Furthermore, there are many critical decisions and efforts which may not be even relevant to AI itself. But someone needs to take up the challenges to bridge the gaps between research and the reality, and to demonstrate AI’s potential through creation of real value; otherwise, current AI development does not deserve the attention, the talent and the capital it currently receives, and we will eventually descend into another AI winter. In the end, realizing AI’s full potential to create real-world value should be our real goals, at least in the industry.
Actually, the general attitude towards AI is becoming more pragmatic in industry, as I’ve heard from all the conversations during my job search. Even some prestigious industrial research labs, which an outsider like myself would previously have thought were mainly interested in fundamental research, are becoming increasingly interested in making real-world impact. Sometimes, people may feel frustrated about such a change due to the consequent restrictions on the research we can do (or because we still care too much about things like citations and paper numbers?). I felt the same way when I went through such a transition myself. But now I think a more pragmatic attitude may not be negative for AI in the long run. Creating value, improving productivity and mitigating inequalities through AI are definitely better goals for us than those previously mentioned metrics, even though they are much harder to achieve.
You may think those goals are too grand, but those long-term goals can be decomposed into very realistic tasks (e.g., delivering corporate bottom lines or creating successful start-ups). Sometimes, people may think it’s all about money. But that’s just how all great technologies revolutionize human society and bring massive common good for the entire human race. The generated wealth is just the effect of such a process, not the cause of it. The real cause is the ambitious and capable people who turn these goals into reality.
Furthermore, I think a better understanding of real-world applications should never hurt and usually helps the research work, even for researchers focusing on theory. And great theoretical work should always give some useful guidance or hints for practice (though it may not happen immediately). Shannon’s theorem is a classic example. No one can deny its immense practical impact on information technology.
In reality, all parts of the AI job spectrum can contribute to the goal of making real-world impact, though along different paths. At the applied end of the spectrum, the work is more connected (or restricted) to the real world; at the research end of the spectrum, there is more freedom to explore your interests (of course, with more competition and higher requirements). But I don’t think the eventual impact you can make depends much on the part of the spectrum. All types of work have a great chance of making significant impact. The final decision depends on your own taste and the opportunities you can choose from.
To choose among job offers, the spectrum of the position is only one consideration. There are so many different factors to consider and the weight for each factor differs for each individual. So here I only list the factors I’m considering with a significant weight when making the final decision myself as a personal example.
Luckily, I am still early in my career and do not have a family yet. I can basically go anywhere I like to pursue my career which is indeed an edge not everyone possesses. I was also fortunate to work at Borealis AI for three years as my first job and have experienced both success and frustration which are all valuable lessons.
When I look back, personal growth and fruitful outputs seem very natural when
I’m especially fortunate to have had all these conditions met for some time. And when considering all my options, I always try my best to figure out how likely these conditions will be met. Ask about those things (of course other things may matter to you) frankly when talking to your potential future managers and colleagues. Once you have the answers, the decision should already be made in your heart.
Another thing to bear in mind is that things may change quickly and are usually out of your control, for example, re-organizations due to various considerations. The hope to keep things as they are is only an illusion. As the saying goes, the only constant in life is change. So stay adaptive: either adapt to the new environment or change/leave the environment. The willingness to change is indeed an edge no matter where you are in your life or career, so do not wait until reality forces you to change.
Finally, don’t feel shy or ashamed to ask for more compensation even if the position seems perfect for you. It’s the market that decides your fair value and it’s totally justified to get paid fairly. But also don’t only consider the compensation, especially when you are still early in your career and the absolute difference is not that much. My personal strategy is to make my decision first, ignoring compensation, and then to negotiate the best compensation I can once my cards are all on the table (interview performances, competing offers and so on).
Now, all is settled and I feel grateful for the whole process I’ve gone through. I’m not sure what my career will look like a few years from now, but at least now I’m thrilled to work with the amazing people at Google. Hopefully, I can also make my own contributions through AI to create a better future world. Let’s see how it develops in a few years!
Thanks for the referrals from Bo Chang (Brain), Peter J. Liu (Brain) and Victoria Lin (FAIR, though I didn’t interview with Facebook/Meta eventually). Thanks for the useful discussions and suggestions on career developments with Bo Chang (again!), Luyu Wang (DeepMind) and Jianmo Ni (Google Research). In addition, my gratitude to Samantha Lin (Uber) who helped broaden my product sense and Denny Zhou (Brain) who helped me connect to many different researchers inside Google during and after the team match phase. And my appreciation goes out to Rosanne Liu (Brain & ML Collective) and Simon Prince who provided valuable feedback to the writing of this post.
For $x \in \mathbf{dom} \enspace f$, the vector
$$\Delta x_{nt} = -\nabla^2 f(x)^{-1} \nabla f(x)$$
is called the Newton step (for $f$, at $x$).
The second-order Taylor approximation $\hat f$ of $f$ at $x$ is
\begin{equation} \hat f(x+v) = f(x) + \nabla f(x)^T v + \frac12 v^T \nabla^2 f(x) v. \tag{1} \label{eq:1} \end{equation}
which is a convex quadratic function of $v$, and is minimized when $v=\Delta x_{nt}$. Thus, the Newton step $\Delta x_{nt}$ is what should be added to the point $x$ to minimize the second-order approximation of $f$ at $x$.
The Newton step is also the steepest descent direction at $x$, for the quadratic norm defined by the Hessian $\nabla^2 f(x)$, i.e.,
$$\lVert u \rVert_{\nabla^2 f(x)} = (u^T \nabla^2 f(x) u)^{1/2}.$$
If we linearize the optimality condition $\nabla f(x^*)=0$ near $x$ we obtain
$$\nabla f(x+v) \approx \nabla f(x) + \nabla^2 f(x) v = 0,$$
which is a linear equation in $v$, with solution $v=\Delta x_{nt}$. So the Newton step $\Delta x_{nt}$ is what must be added to $x$ so that the linearized optimality condition holds.
An important feature of the Newton step is that it is independent of linear changes of coordinates. Suppose $T\in \mathbf{R}^{n \times n}$ is nonsingular, and define $\bar f(y)=f(Ty)$. Then we have
$$\nabla \bar f(y) = T^T \nabla f(x),$$
where $x=Ty$, and likewise $\nabla^2 \bar f(y) = T^T\nabla^2f(x)T$. The Newton step for $\bar f$ at $y$ is therefore
$$\Delta y_{nt} = -(T^T\nabla^2 f(x)T)^{-1}(T^T\nabla f(x)) = -T^{-1}\nabla^2 f(x)^{-1}\nabla f(x) = T^{-1}\Delta x_{nt},$$
where $\Delta x_{nt}$ is the Newton step for $f$ at $x$. Hence the Newton steps of $f$ and $\bar f$ are related by the same linear transformation, and
$$x + \Delta x_{nt} = T(y + \Delta y_{nt}).$$
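The quantity
$$\lambda(x) = (\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x))^{1/2}$$
is called the Newton decrement at $x$. We can relate the Newton decrement to the quantity $f(x) - \inf_y \hat f(y)$, where $\hat f$ is the second-order approximation of $f$ at $x$:
$$f(x) - \inf_y \hat f(y) = f(x) - \hat f(x+\Delta x_{nt}) = \frac12 \lambda(x)^2.$$
We can also express the Newton decrement as
$$\lambda(x) = (\Delta x_{nt}^T \nabla^2 f(x) \Delta x_{nt})^{1/2}.$$
This shows that $\lambda$ is the norm of the Newton step, in the quadratic norm defined by the Hessian.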
Newton’s method.
given a starting point $x \in \mathbf{dom} \enspace f$, tolerance $\epsilon > 0$.
repeat
- Compute the Newton step $\Delta x_{nt}$ and decrement $\lambda^2$.
- Stopping criterion. quit if $\lambda(x)^2/2 \le \epsilon$.
- Line search. Choose a step size $t > 0$ by backtracking line search.
- Update. $x := x+ t\Delta x_{nt}$.
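The loop above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the test function (a sum of three exponentials), the starting point, and the backtracking parameters `alpha` and `beta` are assumptions chosen for the example.

```python
import numpy as np

def _terms(x):
    # The three exponential terms of the (assumed) test function.
    a = np.exp(x[0] + 3 * x[1] - 0.1)
    b = np.exp(x[0] - 3 * x[1] - 0.1)
    c = np.exp(-x[0] - 0.1)
    return a, b, c

def f(x):
    return sum(_terms(x))

def grad(x):
    a, b, c = _terms(x)
    return np.array([a + b - c, 3 * a - 3 * b])

def hess(x):
    a, b, c = _terms(x)
    return np.array([[a + b + c, 3 * a - 3 * b],
                     [3 * a - 3 * b, 9 * a + 9 * b]])

def newton(x, eps=1e-10, alpha=0.1, beta=0.7, max_iter=50):
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)   # Newton step
        lam2 = -g @ dx                # decrement squared: g^T H^{-1} g
        if lam2 / 2 <= eps:           # stopping criterion
            break
        t = 1.0                       # backtracking line search
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta
        x = x + t * dx                # update
    return x

x_star = newton(np.array([-1.0, 1.0]))
print(x_star, f(x_star))
```

For this particular function the minimizer can be found by hand ($x_1 = -\tfrac12\log 2$, $x_2 = 0$), which makes it a convenient sanity check.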
Newton’s method has several very strong advantages over gradient and steepest descent methods:
The main disadvantage of Newton’s method is the cost of forming and storing the Hessian, and the cost of computing the Newton step, which requires solving a set of linear equations.
General descent method.
given a starting point $x \in \mathbf{dom} \enspace f$.
repeat
- Determine a descent direction $\Delta x$.
- Line search. Choose a step size $t > 0$.
- Update. $x := x+ t\Delta x$.
until stopping criterion is satisfied
Backtracking line search.
given a descent direction $\Delta x$ for $f$ at $x \in \mathbf{dom} f, \alpha \in(0,0.5), \beta\in(0,1)$.
starting at $t:=1$.
while $f(x+t\Delta x) > f(x) + \alpha t \nabla f(x)^T \Delta x$, $t:=\beta t$
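As a rough sketch, the general descent method with backtracking line search might look like the following in plain Python, using the negative gradient as the descent direction. The quadratic test function, the values `alpha=0.3` and `beta=0.8`, and the gradient-norm stopping rule are assumptions made for the illustration.

```python
def f(x):
    # Assumed test problem: an ill-conditioned quadratic with minimum at the origin.
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad(x):
    return [x[0], 10.0 * x[1]]

def backtracking(x, dx, g, alpha=0.3, beta=0.8):
    # Shrink t until f(x + t dx) <= f(x) + alpha * t * grad^T dx.
    t = 1.0
    slope = sum(gi * di for gi, di in zip(g, dx))
    while f([xi + t * di for xi, di in zip(x, dx)]) > f(x) + alpha * t * slope:
        t *= beta
    return t

def gradient_descent(x, tol=1e-8, max_iter=5000):
    for k in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 <= tol:   # stopping criterion
            break
        dx = [-gi for gi in g]                       # descent direction
        t = backtracking(x, dx, g)                   # line search
        x = [xi + t * di for xi, di in zip(x, dx)]   # update
    return x, k

x_min, iters = gradient_descent([1.0, 1.0])
print(x_min, iters)
```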
A natural choice for the search direction is the negative gradient $\Delta x = - \nabla f(x)$.
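We must have $f(x^{(k)}) - p^\ast \le \epsilon$ after at most
$$\frac{\log\left((f(x^{(0)})-p^\ast)/\epsilon\right)}{\log(1/c)}$$
iterations of the gradient method with exact line search, where $c=1-m/M<1$.
With backtracking line search the result is similar to exact line search, except that $c=1 - \min\{2m\alpha, 2\beta\alpha m/M\} < 1.$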
The first-order Taylor approximation of $f(x+v)$ around $x$ is
$$f(x+v) \approx \hat f(x+v) = f(x) + \nabla f(x)^T v.$$
The second term on the righthand side, $\nabla f(x)^T v$, is the directional derivative of $f$ at $x$ in the direction $v$. It gives the approximate change in $f$ for a small step $v$. The step $v$ is a descent direction if the directional derivative is negative.
Let $\lVert \cdot \rVert$ be any norm on $\mathbf{R}^n$. We define a normalized steepest descent direction as
\begin{equation} \Delta x_{nsd} = \arg\min\{\nabla f(x)^T v\ \vert\ \lVert v \rVert = 1\}. \tag{1}\label{eq:1} \end{equation}
It is also convenient to consider a steepest descent step $\Delta x_{sd}$ that is unnormalized, by scaling the normalized steepest descent direction in a particular way:
\begin{equation} \Delta x_{sd} = \lVert \nabla f(x) \rVert_\ast \Delta x_{nsd}, \tag{2}\label{eq:2} \end{equation}
where $\lVert \cdot \rVert_\ast$ denotes the dual norm. Note that for the steepest descent step, we have
$$\nabla f(x)^T \Delta x_{sd} = \lVert \nabla f(x) \rVert_\ast \nabla f(x)^T \Delta x_{nsd} = -\lVert \nabla f(x) \rVert_\ast^2.$$
To simplify the notation, we write $u=\nabla f(x)$ and look at the problem of solving $\min_v\{u^Tv\ \vert\ \lVert v \rVert \le 1\}$, which ends up being equivalent to finding the normalized steepest descent direction.
For the Euclidean norm, the Cauchy-Schwarz inequality gives $\lvert u^Tv\rvert \le \lVert u \rVert_2 \lVert v \rVert_2$, hence it is easy to see that the minimum is $\min_v\{u^Tv\ \vert\ \lVert v \rVert_2 \le 1\}=-\lVert u \rVert_2$, and the minimizer is $v=-u/\lVert u \rVert_2$. Since the Euclidean norm is its own dual, the steepest descent step is simply the negative gradient, i.e., $\Delta x_{sd} = - \nabla f(x)$.
We consider the quadratic norm
$$\lVert z \rVert_P = (z^TPz)^{1/2} = \lVert P^{1/2}z \rVert_2,$$
where $P \in \mathbf{S}_{++}^n$. The problem is now $\min_v\{u^Tv\ \vert\ \lVert P^{1/2}v\rVert_2\le1\}=\min_v\{u^Tv\ \vert\ \lVert\delta\rVert_2\le1, v=P^{-1/2}\delta\}$. This is equivalent to $\min_\delta\{((P^{-1/2})^Tu)^T\delta\ \vert\ \lVert\delta\rVert_2\le1\}$. By the Euclidean case, the minimum is $-\lVert (P^{-1/2})^Tu\rVert_2$, while the maximum $\lVert (P^{-1/2})^Tu\rVert_2$ is the dual norm of $u$ by definition. The minimizer is $v=P^{-1/2}\delta=-P^{-1}u/\lVert (P^{-1/2})^Tu\rVert_2$, so the steepest descent step is given by
$$\Delta x_{sd} = -P^{-1}\nabla f(x).$$
In addition, the steepest descent method in the quadratic norm $\lVert \cdot \rVert_P$ can be thought of as the gradient method applied to the problem after the change of coordinates $\bar x=P^{1/2}x$.
Let $i$ be any index for which $\lVert \nabla f(x) \rVert_\infty = \lvert (\nabla f(x))_i \rvert$. Then a normalized steepest descent direction $\Delta x_{nsd}$ for the $l_1$-norm is given by
$$\Delta x_{nsd} = -\mathbf{sign}\left(\frac{\partial f(x)}{\partial x_i}\right) e_i,$$
where $e_i$ is the $i$th standard basis vector. An unnormalized steepest descent step is then
$$\Delta x_{sd} = \Delta x_{nsd} \lVert \nabla f(x) \rVert_\infty = -\frac{\partial f(x)}{\partial x_i} e_i.$$
The steepest descent algorithm in the $l_1$-norm has a very natural interpretation: at each iteration we select a component of $\nabla f(x)$ with maximum absolute value, and then decrease or increase the corresponding component of $x$, according to the sign of $(\nabla f(x))_i$. The algorithm is sometimes called a coordinate-descent algorithm, since only one component of the variable $x$ is updated at each iteration.
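To make the coordinate-descent interpretation concrete, here is a small plain-Python sketch of steepest descent in the $l_1$-norm with backtracking line search. The quadratic test function, the tolerance, and the line search parameters are assumptions picked for the example.

```python
def f(x):
    # Assumed strongly convex test function; its minimizer is (16/7, -18/7).
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2 + x[0] * x[1]

def grad(x):
    return [2.0 * (x[0] - 1.0) + x[1], 4.0 * (x[1] + 2.0) + x[0]]

def l1_steepest_descent(x, tol=1e-6, alpha=0.3, beta=0.5, max_iter=10000):
    x = list(x)
    for _ in range(max_iter):
        g = grad(x)
        # Index of the gradient component with the largest absolute value.
        i = max(range(len(g)), key=lambda j: abs(g[j]))
        if abs(g[i]) <= tol:
            break
        step = -g[i]          # unnormalized step: -(df/dx_i) e_i
        slope = g[i] * step   # directional derivative along the step
        t = 1.0
        while True:           # backtracking line search along coordinate i
            trial = list(x)
            trial[i] += t * step
            if f(trial) <= f(x) + alpha * t * slope:
                break
            t *= beta
        x = trial             # only component i has changed
    return x

x_min = l1_steepest_descent([0.0, 0.0])
print(x_min)
```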
We consider the unconstrained optimization problem
\begin{equation} \text{minimize}\quad f(x) \tag{1} \label{eq:1} \end{equation}
where $f: \mathbf{R}^n \rightarrow \mathbf{R}$ is convex and twice continuously differentiable (which implies that $\mathbf{dom}\enspace f$ is open). We denote the optimal value $\inf_xf(x)=f(x^\ast)$ as $p^\ast$. Since $f$ is differentiable and convex, a necessary and sufficient condition for a point $x^\ast$ to be optimal is
\begin{equation} \nabla f(x^\ast)=0. \tag{2} \label{eq:2} \end{equation}
Thus, solving the unconstrained minimization problem \eqref{eq:1} is the same as finding a solution of \eqref{eq:2}, which is a set of $n$ equations in the $n$ variables $x_1, \dots, x_n$. Usually, the problem must be solved by an iterative algorithm. By this we mean an algorithm that computes a sequence of points $x^{(0)}, x^{(1)}, \dots \in \mathbf{dom}\enspace f$ with $f(x^{(k)})\rightarrow p^\ast$ as $k\rightarrow\infty$. The algorithm is terminated when $f(x^{(k)}) - p^\ast \le \epsilon$, where $\epsilon>0$ is some specified tolerance.
The starting point $x^{(0)}$ must lie in $\mathbf{dom}\enspace f$, and in addition the sublevel set
$$S = \{x \in \mathbf{dom}\enspace f \mid f(x) \le f(x^{(0)})\}$$
must be closed. This condition is satisfied for all $x^{(0)}\in\mathbf{dom}\enspace f$ if the function $f$ is closed.
Note: 1) Continuous functions with $\mathbf{dom}\enspace f=\mathbf{R}^n$ are closed; 2) Another important class of closed functions are continuous functions with open domains.
The general convex quadratic minimization problem has the form
$$\text{minimize}\quad (1/2)x^TPx + q^Tx + r,$$
where $P\in\mathbf{S}_+^n$, $q\in\mathbf{R}^n$, and $r\in\mathbf{R}$. This problem can be solved via the optimality conditions, $Px+q=0$, which is a set of linear equations.
One special case of the quadratic minimization problem that arises very frequently is the least-squares problem
$$\text{minimize}\quad \lVert Ax-b \rVert_2^2 = x^T(A^TA)x - 2(A^Tb)^Tx + b^Tb.$$
The optimality conditions
$$A^TAx^\ast = A^Tb$$
are called the normal equations of the least-squares problem.
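As a quick numerical illustration, the normal equations can be solved directly and compared against a library least-squares solver. The data `A` and `b` below are made up for the example; in practice a QR- or SVD-based solver such as `np.linalg.lstsq` is usually preferred over forming $A^TA$ explicitly.

```python
import numpy as np

# Hypothetical data: fit a line to four points.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.9, 5.1, 7.0])

# Solve the normal equations A^T A x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from NumPy's least-squares routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# At the optimum the residual is orthogonal to the columns of A.
residual_check = A.T @ (A @ x_normal - b)
print(x_normal, x_lstsq, residual_check)
```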
where the domain of $f$ is the open set
where $F: \mathbf{R}^n\rightarrow\mathbf{S}^p$ is affine. Here the domain of $f$ is
The objective function is strongly convex on $S$, which means that there exists an $m>0$ such that
\begin{equation} \nabla^2f(x) \succeq mI \tag{3} \label{eq:3} \end{equation}
for all $x\in S$. For $x, y \in S$ we have
$$f(y) = f(x) + \nabla f(x)^T(y-x) + \frac12 (y-x)^T\nabla^2 f(z)(y-x)$$
for some $z$ on the line segment $[x, y]$. By the strong convexity assumption \eqref{eq:3}, the last term on the righthand side is at least $(m/2)\lVert y-x\rVert^2_2$, so we have the inequality
\begin{equation} f(y) \ge f(x) + \nabla f(x)^T(y-x) + \frac{m}2\lVert y-x \rVert_2^2 \tag{4} \label{eq:4} \end{equation}
for all $x$ and $y$ in $S$.
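Setting the gradient of the righthand side of \eqref{eq:4} with respect to $y$ equal to zero, we find that $\tilde y = x-(1/m)\nabla f(x)$ minimizes the righthand side. Therefore we have
$$f(y) \ge f(x) + \nabla f(x)^T(\tilde y-x) + \frac{m}2\lVert \tilde y-x \rVert_2^2 = f(x) - \frac1{2m}\lVert \nabla f(x)\rVert_2^2.$$
Since this holds for any $y\in S$, we have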
\begin{equation} p^* \ge f(x) - \frac1{2m}\lVert \nabla f(x)\rVert_2^2 \tag{5} \label{eq:5} \end{equation}
This inequality shows that if the gradient is small at a point, then the point is nearly optimal.
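Apply \eqref{eq:4} with $y=x^\ast$ to obtain
$$p^\ast = f(x^\ast) \ge f(x) + \nabla f(x)^T(x^\ast-x) + \frac{m}2\lVert x^\ast-x \rVert_2^2 \ge f(x) - \lVert \nabla f(x) \rVert_2 \lVert x^\ast-x \rVert_2 + \frac{m}2\lVert x^\ast-x \rVert_2^2,$$
where we use the Cauchy-Schwarz inequality in the second inequality. Since $p^\ast \le f(x)$, we must have
$$-\lVert \nabla f(x) \rVert_2 \lVert x^\ast-x \rVert_2 + \frac{m}2\lVert x^\ast-x \rVert_2^2 \le 0.$$
Therefore, we have
\begin{equation} \lVert x - x^\ast \rVert_2 \le \frac2m\lVert \nabla f(x) \rVert_2. \tag{6}\label{eq:6} \end{equation}
If there are two optimal points $x^\ast_1, x^\ast_2$, then according to \eqref{eq:6},
$$\lVert x_1^\ast - x_2^\ast \rVert_2 \le \frac2m\lVert \nabla f(x_1^\ast) \rVert_2 = 0.$$
Hence $x_1^\ast = x_2^\ast$, i.e., the optimal point $x^\ast$ is unique.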
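There exists a constant $M$ such that
$$\nabla^2 f(x) \preceq MI$$
for all $x \in S$. This upper bound on the Hessian implies for any $x, y \in S$,
$$f(y) \le f(x) + \nabla f(x)^T(y-x) + \frac{M}2\lVert y-x \rVert_2^2;$$
minimizing each side over $y$ yields
$$p^\ast \le f(x) - \frac1{2M}\lVert \nabla f(x)\rVert_2^2.$$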
The ratio $\kappa=M/m$ is an upper bound on the condition number of the matrix $\nabla^2 f(x)$, i.e., the ratio of its largest eigenvalue to its smallest eigenvalue.
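We define the width of a convex set $C \subseteq \mathbf{R}^n$, in the direction $q$, where $\lVert q \rVert_2 = 1$, as
$$W(C,q) = \sup_{z\in C} q^Tz - \inf_{z\in C} q^Tz.$$
The minimum width and maximum width of $C$ are given by
$$W_{\min} = \inf_{\lVert q \rVert_2=1} W(C,q), \qquad W_{\max} = \sup_{\lVert q \rVert_2=1} W(C,q).$$
The condition number of the convex set $C$ is defined as
$$\mathbf{cond}(C) = \frac{W_{\max}^2}{W_{\min}^2}.$$
Suppose $f$ satisfies $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x\in S$. The condition number of the $\alpha$-sublevel set $C_\alpha=\{x \mid f(x) \le \alpha\}$, where $p^\ast < \alpha \le f(x^{(0)})$, is bounded by
$$\mathbf{cond}(C_\alpha) \le \frac{M}{m}.$$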
It must be kept in mind that the constants $m$ and $M$ are known only in rare cases, so they cannot be used in a practical stopping criterion.
Since $Y\in\{-1,1\}$, we can expand the expectation as follows:
In order to minimize the expectation, we set the derivative with respect to $f(x)$ to zero:
which gives:
If $Y=1$, then $Y'=1$, which gives
Likewise, if $Y=-1$, then $Y'=0$, which gives
As a result, the binomial log-likelihood loss is equivalent to the deviance. In the language of neural networks, the cross-entropy is equivalent to the softplus. The only difference is that $0$ is used to indicate negative examples in cross-entropy; while $-1$ is used in softplus.
This section explains the choice of loss functions for both classification and regression. It gives a very direct explanation of why squared loss is undesirable for classification. Highly recommended!
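Locally weighted regression solves a separate weighted least squares problem at each target point $x_0$:
$$\min_{\alpha(x_0),\beta(x_0)} \sum_{i=1}^N K_\lambda(x_0,x_i)\left[y_i-\alpha(x_0)-\beta(x_0)x_i\right]^2.$$
The estimate is $\hat f(x_0)=\hat\alpha(x_0)+\hat\beta(x_0)x_0$. Define the vector-valued function $b(x)^T=(1,x)$. Let $\mathbf{B}$ be the $N \times 2$ regression matrix with $i$th row $b(x_i)^T$, $\mathbf{W}(x_0)$ the $N\times N$ diagonal matrix with $i$th diagonal element $K_\lambda (x_0, x_i)$, and $\theta=(\alpha(x_0), \beta(x_0))^T$.
Then the above optimization problem can be rewritten as
$$\min_\theta\ (\mathbf{y}-\mathbf{B}\theta)^T\mathbf{W}(x_0)(\mathbf{y}-\mathbf{B}\theta).$$
Setting the derivative with respect to $\theta$ to zero, we get
$$\hat\theta = (\mathbf{B}^T\mathbf{W}(x_0)\mathbf{B})^{-1}\mathbf{B}^T\mathbf{W}(x_0)\mathbf{y}.$$
Then
$$\hat f(x_0) = b(x_0)^T\hat\theta = b(x_0)^T(\mathbf{B}^T\mathbf{W}(x_0)\mathbf{B})^{-1}\mathbf{B}^T\mathbf{W}(x_0)\mathbf{y} = \sum_{i=1}^N l_i(x_0)y_i.$$
It’s claimed that $\sum_{i=1}^Nl_i(x_0)=1$ and $\sum_{i=1}^N(x_i-x_0)l_i(x_0)=0$ in the book, so that the bias $\text{E}(\hat f(x_0))-f(x_0)$ depends only on quadratic and higher-order terms in the expansion of $f$. However, the proof is not given. Here I will give the detailed derivations of these two equations.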
First, define the following terms:
Then, we can represent the estimate as
When $y=\mathbf{1}$, $m_0=S_0$ and $m_1=S_1$, we get
When $y=\mathbf{x}-x_0$,
More generally, for a local polynomial fit of degree $d$, the same argument shows that $\sum_{i=1}^N(x_i-x_0)^pl_i(x_0)=0$ for $1\le p\le d$; for the local linear fit considered here, this covers only $p=1$.
We only prove the case when the input $x$ is one-dimensional. A similar strategy can be used to prove the case of high-dimensional input, which is a little more complicated; try it if you’re interested. Have fun!
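The two identities for the weights $l_i(x_0)$ can also be checked numerically. The sketch below builds the weights of a local linear fit from the formula $l(x_0)^T=b(x_0)^T(\mathbf{B}^T\mathbf{W}\mathbf{B})^{-1}\mathbf{B}^T\mathbf{W}$; the Epanechnikov kernel, the grid of inputs, and the bandwidth are assumptions chosen for the illustration.

```python
import numpy as np

def equivalent_kernel(x, x0, lam):
    # Epanechnikov kernel weights K_lambda(x0, x_i) (an assumed kernel choice).
    t = np.abs(x - x0) / lam
    w = np.where(t <= 1, 0.75 * (1 - t ** 2), 0.0)
    B = np.column_stack([np.ones_like(x), x])   # rows b(x_i)^T = (1, x_i)
    W = np.diag(w)
    b0 = np.array([1.0, x0])
    # l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W
    return b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)

x = np.linspace(0.0, 1.0, 21)
x0 = 0.3
l = equivalent_kernel(x, x0, lam=0.25)

zeroth = l.sum()                     # should be 1
first = ((x - x0) * l).sum()         # should be 0
second = ((x - x0) ** 2 * l).sum()   # generally nonzero for a local linear fit
print(zeroth, first, second)
```

The second moment does not vanish here, which is why the bias of a local linear fit still depends on the quadratic term of $f$.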
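Equation (5.11) in the book is
$$\text{RSS}(\theta,\lambda) = (\mathbf{y}-N\theta)^T(\mathbf{y}-N\theta) + \lambda\theta^T\Omega_N\theta.$$
Setting its derivative with respect to $\theta$ to zero, we get
$$-2N^T(\mathbf{y}-N\theta) + 2\lambda\Omega_N\theta = 0.$$
Putting the terms related to $\theta$ on one side and the others on the other side, we get
$$(N^TN+\lambda\Omega_N)\theta = N^T\mathbf{y}.$$
Multiplying both sides by the inverse of $N^TN+\lambda\Omega_N$ completes the derivation of Equation (5.12):
$$\hat\theta = (N^TN+\lambda\Omega_N)^{-1}N^T\mathbf{y}.$$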
It’s a little confusing to get Equation (5.18) directly from Equation (5.17) and its original form Equation (5.11). In order to give a clear explanation, here we give the proof of the equation,
which are the different terms between Equation (5.11) and Equation (5.18).
We know that
following Equation (5.14) in the book.
From Equation (5.17), we can get
Plugging the above two equations into the right-hand side of the equation to be proved, we get
which completes the proof.
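$f:\mathbb{R}^n \rightarrow \mathbb R$ is convex if $\mathbf{dom}\ f$ is a convex set and
$$f(\theta x+(1-\theta)y) \le \theta f(x) + (1-\theta)f(y)$$
for all $x,y\in \mathbf{dom}\ f, 0\le\theta\le1$
$f:\mathbb{R}^n \rightarrow \mathbb R$ is convex if and only if the function $g: \mathbb{R} \rightarrow \mathbb{R}$,
$$g(t) = f(x+tv), \qquad \mathbf{dom}\ g = \{t \mid x+tv\in\mathbf{dom}\ f\}$$
is convex (in $t$) for any $x \in \mathbf{dom}\ f, v\in\mathbb R^n$
extended-value extension $\tilde f$ of $f$ is
$$\tilde f(x) = f(x),\ x\in\mathbf{dom}\ f, \qquad \tilde f(x) = \infty,\ x\notin\mathbf{dom}\ f$$
1st-order condition: differentiable $f$ with convex domain is convex iff
$$f(y) \ge f(x) + \nabla f(x)^T(y-x) \quad \text{for all } x,y\in\mathbf{dom}\ f$$
2nd-order conditions: for twice differentiable $f$ with convex domain
- $f$ is convex if and only if $\nabla^2f(x)\succeq0$ for all $x\in\mathbf{dom}\ f$
- if $\nabla^2f(x)\succ0$ for all $x\in\mathbf{dom}\ f$, then $f$ is strictly convex
$\alpha$-sublevel set of $f: \mathbb R^n \rightarrow \mathbb R$:
$$C_\alpha = \{x\in\mathbf{dom}\ f \mid f(x)\le\alpha\}$$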
sublevel sets of convex functions are convex (converse is false)
If $f$ is concave, then its $\alpha$-superlevel set, given by $\{x\in\mathbf{dom}\ f\mid f(x)\ge\alpha\}$, is a convex set
epigraph of $f:\mathbb R^n \rightarrow \mathbb R$:
$$\mathbf{epi}\ f = \{(x,t)\in\mathbb R^{n+1} \mid x\in\mathbf{dom}\ f,\ f(x)\le t\}$$
$f$ is convex if and only if $\mathbf{epi}\ f$ is a convex set
$f$ is concave if and only if its hypograph, defined as
$$\mathbf{hypo}\ f = \{(x,t) \mid t\le f(x)\},$$
is a convex set
Jensen’s Inequality: if $f$ is convex, then for $0\le\theta\le1$,
$$f(\theta x+(1-\theta)y) \le \theta f(x) + (1-\theta)f(y)$$
extension: if $f$ is convex, then
$$f(\mathbf{E}\,z) \le \mathbf{E}\,f(z)$$
for any random variable $z$
nonnegative multiple: $\alpha f$ is convex if $f$ is convex, $\alpha \ge 0$
sum: $f_1+f_2$ convex if $f_1,f_2$ convex (extends to infinite sums, integrals)
composition with affine function: $f(Ax+b)$ is convex if $f$ is convex
pointwise maximum: if $f_1,\dots,f_m$ are convex, then $f(x)=\max\{f_1(x),\dots,f_m(x)\}$ is convex
pointwise supremum: if $f(x,y)$ is convex in $x$ for each $y\in\mathcal{A}$, then
$$g(x) = \sup_{y\in\mathcal{A}} f(x,y)$$
is convex
similarly, the pointwise infimum of a set of concave functions is a concave function
composition with scalar functions: composition of $g: \mathbb R^n \rightarrow \mathbb R$ and $h: \mathbb R\rightarrow \mathbb R$:
$$f(x) = h(g(x))$$
$f$ is convex if $g$ convex, $h$ convex, $\tilde h$ nondecreasing; $g$ concave, $h$ convex, $\tilde h$ nonincreasing
Note: monotonicity must hold for extended-value extension $\tilde h$
vector composition: composition of $g:\mathbb R^n \rightarrow \mathbb R^k$ and $h:\mathbb R^k \rightarrow \mathbb R$:
$$f(x) = h(g(x)) = h(g_1(x),\dots,g_k(x))$$
$f$ is convex if $g_i$ convex, $h$ convex, $\tilde h$ nondecreasing in each argument; $g_i$ concave, $h$ convex, $\tilde h$ nonincreasing in each argument
minimization: if $f(x,y)$ is convex in $(x,y)$ and $C$ is a convex set then
$$g(x) = \inf_{y\in C} f(x,y)$$
is convex
perspective: the perspective of a function $f:\mathbb R^n \rightarrow \mathbb R$ is the function $g:\mathbb R^n \times \mathbb R \rightarrow \mathbb R$,
$$g(x,t) = tf(x/t), \qquad \mathbf{dom}\ g = \{(x,t) \mid x/t\in\mathbf{dom}\ f,\ t>0\}$$
$g$ is convex if $f$ is convex
the conjugate of a function $f$ is
$$f^*(y) = \sup_{x\in\mathbf{dom}\ f}\left(y^Tx - f(x)\right)$$
conjugate of the conjugate: if $f$ is convex and closed, then $f^{**}=f$
differentiable functions: The conjugate of a differentiable function $f$ is also called the Legendre transform of $f$. Let $z\in\mathbb{R}^n$ be arbitrary and define $y=\nabla f(z)$, then we have
$$f^*(y) = z^T\nabla f(z) - f(z)$$
scaling and composition with affine transformation: For $a>0$ and $b\in\mathbb{R}$, the conjugate of $g(x)=af(x)+b$ is $g^*(y)=af^*(y/a)-b$.
Suppose $A\in\mathbb{R}^{n\times n}$ is nonsingular and $b\in\mathbb{R}^n$. Then the conjugate of $g(x)=f(Ax+b)$ is
$$g^*(y) = f^*(A^{-T}y) - b^TA^{-T}y,$$
with $\mathbf{dom}\ g^*=A^T\mathbf{dom}\ f^*$
sums of independent functions: if $f(u,v)=f_1(u)+f_2(v)$, where $f_1$ and $f_2$ are convex functions with conjugates $f_1^*$ and $f_2^*$, respectively, then
$$f^*(w,z) = f_1^*(w) + f_2^*(z)$$
$f:\mathbb R^n\rightarrow \mathbb R$ is quasiconvex if $\mathbf{dom}\ f$ is convex and the sublevel sets
$$S_\alpha = \{x\in\mathbf{dom}\ f \mid f(x)\le\alpha\}$$
are convex for all $\alpha$
modified Jensen inequality: for quasiconvex $f$
$$0\le\theta\le1 \implies f(\theta x+(1-\theta)y) \le \max\{f(x),f(y)\}$$
first-order condition: differentiable $f$ with convex domain is quasiconvex iff
$$f(y)\le f(x) \implies \nabla f(x)^T(y-x)\le0$$
operations that preserve quasiconvexity:
sums of quasiconvex functions are not necessarily quasiconvex
a positive function $f$ is log-concave if $\log f$ is concave:
$$f(\theta x+(1-\theta)y) \ge f(x)^\theta f(y)^{1-\theta} \quad \text{for } 0\le\theta\le1$$
$f$ is log-convex if $\log f$ is convex
properties of log-concave functions:
for all $x\in\mathbf{dom}\ f$
is log-concave
consequences of integration property:
is log-concave
$f:\mathbb{R}^n\rightarrow\mathbb{R}^m$ is $K$-convex if $\mathbf{dom}\ f$ is convex and
$$f(\theta x+(1-\theta)y) \preceq_K \theta f(x) + (1-\theta)f(y)$$
for $x,y\in\mathbf{dom}\ f,0\le\theta\le1$
Suppose $x_1\ne x_2$ are two points in $\mathbb{R}^n$.
line through $x_1$, $x_2$: all points
$$x = \theta x_1 + (1-\theta)x_2 \quad (\theta\in\mathbb{R})$$
affine set: contains the line through any two distinct points in the set
line segment between $x_1$ and $x_2$: all points
$$x = \theta x_1 + (1-\theta)x_2$$
with $0\leq\theta\leq1$
convex set: contains line segment between any two points in the set
convex combination of $x_1,\dots,x_k$: any point $x$ of the form
$$x = \theta_1x_1 + \theta_2x_2 + \dots + \theta_kx_k$$
with $\theta_1+\dots+\theta_k=1,\theta_i \geq 0$
convex hull of a set $C$, denoted $\mathbf{conv}\ C$: set of all convex combinations of points in $C$
conic combination of $x_1$ and $x_2$: any point of the form
$$x = \theta_1x_1 + \theta_2x_2$$
with $\theta_1 \geq 0, \theta_2 \geq 0$
convex cone: set that contains all conic combinations of points in the set
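hyperplane: set of the form $\{x\mid a^Tx=b\}$ $(a\ne0)$
halfspace: set of the form $\{x\mid a^Tx\leq b\}$ $(a\ne0)$
(Euclidean) ball with center $x_c$ and radius $r$:
$$B(x_c,r) = \{x \mid \lVert x-x_c\rVert_2\le r\} = \{x_c+ru \mid \lVert u\rVert_2\le1\}$$
ellipsoid: set of the form
$$\{x \mid (x-x_c)^TP^{-1}(x-x_c)\le1\}$$
with $P\in \mathbf{S}^n_{++}$ (i.e., P symmetric positive definite)
another representation: $\{x_c+Au\mid \lVert u\rVert_2\le1\}$ with $A$ square and nonsingular
norm: a function $\lVert \cdot \rVert$ that satisfies
- $\lVert x\rVert\ge0$; $\lVert x\rVert=0$ if and only if $x=0$
- $\lVert tx\rVert=\lvert t\rvert\lVert x\rVert$ for $t\in\mathbb{R}$
- $\lVert x+y\rVert\le\lVert x\rVert+\lVert y\rVert$
norm ball with center $x_c$ and radius $r$: $\{x\mid \lVert x-x_c\rVert\le r\}$
norm cone: $\{(x,t)\mid \lVert x\rVert\le t\}$
polyhedra: solution set of finitely many linear inequalities and equalities
$$Ax\preceq b, \qquad Cx=d$$
($A\in \mathbb{R}^{m\times n}$, $C\in\mathbb{R}^{p\times n}$, $\preceq$ is componentwise inequality)
positive semidefinite cone: $\mathbf{S}^n_+=\{X\in\mathbf{S}^n\mid X\succeq0\}$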
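intersection: the intersection of (any number of) convex sets is convex
affine function: suppose $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ is affine ($f(x)=Ax+b$ with $A\in\mathbb{R}^{m\times n}, b\in\mathbb{R}^m$)
- the image of a convex set under $f$ is convex
- the inverse image of a convex set under $f$ is convex
perspective function $P: \mathbb{R}^{n+1} \rightarrow \mathbb{R}^n$:
$$P(x,t) = x/t, \qquad \mathbf{dom}\ P = \{(x,t)\mid t>0\}$$
images and inverse images of convex sets under perspective are convex
linear-fractional function $f:\mathbb{R}^n \rightarrow \mathbb{R}^m$:
$$f(x) = \frac{Ax+b}{c^Tx+d}, \qquad \mathbf{dom}\ f = \{x\mid c^Tx+d>0\}$$
images and inverse images of convex sets under linear-fractional functions are convex
a convex cone $K\subseteq\mathbb{R}^n$ is a proper cone if
- $K$ is closed (contains its boundary)
- $K$ is solid (has nonempty interior)
- $K$ is pointed (contains no line)
generalized inequality defined by a proper cone $K$:
$$x\preceq_Ky \iff y-x\in K, \qquad x\prec_Ky \iff y-x\in\mathbf{int}\ K$$
$x\in S$ is the minimum element of $S$ with respect to $\preceq_K$ if
$$y\in S \implies x\preceq_Ky$$
$x\in S$ is a minimal element of $S$ with respect to $\preceq_K$ if
$$y\in S,\ y\preceq_Kx \implies y=x$$
separating hyperplane theorem: if $C$ and $D$ are disjoint convex sets, then there exists $a\ne0$, $b$ such that
$$a^Tx\le b \text{ for } x\in C, \qquad a^Tx\ge b \text{ for } x\in D$$
supporting hyperplane to set $C$ at boundary point $x_0$:
$$\{x\mid a^Tx=a^Tx_0\}$$
where $a\ne0$ and $a^Tx\le a^Tx_0$ for all $x\in C$
supporting hyperplane theorem: if $C$ is convex, then there exists a supporting hyperplane at every boundary point of $C$
dual cone of a cone $K$:
$$K^* = \{y\mid y^Tx\ge0 \text{ for all } x\in K\}$$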
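Dual cones satisfy several properties, such as:
- $K^*$ is closed and convex
- $K_1\subseteq K_2$ implies $K_2^*\subseteq K_1^*$
- if $K$ has nonempty interior, then $K^*$ is pointed
- if the closure of $K$ is pointed, then $K^*$ has nonempty interior
- $K^{**}$ is the closure of the convex hull of $K$
These properties show that if $K$ is a proper cone, then so is its dual $K^{*}$, and moreover, that $K^{**}=K$
dual cones of proper cones are proper, hence define generalized inequalities:
$$y\succeq_{K^*}0 \iff y^Tx\ge0 \text{ for all } x\succeq_K0$$
Some important properties relating a generalized inequality and its dual are:
- $x\preceq_Ky$ if and only if $\lambda^Tx\le\lambda^Ty$ for all $\lambda\succeq_{K^*}0$
- $x\prec_Ky$ if and only if $\lambda^Tx<\lambda^Ty$ for all $\lambda\succeq_{K^*}0$, $\lambda\ne0$
Since $K=K^{**}$, the dual generalized inequality associated with $\preceq_{K^{*}}$ is $\preceq_K$, so these properties hold if the generalized inequality and its dual are swapped
dual characterization of minimum element w.r.t. $\preceq_K$: $x$ is the minimum element of $S$ iff for all $\lambda \succ_{K^*}0$, $x$ is the unique minimizer of $\lambda^Tz$ over $z\in S$
dual characterization of minimal element w.r.t. $\preceq_K$:
- if $x$ minimizes $\lambda^Tz$ over $z\in S$ for some $\lambda\succ_{K^*}0$, then $x$ is minimal
- if $x$ is a minimal element of a convex set $S$, then there exists a nonzero $\lambda\succeq_{K^*}0$ such that $x$ minimizes $\lambda^Tz$ over $z\in S$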
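Suppose that each class’s density is multivariate Gaussian with a common covariance matrix $\Sigma$:
$$f_k(x) = \frac{1}{(2\pi)^{p/2}\lvert\Sigma\rvert^{1/2}}e^{-\frac12(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}.$$
Taking the logarithm of $f_k(x)$, we get
$$\log f_k(x) = c - \frac12(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) = c - \frac12x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}x - \frac12\mu_k^T\Sigma^{-1}\mu_k,$$
where $c = -\log [(2\pi)^{p/2}\lvert\Sigma\rvert^{1/2}]$ and $\mu_k^T\Sigma^{-1}x=x^T\Sigma^{-1}\mu_k$. Following the above formula, we can derive Equation (4.9) easily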
It’s stated in the book that the LDA classifier can be implemented by the following pair of steps:
However, a detailed explanation is not given in the book. Here, I give some of the skipped mathematical steps, which may help with understanding.
which shows that the covariance estimate of $X^*$ is the identity.
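Note that the classification for LDA is based on the linear discriminant functions
$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac12\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k,$$
which is Equation (4.10) in the book. Since the input $x$ is the same for each class, we can add back the term $-\frac12x^T\Sigma^{-1}x$, which was cancelled in the previous derivation, without changing the classification. Now the functions are turned into:
$$\delta_k(x) = -\frac12(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log\pi_k.$$
We know that $\Sigma=I$ in the transformed space, so $\delta_k(x)=-\frac12\lVert x-\mu_k\rVert_2^2+\log\pi_k$, where $\mu_k$ is the centroid of the $k$th class. Maximizing $\delta_k(x)$ therefore means classifying to the closest centroid, up to the effect of the prior $\pi_k$, which proves the claimed method.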
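The equivalence between the linear discriminant functions and nearest-centroid classification in the sphered space can be checked numerically. In the sketch below, the covariance matrix, class means, priors, and test points are all made-up values for illustration, and sphering uses the eigendecomposition $\Sigma=UDU^T$ with $x^*=D^{-1/2}U^Tx$.

```python
import numpy as np

# Hypothetical common covariance, class centroids, and priors.
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
mus = np.array([[0.0, 0.0],
                [2.0, 1.0],
                [-1.0, 2.0]])
pis = np.array([0.5, 0.3, 0.2])

def lda_direct(x):
    # Linear discriminant functions in the original space.
    Sinv = np.linalg.inv(Sigma)
    scores = [x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(p)
              for m, p in zip(mus, pis)]
    return int(np.argmax(scores))

# Sphering transform: Sigma = U D U^T, x* = D^{-1/2} U^T x.
d, U = np.linalg.eigh(Sigma)
A = np.diag(d ** -0.5) @ U.T

def lda_sphered(x):
    # Closest centroid in the sphered space, adjusted by the log priors.
    xs, ms = A @ x, mus @ A.T
    scores = -0.5 * np.sum((xs - ms) ** 2, axis=1) + np.log(pis)
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
points = 2.0 * rng.normal(size=(200, 2))
agree = all(lda_direct(x) == lda_sphered(x) for x in points)
print(agree)
```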
In the two-class case, $p_1(x;\beta)=p(x;\beta)$ and $p_2(x;\beta) = 1-p(x;\beta)$ where
$$p(x;\beta) = \frac{\exp(\beta^Tx)}{1+\exp(\beta^Tx)}.$$
The Equation (4.21) can be derived easily as follows,
Note that
Plug it into Equation (4.21), we get
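The least squares estimate of $\beta$ is given by the book’s Equation (3.6)
$$\hat\beta = (X^TX)^{-1}X^T\mathbf{y}.$$
From the previous post, we know that $\mathrm{E}(\mathbf{y})=X\beta$. As a result, we obtain
$$\mathrm{E}(\hat\beta) = (X^TX)^{-1}X^TX\beta = \beta.$$
Then, writing $\varepsilon=\mathbf{y}-X\beta$, we get
$$\hat\beta - \mathrm{E}(\hat\beta) = (X^TX)^{-1}X^T(\mathbf{y}-X\beta) = (X^TX)^{-1}X^T\varepsilon.$$
The variance of $\hat \beta$ is computed as
$$\mathrm{Var}(\hat\beta) = \mathrm{E}\left[(\hat\beta-\mathrm{E}(\hat\beta))(\hat\beta-\mathrm{E}(\hat\beta))^T\right] = (X^TX)^{-1}X^T\mathrm{Var}(\varepsilon)X(X^TX)^{-1}.$$
If we assume that the entries of $\mathbf{y}$ are uncorrelated and all have the same variance of $\sigma^2$, then $\mathrm{Var}(\varepsilon)=\sigma^2I_N$ and the above equation becomes
$$\mathrm{Var}(\hat\beta) = (X^TX)^{-1}\sigma^2.$$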
This completes the derivation of Equation (3.8).
There are a lot of statistical concepts in this part. It’s better to go through Chapter 6 and Chapter 10 of All of Statistics to get a taste of hypothesis tests and confidence intervals.
From my own viewpoint, the Z-score and F-statistic give a measure of whether the corresponding features are useful or not. They can be used within some feature selection methods. However, they’re not very useful in practice. The preferred feature selection methods are discussed in Section 3.3 in the book.
which completes the derivation of Equation (3.20).
Equation (3.22) shows that the expected quadratic error can be broken down into two parts as
$$\mathrm{E}(Y_0-\tilde f(x_0))^2 = \sigma^2 + \mathrm{E}(x_0^T\tilde\beta - f(x_0))^2 = \sigma^2 + \text{MSE}(\tilde f(x_0)).$$
The first error component $\sigma^2$ is unrelated to what model is used to describe our data. It cannot be reduced because it exists in the true data generation process. The second source of error, corresponding to the term $\text{MSE}(\tilde f(x_0))$, represents the error in the model and is under our control. By Equation (3.20), the mean squared error can be broken down into two terms: a model variance term and a model bias squared term. How to make these two terms as small as possible while considering the trade-offs between them is the central topic in the book.
The first thing that comes to my mind when I read this section is why we need this when we already have the ordinary least squares (OLS) estimate of $\beta$:
$$\hat\beta = (X^TX)^{-1}X^T\mathbf{y}.$$
It’s because we want to study how to obtain orthogonal inputs instead of correlated inputs, since orthogonal inputs have some nice properties.
Following Algorithm 3.1, we can transform the correlated inputs $\mathbf{x}$ to the orthogonal inputs $\mathbf{z}$. Another view is that we form an orthogonal basis by performing the Gram-Schmidt orthogonalization procedure on $X$’s column vectors and obtain an orthogonal basis $\{\mathbf{z}_i\}_{i=1}^p$. With this basis, linear regression can be done simply as in the univariate case as shown in Equation (3.28):
$$\hat\beta_p = \frac{\langle \mathbf{z}_p, \mathbf{y}\rangle}{\langle \mathbf{z}_p, \mathbf{z}_p\rangle}.$$
Following this equation, we can derive Equation (3.29):
$$\mathrm{Var}(\hat\beta_p) = \frac{\mathbf{z}_p^T\mathrm{Var}(\mathbf{y})\mathbf{z}_p}{\langle \mathbf{z}_p, \mathbf{z}_p\rangle^2} = \frac{\sigma^2}{\lVert \mathbf{z}_p\rVert^2}.$$
We can write the Gram-Schmidt result in matrix form using the QR decomposition as
$$X = QR.$$
In this decomposition $Q$ is a $N\times(p+1)$ matrix with orthonormal columns and $R$ is a $(p+1)\times(p+1)$ upper triangular matrix. In this representation, the OLS estimate for $\beta$ can be written as
$$\hat\beta = R^{-1}Q^T\mathbf{y},$$
which is Equation (3.32) in the book. Following this equation, the fitted value $\mathbf{\hat y}$ can be written as
$$\mathbf{\hat y} = QQ^T\mathbf{y},$$
which is Equation (3.33) in the book.
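The QR expressions for the OLS estimate and fitted values can be verified numerically against the normal-equations solution. The data below are randomly generated for illustration only; `np.linalg.qr` returns the reduced factorization by default, which matches the shapes described above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3

# Hypothetical design matrix with an intercept column, and noisy responses.
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

Q, R = np.linalg.qr(X)                   # Q: N x (p+1), R: (p+1) x (p+1)
beta_qr = np.linalg.solve(R, Q.T @ y)    # beta = R^{-1} Q^T y
y_hat_qr = Q @ (Q.T @ y)                 # y_hat = Q Q^T y

# Reference: the normal-equations form (X^T X)^{-1} X^T y.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_qr, beta_ne)
```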
If we compute the singular value decomposition (SVD) of the $N\times p$ centered data matrix $X$ as
$$X = UDV^T,$$
where $U$ is a $N \times p$ matrix with orthonormal columns that span the column space of $X$, $V$ is a $p \times p$ orthogonal matrix, and $D$ is a $p \times p$ diagonal matrix with elements $d_j$ ordered such that $d_1\ge d_2 \ge \dots \ge d_p \ge 0$. From this representation of $X$ we can derive a simple expression for $X^TX$:
$$X^TX = VDU^TUDV^T = VD^2V^T,$$
which is the Equation (3.48) in the book. Using this expression, we can compute the least squares fitted values as
$$X\hat\beta^{\text{ls}} = X(X^TX)^{-1}X^T\mathbf{y} = UDV^T(VD^2V^T)^{-1}VDU^T\mathbf{y} = UU^T\mathbf{y},$$
which is the Equation (3.46) in the book. Similarly, we can find solutions for ridge regression as
$$X\hat\beta^{\text{ridge}} = X(X^TX+\lambda I)^{-1}X^T\mathbf{y} = UD(D^2+\lambda I)^{-1}DU^T\mathbf{y} = \sum_{j=1}^p u_j\frac{d_j^2}{d_j^2+\lambda}u_j^T\mathbf{y},$$
which is the Equation (3.47) in the book. Since we can estimate the sample variance by $X^TX/N$, the variance of $\mathbf{z}_1=Xv_1$ can be derived as follows:
$$\mathrm{Var}(\mathbf{z}_1) = \mathrm{Var}(Xv_1) = v_1^T\frac{X^TX}{N}v_1 = \frac{v_1^TVD^2V^Tv_1}{N} = \frac{d_1^2}{N},$$
which is the Equation (3.49) in the book. Note that $v_1$ is the first column of $V$ and $V$ is orthogonal, so that $V^Tv_1$ is $[1,0, \dots, 0]^T$.
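The degrees-of-freedom of the fitted vector $\mathbf{\hat y}=(\hat y_1, \dots, \hat y_N)$ is defined as
$$\text{df}(\mathbf{\hat y}) = \frac1{\sigma^2}\sum_{i=1}^N\text{Cov}(\hat y_i, y_i)$$
in the book. Also, it’s claimed that $\text{df}(\mathbf{\hat y})$ is $k$ for ordinary least squares regression and $\text{tr}(\mathbf{S}_{\lambda})$ for ridge regression without proof in the book. Here, we’ll derive these two expressions. First, we define $e_i$ as a $N$-element vector of all zeros with a one in the $i$th spot. It’s easy to see that $\hat y_i=e_i^T\mathbf{\hat y}$ and $y_i=e_i^T\mathbf{y}$, so that
$$\text{Cov}(\hat y_i, y_i) = e_i^T\text{Cov}(\mathbf{\hat y}, \mathbf{y})e_i.$$
For OLS regression, we have $\mathbf{\hat y}=X(X^TX)^{-1}X^T\mathbf{y}$, so the above expression for $\text{Cov}(\mathbf{\hat y}, \mathbf{y})$ becomes
$$\text{Cov}(\mathbf{\hat y}, \mathbf{y}) = X(X^TX)^{-1}X^T\text{Cov}(\mathbf{y}, \mathbf{y}) = \sigma^2X(X^TX)^{-1}X^T.$$
Thus,
$$\text{Cov}(\hat y_i, y_i) = \sigma^2e_i^TX(X^TX)^{-1}X^Te_i = \sigma^2x_i^T(X^TX)^{-1}x_i,$$
where $x_i=X^Te_i$ is the $i$th row of $X$ or $i$th sample’s feature vector. According to the given formula, we get
$$\text{df}(\mathbf{\hat y}) = \sum_{i=1}^Nx_i^T(X^TX)^{-1}x_i = \sum_{i=1}^N\text{tr}\left((X^TX)^{-1}x_ix_i^T\right) = \text{tr}\left((X^TX)^{-1}\sum_{i=1}^Nx_ix_i^T\right).$$
If you’re not familiar with the basic properties of the trace, you can refer to this page. Note that
$$\sum_{i=1}^Nx_ix_i^T = X^TX.$$
Thus, when there are $k$ predictors we get
$$\text{df}(\mathbf{\hat y}) = \text{tr}\left((X^TX)^{-1}X^TX\right) = \text{tr}(I_k) = k,$$
the claimed result for OLS in the book. Similarly for ridge regression,
$$\text{df}(\lambda) = \text{tr}\left(X(X^TX+\lambda I)^{-1}X^T\right) = \text{tr}(\mathbf{S}_\lambda) = \sum_{j=1}^p\frac{d_j^2}{d_j^2+\lambda},$$
which is the Equation (3.50) in the book.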
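Both degrees-of-freedom results are easy to sanity-check numerically: the trace of the OLS hat matrix should equal the number of predictors, and the trace of the ridge smoother should match the sum over singular values. The random design matrix and the value of `lam` below are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k, lam = 40, 5, 3.0

# Hypothetical centered design matrix.
X = rng.normal(size=(N, k))
X = X - X.mean(axis=0)

# Hat matrices: X (X^T X)^{-1} X^T and X (X^T X + lam I)^{-1} X^T.
H_ols = X @ np.linalg.solve(X.T @ X, X.T)
H_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(k), X.T)

d = np.linalg.svd(X, compute_uv=False)   # singular values d_j

df_ols = np.trace(H_ols)
df_ridge = np.trace(H_ridge)
df_ridge_svd = np.sum(d ** 2 / (d ** 2 + lam))
print(df_ols, df_ridge, df_ridge_svd)
```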
The expected prediction error (EPE) under the squared error loss:
$$\mathrm{EPE}(\beta) = \int (y-x^T\beta)^2\Pr(dx,dy).$$
Taking derivatives with respect to $\beta$:
$$\frac{\partial\,\mathrm{EPE}}{\partial\beta} = -2\int (y-x^T\beta)x\Pr(dx,dy).$$
In order to minimize the EPE, we set the derivative to zero, i.e., $\mathrm{E}(yx)-\mathrm{E}(xx^T)\beta=0$, which gives Equation (2.16):
$$\beta = [\mathrm{E}(XX^T)]^{-1}\mathrm{E}(XY).$$
Note: $x^T\beta$ is a scalar, and $\beta$ is a constant.
There are $N$ $p$-dimensional data points $x_1,\dots, x_N$, that is, $N\times p$ dimensions in total. Let $r_i=\Vert x_i \Vert$. Without loss of generality, we assume that $A < r_1 < \dots < r_N < 1$. Let $U(A)$ be the region of all possible sampled data which meet the assumption:
The goal is to find $A$ such that $U(A)=\frac12U(0)$. It turns out to be an integration problem on a $N \times p$ dimensional space.
With some mathematical techniques (which overwhelm me), we can get $U(A)=(1-A^p)^N$. Then $U(0)=1$. Solving $(1-A^p)^N=1/2$, we obtain Equation (2.24):
$$d(p,N) = \left(1-\left(\frac12\right)^{1/N}\right)^{1/p}.$$
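Solving $(1-A^p)^N=1/2$ for $A$ gives $A=\left(1-(1/2)^{1/N}\right)^{1/p}$, which is easy to evaluate. The short sketch below (the function name is made up for illustration) computes it for $p=10$ and $N=500$, the setting for which the book quotes a median distance of about $0.52$.

```python
def median_nearest_distance(p, N):
    # Solves (1 - A^p)^N = 1/2 for A.
    return (1 - 0.5 ** (1 / N)) ** (1 / p)

d = median_nearest_distance(10, 500)
print(round(d, 2))  # 0.52
```

In other words, with 500 points in a 10-dimensional unit ball, the point closest to the origin is typically more than halfway to the boundary.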
The variation is over all training sets $\mathcal{T}$, and over all values of $y_0$, while keeping $x_0$ fixed. Note that $x_0$ and $y_0$ are chosen independently of $\mathcal{T}$ and so the expectations commute: $\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}=\mathrm{E}_{\mathcal{T}}\mathrm{E}_{y_0 \vert x_0}$. Also $\mathrm{E}_\mathcal{T}=\mathrm{E}_\mathcal{X}\mathrm{E}_{\mathcal{Y \vert X}}$.
To make the derivation easier to follow, here are some definitions:
$y_0-\hat y_0$ can be written as the sum of three terms:
$$y_0-\hat y_0 = (y_0 - x_0^T\beta) - (\hat y_0 - \mathrm{E}_\mathcal{T}\hat y_0) - (\mathrm{E}_\mathcal{T}\hat y_0 - x_0^T\beta) = U_1 - U_2 - U_3.$$
Following the above definitions, we have $U_1=\varepsilon$, $U_3=0$. In addition, clearly we have $\mathrm{E}_\mathcal{T}U_2=0$. When squaring $U_1-U_2-U_3$, we can eliminate all three cross terms and one squared term $U_3^2$.
Following the definition of variance, we have: $\mathrm{E}_{y_0\vert x_0}\mathrm{E}_\mathcal{T}U_1^2=\mathrm{Var}(\varepsilon)=\sigma^2$ and $\mathrm{E}_\mathcal{T}(\hat y_0 - \mathrm{E}_\mathcal{T}\hat y_0)^2=\mathrm{Var}_\mathcal{T}(\hat y_0)$.
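Since $U_2=\sum_{i=1}^Nl_i(x_0)\varepsilon_i$, we have $\mathrm{Var}_\mathcal{T}(\hat y_0)=\mathrm{E}_\mathcal{T}U_2^2$ as
$$\mathrm{E}_\mathcal{T}U_2^2 = \mathrm{E}_\mathcal{T}\,x_0^T(X^TX)^{-1}X^T\varepsilon\varepsilon^TX(X^TX)^{-1}x_0.$$
Since $\mathrm{E}_\mathcal{T}\varepsilon\varepsilon^T=\sigma^2I_N$, this is equal to $\mathrm{E}_\mathcal{T}x_0^T(X^TX)^{-1}x_0\sigma^2$. This completes the derivation of Equation (2.27).
Under the conditions stated by the authors, $X^TX/N$ is then approximately equal to $\mathrm{Cov}(X)=\mathrm{Cov}(x_0)$. Applying $\mathrm{E}_{x_0}$ to $\mathrm{E}_\mathcal{T}x_0^T(X^TX)^{-1}x_0\sigma^2$, we obtain (approximately)
$$\mathrm{E}_{x_0}\mathrm{EPE}(x_0) \sim \mathrm{E}_{x_0}x_0^T\mathrm{Cov}(X)^{-1}x_0\sigma^2/N + \sigma^2 = \mathrm{trace}[\mathrm{Cov}(X)^{-1}\mathrm{Cov}(x_0)]\sigma^2/N + \sigma^2 = \sigma^2(p/N) + \sigma^2.$$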
This completes the derivation of Equation (2.28).
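To understand the two theorems above more precisely, we need convergence in the probabilistic sense rather than convergence as defined in calculus (a sequence of real numbers $x_n$ converges to a limit $x$ if, for every $\epsilon>0$, $\lvert x_n -x \rvert < \epsilon$ for all sufficiently large $n$).
In statistics, there are two main types of convergence:
Let $X_1,X_2,\dots$ be a sequence of random variables and let $X$ be another random variable. Let $F_n$ denote the cumulative distribution function (CDF) of $X_n$ and $F$ the CDF of $X$. 1) $X_n$ converges to $X$ in probability, written $X_n\xrightarrow[]{P} X$, if for every $\epsilon>0$
$$\mathbb{P}(\lvert X_n - X\rvert > \epsilon) \rightarrow 0$$
as $n\rightarrow\infty$. 2) $X_n$ converges to $X$ in distribution, written $X_n\rightsquigarrow X$, if
$$\lim_{n\rightarrow\infty}F_n(t) = F(t)$$
at every point $t$ at which $F$ is continuous.
In addition, $X_n\xrightarrow[]{P}X$ implies $X_n\rightsquigarrow X$.
P.S. There are actually two other types of convergence, and the relationships among them are more complicated. Since the focus here is on the two theorems, this part is kept brief; see a textbook for more details.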
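Let $X_1,X_2,\dots$ be independent and identically distributed (IID) samples, and let $\mu=\mathbb{E}(X_1)$, $\sigma^2=\mathbb{V}(X_1)$. Let $\bar X_n = n^{-1}\sum_{i=1}^nX_i$ be the sample mean, so that $\mathbb{E}(\bar X_n)=\mu$ and $\mathbb{V}(\bar X_n)=\sigma^2/n$.
Weak law of large numbers (WLLN): if $X_1,\dots, X_n$ are IID, then $\bar X_n \xrightarrow[]{P} \mu$.
Interpretation: as $n$ grows, the distribution of $\bar X_n$ becomes more and more concentrated around $\mu$.
Central limit theorem (CLT): if $X_1,\dots,X_n$ are IID with mean $\mu$ and variance $\sigma^2$, then
$$Z_n \equiv \frac{\bar X_n-\mu}{\sqrt{\mathbb{V}(\bar X_n)}} = \frac{\sqrt n(\bar X_n-\mu)}{\sigma} \rightsquigarrow Z,$$
where $Z \sim N(0,1)$, i.e., the standard normal distribution.
Interpretation: probability statements about $\bar X_n$ can be approximated using a normal distribution. Note that it is the probability statements that are approximated, not the random variable itself.
However, most of the time we do not know $\sigma$. In practice we can replace $\sigma$ with the sample standard deviation $S_n$, where $S_n^2=\frac1{n-1}\sum_{i=1}^n(X_i-\bar X_n)^2$.
Theorem: under the same conditions as the CLT,
$$\frac{\sqrt n(\bar X_n-\mu)}{S_n} \rightsquigarrow N(0,1).$$
To apply the central limit theorem more broadly and effectively, it is well worth mastering the Delta method.
Delta method: suppose that
$$\frac{\sqrt n(Y_n-\mu)}{\sigma} \rightsquigarrow N(0,1)$$
and that $g$ is a differentiable function with $g'(\mu)\ne0$. Then
$$\frac{\sqrt n(g(Y_n)-g(\mu))}{\lvert g'(\mu)\rvert\sigma} \rightsquigarrow N(0,1).$$
P.S. The strong law of large numbers, the multivariate central limit theorem, and the multivariate Delta method are omitted because they are more involved. Also, out of laziness, I have not written up examples that would aid understanding; treat this purely as a record of my learning process.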
]]>假如我们用神经网络在MNIST数据集上训练了一个分类器,我们在测试集上得到了一个错误率,比如$0.05$。那么这是否意味着我们可以保证我们的神经网络一定能达到$95\%$的正确率呢?显然训练一次得出的结果是不可靠的。那么,我们有多大的把握(概率)来相信这一观察到的错误率呢?
这时候就需要一些统计的语言了:假设我们有$n$个测试样本,每个测试样本分类的正确与否都是一个随机变量$X_1,\dots,X_n$。如果分类错误$X_i=1$,否则$X_i=0$。显而易见,$\bar X_n=n^{-1}\sum_{i=1}^nX_i$就是观察到的错误率。我们可以把每个$X_i$当做一个均值为$p$服从Bernoulli分布的随机变量,从而$p$就是真正(但是永远无法准确知晓)的错误率。从我们的角度来看,我们希望$\bar X_n$应该接近$p$。那么$\bar X_n$和$p$的概率超过一个固定值$\epsilon$的概率有多大呢? 这个概率就是$\mathbb{P}(\lvert\bar X_n -p\rvert > \epsilon)$,通常我们很难直接计算出它的值,这时我们就需要不等式来给这个概率设定一些边界 (bound)。
我们的第一个不等式就是马尔科夫不等式 (Markov’s inequality):
令$X$为一个非负随机变量并假设$\mathbb{E}(X)$存在。对任何$t>0$,
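At first glance, Markov's inequality does not seem directly applicable to bounding $\mathbb{P}(\lvert\bar X_n -p\rvert > \epsilon)$. With a small transformation, however, we obtain another inequality that can be used directly, Chebyshev's inequality:
Let $\mu = \mathbb{E}(X)$ and $\sigma^2=\mathbb{V}(X)$. Then
$$\mathbb{P}(\lvert X-\mu\rvert \ge t) \le \frac{\sigma^2}{t^2}.$$
This follows directly from Markov's inequality applied to the non-negative random variable $\lvert X-\mu\rvert^2$:
$$\mathbb{P}(\lvert X-\mu\rvert \ge t) = \mathbb{P}(\lvert X-\mu\rvert^2 \ge t^2) \le \frac{\mathbb{E}(X-\mu)^2}{t^2} = \frac{\sigma^2}{t^2}.$$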
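Since each $X_i$ follows a Bernoulli distribution, $\mathbb{V}(\bar X_n)=\mathbb{V}(X_i)/n=p(1-p)/n$, and therefore
$$\mathbb{P}(\lvert\bar X_n-p\rvert>\epsilon) \le \frac{\mathbb{V}(\bar X_n)}{\epsilon^2} = \frac{p(1-p)}{n\epsilon^2} \le \frac{1}{4n\epsilon^2}.$$
The last step uses the fact that $p(1-p)\le\frac14$ for any $0<p<1$. If we want the probability that the gap between the true and the observed error rate exceeds $\epsilon=0.05$ to be at most $0.05$, then solving $\frac{1}{4n\epsilon^2}\le0.05$ shows that we need about $n=2000$ test samples. See how inequalities become useful?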
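Chebyshev's inequality gives only a relatively crude bound; there are many sharper inequalities, which naturally come with additional conditions. Here is a sharper one, Hoeffding's inequality:
Let $X_1,\dots,X_n \sim \mathrm{Bernoulli}(p)$. For any $\epsilon>0$,
$$\mathbb{P}(\lvert\bar X_n-p\rvert>\epsilon) \le 2e^{-2n\epsilon^2},$$
where $\bar X_n = n^{-1}\sum_{i=1}^nX_i$.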
Using this inequality, a quick calculation shows that $738$ test samples are in fact enough.
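Both sample sizes can be reproduced with a few lines of code. The sketch below assumes the Chebyshev bound in the form $\frac{1}{4n\epsilon^2}$ and the Bernoulli form of Hoeffding's bound, $2e^{-2n\epsilon^2}$, and solves each for the smallest $n$ that keeps the bound below a target probability. The function names and the parameter name `delta` are illustrative only.

```python
import math
from fractions import Fraction

def chebyshev_sample_size(eps, delta):
    """Smallest n with 1 / (4 n eps^2) <= delta (Chebyshev, using p(1-p) <= 1/4)."""
    # Exact rational arithmetic avoids floating-point rounding right at the boundary.
    eps, delta = Fraction(str(eps)), Fraction(str(delta))
    return math.ceil(1 / (4 * delta * eps**2))

def hoeffding_sample_size(eps, delta):
    """Smallest n with 2 exp(-2 n eps^2) <= delta (Hoeffding for Bernoulli variables)."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

n_chebyshev = chebyshev_sample_size(0.05, 0.05)
n_hoeffding = hoeffding_sample_size(0.05, 0.05)
print(n_chebyshev, n_hoeffding)  # 2000 738
```

For the same tolerance and failure probability, the exponential Hoeffding bound needs well under half the test samples that Chebyshev's bound asks for.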
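P.S. The version of Hoeffding's inequality above is the special case for Bernoulli variables; for the general statement, see Wikipedia or a textbook.
This small example gives an intuitive sense of what inequalities are good for. Their real home, though, is in derivations and proofs. I automatically skip papers that are wall-to-wall formulas with one bound after another, and what I am working on now leans toward applications. But who knows whether I will need them often in future research? At the very least, I have now learned how to estimate a reliable test set size.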
Year | Candidates interviewed | Candidates hired | Unqualified hires | Unqualified rate |
---|---|---|---|---|
2014 | 1000 | 350 | 30 | 8.57% |
2015 | 1000 | 650 | 10 | 1.54% |
2016 | 1000 | 200 | 10 | 5.00% |
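So is this company's hiring strategy effective?
Clearly, the company's hiring standard judges whether a candidate is qualified by the number of problems solved. In statistical terms, whether each candidate is hired is a random variable $X_i$ whose sample space is $\{0, 1\}$, where $0$ means unqualified and $1$ means qualified. These random variables all follow a probability distribution $p(x;\theta)$ determined by the minimum number of problems solved, $\theta$. In mathematical notation, $X_1,\dots,X_n \sim p(x;\theta)$. The distribution assumed by the company's strategy is idealized (unrealistic): for a given $\theta=\theta_0$, the distribution $p(x;\theta_0)$ is given by the following table: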
$p(x;\theta)$ | $x=0$ | $x=1$ |
---|---|---|
$\theta<\theta_0$ | 1 | 0 |
$\theta\ge\theta_0$ | 0 | 1 |
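To judge the effectiveness of the hiring strategy (the assumed distribution $p$), we form hypotheses about the parameter of the distribution, the minimum number of problems solved $\theta$. First, we state a null hypothesis $H_0: \theta=\theta_0$ and an alternative hypothesis $H_1: \theta \ne \theta_0$. The question hypothesis testing asks is not whether $H_0$ is right or wrong, but whether we have enough evidence to show that $H_0$ is wrong.
Where do we find the evidence? In the observed data, of course. Whenever we observe a set of data, we need to choose a statistic to support our decision. For hiring, the most intuitive one is the unqualified rate $1-\bar X$. We then need a criterion for when to reject $H_0$, that is, our minimum expectation: for example, reject $H_0$ if the unqualified rate exceeds $c$.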
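The power function $\beta(\theta)$ is the probability of rejecting $H_0$ for a given $\theta$. For the hiring problem, $\beta(\theta)=P_\theta(1-\bar X >c)$. The significance level $\alpha$ is the largest value of $\beta(\theta)$ allowed when $H_0$ holds. A bit convoluted, isn't it? To summarize, $\alpha$ is "the maximum probability that the unqualified rate exceeds $c$ under the current hiring strategy". Since in the hiring problem $\theta$ takes only one value, $\theta_0$, under $H_0$, we have $\alpha=\beta(\theta_0)$. Note that $\alpha$ is determined by $c$, which is set by hand in advance.
The p-value $p$ is the smallest $\alpha$ at which we would reject $H_0$; in other words, we reject $H_0$ whenever $\alpha > p$. For the hiring problem, if we require that the unqualified rate be no higher than $5\%$, then $c=0.05$ and $\alpha=\beta(\theta_0)=P_{\theta_0}(1-\bar X > 0.05)$. When the probability that the unqualified rate exceeds $5\%$ is greater than $p$, we consider the current hiring strategy ineffective. So the smaller $p$ is, the stronger the evidence against $H_0$. This is the origin of the academic convention that a result with a $p$-value of $0.05$ can be regarded as statistically significant. Of course, do not mistake $p$ for $P(H_0)$, the probability that the hiring strategy is effective ($H_0$ is true).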
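The p-value also has a rather unreliable property: when $H_0$ is actually true, $p$ follows the uniform distribution on $(0,1)$ (for a test statistic with a continuous distribution). That is, even a very small $p$-value does not necessarily mean that $H_0$ is false; it may simply have occurred by chance. That really is unreliable, so no wonder it attracts so much criticism.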
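The uniformity of the p-value under $H_0$ is easy to see in simulation. The sketch below does not model the hiring problem; as a stand-in, it assumes a two-sided z-test of $H_0: \mu=0$ on 30 samples from $N(0,1)$ with known $\sigma=1$, repeated 5000 times with a fixed seed. All of these settings are illustrative assumptions.

```python
import math
import random

def two_sided_p_value(sample, mu0=0.0, sigma=1.0):
    """p-value of a two-sided z-test of H0: mean == mu0, with known sigma."""
    n = len(sample)
    z = math.sqrt(n) * (sum(sample) / n - mu0) / sigma
    # P(|Z| > |z|) for Z ~ N(0, 1), via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(0)  # fixed seed so the run is repeatable
# H0 is true in every repetition: the data really have mean 0.
p_values = [
    two_sided_p_value([rng.gauss(0.0, 1.0) for _ in range(30)])
    for _ in range(5000)
]

frac_below_005 = sum(p < 0.05 for p in p_values) / len(p_values)
frac_below_050 = sum(p < 0.50 for p in p_values) / len(p_values)
print(frac_below_005, frac_below_050)
```

If the p-values are uniform, about $5\%$ of them fall below $0.05$ and about half fall below $0.5$, even though $H_0$ holds in every repetition. A small p-value on its own therefore does not prove that $H_0$ is false.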
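Finally, back to the question of whether the hiring strategy is effective. From $P_{\theta_0}(1-\bar X > 0.05)$ we get $\alpha=0.333$ (one of the three observed years has an unqualified rate above $5\%$). However, with only three observations, the $p$-value is of little use in this problem. The same is true of many other problems. In short, although the p-value is widely used, in most cases it is of little practical use.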
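In addition, creating intelligent machines would raise a series of practical problems involving ethics, law, employment, and so on: for example, assigning liability for accidents involving self-driving cars, or mass unemployment as machines replace workers.
In short, rather than creating a new race of "intelligent machines" equivalent to humans, we would do better to think about how to use artificial intelligence to assist humans and to steer human evolution toward greater capability. I believe this will be a trend in artificial intelligence for the foreseeable future.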
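Although the book was written in 1994, many of its ideas coincide with the way society and technology have developed since. After the first few chapters, what struck me most was a cluster of related concepts: the hive mind, distributed systems, decentralization, and so on. In short, the book describes bottom-up systems, as opposed to traditional top-down ones. "System" here is meant very broadly: it can be a machine, a piece of software, an animal, or even human society, a political system, or the World Wide Web. Perhaps I am just ill-informed, but before reading this book I subconsciously assumed that most systems must have a center that holds absolute authority and issues commands, such as the CPU of a PC, an emperor in ancient times, or the human brain. The book, however, presents a completely different kind of system. Put simply, there is no absolute center: the behavior of each individual determines the action of the whole (roughly, the minority follows the majority, although reality is usually more complicated), and more complex behaviors (operations) are built up layer by layer, in modular fashion, from simple ones. Its distributed nature also makes it more fault tolerant and able to keep running when parts of it fail. Modern science has found such systems in social animals (bee swarms, ant colonies), and advances in brain science suggest that the brain does not control everything in the body the way we once imagined. The joint action of countless neurons gives rise to the brain, and the relationship between the brain and the body's senses does not seem to be one of simple subordination either.
The advantages of such systems have long been borne out in industry. Here I also jot down various related (or perhaps unrelated?) points that came to mind as I read:
On Weibo I saw someone dismiss the author of this book as a big fraud. I will not comment on where the author now goes to give speeches, make money, or evangelize. But a book from the last century that still offers practical insight today is, I think, worth reading.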
Both TD and Monte Carlo methods use experience to solve the prediction problem. Given some experience following a policy $\pi$, both methods update their estimate $V$ of $v_\pi$ for the nonterminal states $S_t$ occurring in that experience. Whereas Monte Carlo methods must wait until the end of the episode to determine the increment to $V(S_t)$ (only then is $G_t$ known), TD methods need wait only until the next time step. The simplest TD method, known as TD(0), is
$$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right].$$
TD methods combine the sampling of Monte Carlo with the bootstrapping of DP. As we shall see, with care and imagination this can take us a long way toward obtaining the advantages of both Monte Carlo and DP methods.
Note that the quantity in brackets in the TD(0) update is a sort of error, measuring the difference between the estimated value of $S_t$ and the better estimate $R_{t+1}+\gamma V(S_{t+1})$. This quantity, called the TD error, arises in various forms throughout reinforcement learning:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t).$$
Also note that the Monte Carlo error can be written as a sum of TD errors (exactly so when $V$ is not changed during the episode):
$$G_t - V(S_t) = \sum_{k=t}^{T-1}\gamma^{k-t}\delta_k.$$
This fact and its generalizations play important roles in the theory of TD learning.
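To make the TD(0) update concrete, here is a minimal sketch of tabular TD(0) prediction, assuming the update $V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$. The environment is an assumed toy example, not something taken from the text: a five-state random walk that starts in the middle, moves left or right with equal probability, and pays reward 1 only when it exits on the right. The step size, episode count, and seed are likewise illustrative.

```python
import random

def td0_random_walk(num_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Tabular TD(0) prediction on a 5-state random walk under a uniform random policy."""
    rng = random.Random(seed)
    n = 5
    V = [0.5] * n                      # initial value estimates for states 0..4
    for _ in range(num_episodes):
        s = n // 2                     # every episode starts in the middle state
        while True:
            s_next = s + rng.choice((-1, 1))
            terminal = s_next < 0 or s_next >= n
            reward = 1.0 if s_next >= n else 0.0   # reward only on the right exit
            v_next = 0.0 if terminal else V[s_next]
            td_error = reward + gamma * v_next - V[s]
            V[s] += alpha * td_error   # update immediately, without waiting for G_t
            if terminal:
                break
            s = s_next
    return V

V = td0_random_walk()
true_values = [(i + 1) / 6 for i in range(5)]  # exact values for this random walk
print([round(v, 2) for v in V])
```

Each update uses only the next reward and the current estimate of the next state, so learning proceeds step by step inside an episode. With a constant step size the estimates keep fluctuating around the true values $1/6,\dots,5/6$ rather than converging exactly.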
Suppose there is available only a finite amount of experience, say 10 episodes or 100 time steps. In this case, a common approach with incremental learning methods is to present the experience repeatedly until the method converges upon an answer. Updates are made only after processing each complete batch of training data. We call this batch updating.
Batch Monte Carlo methods always find the estimates that minimize mean-squared error on the training set, whereas batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process. In this case, the maximum-likelihood estimate is the model of the Markov process formed in the obvious way from the observed episodes: the estimated transition probability from $i$ to $j$ is the fraction of observed transitions from $i$ that went to $j$, and the associated expected reward is the average of the rewards observed on those transitions. Given this model, we can compute the estimate of the value function that would be exactly correct if the model were exactly correct. This is called the certainty-equivalence estimate because it is equivalent to assuming that the estimate of the underlying process was known with certainty rather than being approximated. In general, batch TD(0) converges to the certainty-equivalence estimate.
As usual, we follow the pattern of generalized policy iteration (GPI), only this time using TD methods for the evaluation or prediction part.
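In the previous section we considered transitions from state to state and learned the values of states. Now we consider transitions from state–action pair to state–action pair, and learn the values of state–action pairs. The theorems assuring the convergence of state values under TD(0) also apply to the corresponding algorithm for action values:
$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\right].$$
This update uses every element of the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, which gives the algorithm its name, Sarsa.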
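One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning, defined by:
$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)\right].$$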
In this case, the learned action-value function, $Q$, directly approximates $q^*$, the optimal action-value function, independent of the policy being followed.
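Consider the algorithm with the update rule
$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\Big[R_{t+1} + \gamma \sum_a \pi(a \vert S_{t+1})Q(S_{t+1},a) - Q(S_t,A_t)\Big]$$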
but that otherwise follows the schema of Q-learning. Given the next state $S_{t+1}$, this algorithm moves deterministically in the same direction as Sarsa moves in expectation, and accordingly it is called expected Sarsa.
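Sarsa, Q-learning, and expected Sarsa share the same update form and differ only in the target they move toward. The sketch below places the three targets side by side, assuming a tabular `Q` stored as a list of per-state action-value lists. The function names and the tiny two-state table are illustrative only.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """Target using the action actually taken in the next state."""
    return r + gamma * Q[s_next][a_next]

def q_learning_target(Q, r, s_next, gamma):
    """Target using the greedy (maximum) action value in the next state."""
    return r + gamma * max(Q[s_next])

def expected_sarsa_target(Q, r, s_next, policy_probs, gamma):
    """Target using the expectation of Q over the policy's action probabilities."""
    expected_q = sum(p * q for p, q in zip(policy_probs, Q[s_next]))
    return r + gamma * expected_q

def td_update(Q, s, a, target, alpha):
    """Shared update: move Q[s][a] a fraction alpha of the way toward the target."""
    Q[s][a] += alpha * (target - Q[s][a])

Q = [[0.0, 0.0], [1.0, 3.0]]   # Q[state][action], two states and two actions
r, gamma = 1.0, 0.5

t_sarsa = sarsa_target(Q, r, s_next=1, a_next=0, gamma=gamma)
t_q = q_learning_target(Q, r, s_next=1, gamma=gamma)
t_expected = expected_sarsa_target(Q, r, s_next=1, policy_probs=[0.5, 0.5], gamma=gamma)
t_expected_greedy = expected_sarsa_target(Q, r, s_next=1, policy_probs=[0.0, 1.0], gamma=gamma)

td_update(Q, s=0, a=0, target=t_sarsa, alpha=0.5)
print(t_sarsa, t_q, t_expected, Q[0][0])  # 1.5 2.5 2.0 0.75
```

The expected Sarsa target does not depend on which next action happened to be sampled, which is why it moves deterministically where Sarsa moves only in expectation. When the policy probabilities are greedy with respect to `Q`, the expected Sarsa target coincides with the Q-learning target.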
All the control algorithms that we have discussed so far involve maximization in the construction of their target policies. In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a significant positive bias. We call this maximization bias.
One way to view the problem is that it is due to using the same samples (plays) both to determine the maximizing action and to estimate its value. Suppose we divided the plays in two sets and used them to learn two independent estimates, call them $Q_1(a)$ and $Q_2(a)$, each an estimate of the true value $q(a)$, for all $a\in\mathcal{A}$. We could then use one estimate, say $Q_1$, to determine the maximizing action $A^*=\mathrm{argmax}_aQ_1(a)$, and the other, $Q_2$, to provide the estimate of its value, $Q_2(A^*)=Q_2(\mathrm{argmax}_aQ_1(a))$. This estimate will then be unbiased in the sense that $\mathbb{E}[Q_2(A^*)]=q(A^*)$. We can also repeat the process with the role of the two estimates reversed to yield a second unbiased estimate $Q_1(\mathrm{argmax}_aQ_2(a))$. This is the idea of double learning.
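Maximization bias and the effect of splitting the samples can be seen in a small simulation. The sketch below assumes an illustrative setting that is not from the text: 10 actions whose true values are all zero, with each estimate formed as the mean of 5 noisy $N(0,1)$ plays. It compares the single estimator $\max_a Q_1(a)$ with the double estimator $Q_2(\mathrm{argmax}_a Q_1(a))$.

```python
import random
import statistics

def estimate_max_value(num_actions, samples_per_action, rng):
    """Return (single, double) estimates of max_a q(a) when every true q(a) is 0."""
    def sample_mean():
        plays = [rng.gauss(0.0, 1.0) for _ in range(samples_per_action)]
        return statistics.fmean(plays)

    # Two independent sets of plays give two independent estimates per action.
    Q1 = [sample_mean() for _ in range(num_actions)]
    Q2 = [sample_mean() for _ in range(num_actions)]

    single = max(Q1)                                    # same samples select and evaluate
    best = max(range(num_actions), key=Q1.__getitem__)  # Q1 selects the action ...
    double = Q2[best]                                   # ... Q2 evaluates it
    return single, double

rng = random.Random(0)  # fixed seed so the run is repeatable
results = [estimate_max_value(10, 5, rng) for _ in range(2000)]
single_bias = statistics.fmean(s for s, _ in results)
double_bias = statistics.fmean(d for _, d in results)
print(single_bias, double_bias)
```

The true maximum value is zero here, so any nonzero average is bias. The single estimator is clearly positive on average, while the double estimator averages close to zero.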
The value of a state is the expected return starting from that state. An obvious way to estimate it from experience is to average the returns observed after visits to that state. This idea underlies all Monte Carlo methods.
In particular, suppose we wish to estimate $v_\pi(s)$, the value of a state $s$ under policy $\pi$, given a set of episodes obtained by following $\pi$ and passing through $s$. Each occurrence of state $s$ in an episode is called a visit to $s$. The first-visit MC method estimates $v_\pi(s)$ as the average of the returns following from first visits to $s$, whereas every-visit MC method averages the returns following all visits to $s$.
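The difference between first-visit and every-visit averaging is easiest to see on a tiny example. The sketch below assumes each episode is given as a list of `(state, reward)` pairs, where the reward is the one received on leaving that state. This data layout, the function name, and the two hand-made episodes are illustrative assumptions.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate v_pi(s) by averaging the returns that follow visits to s."""
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the return G_t for every time step by working backwards.
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns_at[t] = G
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue               # first-visit MC ignores later visits
            seen.add(state)
            returns[state].append(returns_at[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

episodes = [
    [("A", 0.0), ("B", 1.0), ("A", 2.0)],   # state A is visited twice
    [("B", 1.0)],
]
v_first = mc_prediction(episodes, first_visit=True)
v_every = mc_prediction(episodes, first_visit=False)
print(v_first)  # {'A': 3.0, 'B': 2.0}
print(v_every)  # {'A': 2.5, 'B': 2.0}
```

State `A` occurs twice in the first episode, with returns 3 and 2. First-visit MC keeps only the first of them, while every-visit MC averages both.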
An important fact about Monte Carlo methods is that the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state, as is the case in DP.
If a model is not available, then it is particularly useful to estimate action values rather than state values. The Monte Carlo methods for this are essentially the same as just presented for state values, except now we talk about visits to a state–action pair rather than to a state. A state–action pair $s, a$ is said to be visited in an episode if ever the state $s$ is visited and action $a$ is taken in it.
The only complication is that many state–action pairs may never be visited. This is the general problem of maintaining exploration.
The overall idea of how Monte Carlo estimation can be used in control is to proceed according to the idea of generalized policy iteration (GPI). In GPI one maintains both an approximate policy and an approximate value function. The value function is repeatedly altered to more closely approximate the value function for the current policy, and the policy is repeatedly improved with respect to the current value function.
We made two unlikely assumptions in order to easily obtain guarantee of convergence for the Monte Carlo method. One was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes.
The assumption that policy evaluation operates on an infinite number of episodes is relatively easy to remove. One approach is to forgo trying to complete policy evaluation before returning to policy improvement. For Monte Carlo policy evaluation it is natural to alternate between evaluation and improvement on an episode-by-episode basis.
The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them. There are two approaches to ensuring this, resulting in what we call on-policy methods and off-policy methods. On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.
In on-policy control methods the policy is generally soft, meaning that $\pi(a \vert s)>0$ for all $s\in\mathcal{S}$ and all $a\in\mathcal{A}(s)$, but gradually shifted closer and closer to a deterministic optimal policy.
All learning control methods face a dilemma: They seek to learn action values conditional on subsequent optimal behavior, but they need to behave non-optimally in order to explore all actions (to find the optimal actions). How can they learn about the optimal policy while behaving according to an exploratory policy? The on-policy approach in the preceding section is actually a compromise. It learns action values not for the optimal policy, but for a near-optimal policy that still explores. A more straightforward approach is to use two policies, one that is learned about and that becomes the optimal policy, and one that is more exploratory and is used to generate behavior. The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.
Suppose we wish to estimate $v_\pi$ or $q_\pi$, but all we have are episodes following another policy $\mu$, where $\mu \ne \pi$. In this case, $\pi$ is the target policy, $\mu$ is the behavior policy, and both policies are considered fixed and given. We require that $\pi(a \vert s)>0$ implies $\mu(a \vert s)>0$. This is called the assumption of coverage.
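Almost all off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another. Given a starting state $S_t$, the probability of the subsequent state-action trajectory, $A_t,S_{t+1},A_{t+1},\dots,S_T$, occurring under any policy $\pi$ is
$$\prod_{k=t}^{T-1}\pi(A_k \vert S_k)\,p(S_{k+1} \vert S_k,A_k),$$
where $p$ is the state-transition probability function.
Thus, the relative probability of the trajectory under the target and behavior policies (the importance-sampling ratio) is
$$\rho_t^T = \frac{\prod_{k=t}^{T-1}\pi(A_k \vert S_k)\,p(S_{k+1} \vert S_k,A_k)}{\prod_{k=t}^{T-1}\mu(A_k \vert S_k)\,p(S_{k+1} \vert S_k,A_k)} = \prod_{k=t}^{T-1}\frac{\pi(A_k \vert S_k)}{\mu(A_k \vert S_k)}.$$
The transition probabilities cancel, so the ratio depends only on the two policies and not on the dynamics of the environment.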
We can define the set of all time steps in which state $s$ is visited, denoted $\mathcal{J}(s)$. This is for an every-visit method; for a first-visit method, $\mathcal{J}(s)$ would only include time steps that were first visits to $s$ within their episodes. Also, let $T(t)$ denote the first time of termination following time $t$, and $G_t$ denote the return after $t$ up through $T(t)$. Then $\{G_t\}_{t\in\mathcal{J}(s)}$ are the returns that pertain to state $s$, and $\{\rho_t^{T(t)}\}_{t\in\mathcal{J}(s)}$ are the corresponding importance-sampling ratios. To estimate $v_{\pi}(s)$, we simply scale the returns by the ratios and average the results:
$$V(s) = \frac{\sum_{t\in\mathcal{J}(s)}\rho_t^{T(t)}G_t}{\lvert\mathcal{J}(s)\rvert}.$$
When importance sampling is done as a simple average in this way it is called ordinary importance sampling.
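An important alternative is weighted importance sampling, which uses a weighted average, defined as
$$V(s) = \frac{\sum_{t\in\mathcal{J}(s)}\rho_t^{T(t)}G_t}{\sum_{t\in\mathcal{J}(s)}\rho_t^{T(t)}},$$
or zero if the denominator is zero.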
The difference between the two kinds of importance sampling is expressed in their biases and variances. The ordinary importance-sampling estimator is unbiased whereas the weighted importance-sampling estimator is biased. On the other hand, the variance of the ordinary importance-sampling estimator is in general unbounded because the variance of the ratios can be unbounded, whereas in the weighted estimator the largest weight on any single return is one. In fact, assuming bounded returns, the variance of the weighted importance-sampling estimator converges to zero even if the variance of the ratios themselves is infinite. In practice, the weighted estimator usually has dramatically lower variance and is strongly preferred.
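The two estimators differ only in their denominators, which a short sketch makes explicit. It assumes each trajectory is a list of `(state, action)` pairs and that policies are given as dictionaries mapping a state to a list of action probabilities. This data layout, the single-state example, and the returns are all illustrative assumptions.

```python
def importance_ratio(trajectory, target, behavior):
    """Product of pi(a|s) / mu(a|s) over the (state, action) pairs of a trajectory."""
    rho = 1.0
    for s, a in trajectory:
        rho *= target[s][a] / behavior[s][a]
    return rho

def ordinary_is(ratios, returns):
    """Ordinary importance sampling: divide by the number of returns."""
    return sum(r * g for r, g in zip(ratios, returns)) / len(returns)

def weighted_is(ratios, returns):
    """Weighted importance sampling: divide by the sum of the ratios."""
    total = sum(ratios)
    if total == 0:
        return 0.0  # defined as zero when the denominator is zero
    return sum(r * g for r, g in zip(ratios, returns)) / total

target = {"s": [1.0, 0.0]}     # target policy always takes action 0
behavior = {"s": [0.5, 0.5]}   # behavior policy picks either action uniformly

trajectories = [[("s", 0)], [("s", 1)], [("s", 0), ("s", 0)]]
returns = [1.0, 5.0, 2.0]

ratios = [importance_ratio(tr, target, behavior) for tr in trajectories]
v_ordinary = ordinary_is(ratios, returns)
v_weighted = weighted_is(ratios, returns)
print(ratios, v_ordinary, v_weighted)
```

The second trajectory takes an action the target policy never takes, so its ratio is zero and its return is ignored by both estimators. The weighted estimate is a convex combination of the remaining returns, so it stays between 1 and 2, whereas the ordinary estimate (about 3.33) lies outside the range of those returns because individual ratios can be large.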
Suppose we have a sequence of returns $G_1, G_2, \dots, G_{n-1}$, all starting in the same state and each with a corresponding random weight $W_i$ (e.g., $W_i=\rho_t^{T(t)}$).