Shortcut Learning Hypothesis of Modern Language Models

2022-03-21 5:33 pm | Comments

Disclaimer: This post was completed in my spare time and has no relevance to my current work. The opinions in this post are my own, not my employers’.

As the already gigantic modern language models become ever larger by the day, people seem to ignore many existing works discussing their limitations. In this blog post, I try to connect the results and observations from several such works to form the “shortcut learning hypothesis” of those language models. Such a hypothesis implies that modern language models are fundamentally limited in their ability to capture long-range dependencies or complicated structures in the data, as long as we stick with the current training objective. Hopefully, this post can help people take a step back and ponder a bit before going all in on the current scale competition of language models.


A 137B parameter language model fails to answer a simple question with prompting. (The example comes from this paper)

Navigate Through the Current AI Job Market: A Retrospect

2022-01-06 6:06 pm | Comments

Inspired by Rosanne Liu's fantastic talk on career paths in AI research and Aleksa Gordić's amazing blog post on landing a job at top-tier AI labs, I want to share my recent experience to offer a more pragmatic perspective. The position spectrum in the current AI industry can be roughly depicted in the figure below:


Figure 1: The AI Job Spectrum

While the aforementioned posts both focus on the research end of the spectrum, this post covers the whole range of the spectrum from my own experience. Out of the five virtual onsites I attended, I received three offers. One of them was a Machine Learning Software Engineer / Research Engineer role from Google. The other two were both Machine Learning Scientist roles, from a large tech company and a large financial company respectively. For the offer from Google, the team match process was quite lengthy; I talked with eight different teams inside Google. All those different positions cover the whole spectrum from heavily research focused to purely product driven, as shown in the figure below:


Figure 2: My options and their rough positions on the spectrum

It’s a long grind from submitting resumes to making the final decision. During this process, I’ve had more than 20 rounds of discussion with different hiring managers, directors and other more senior people from these three companies. In addition, I’ve consulted with many friends of mine in the industry to get advice.

At the beginning of my job search, my preference leaned towards the research end of the spectrum. However, as things developed, my perception of the AI industry gradually changed. Eventually I chose a team in Google leaning towards the applied end of the spectrum. It’s a team in Cloud AI working on dialogue systems, an area that depends heavily on the latest NLP technology (my expertise), already generates a lot of value and has great growth potential on tap. At the same time, there are also plenty of opportunities to collaborate with the research teams inside Google.

In this post, I’d like to share my whole journey, accompanied by some high-level suggestions and my thought process towards the final decision. Hopefully, this retrospect can provide some useful information for those who are passionate about AI and hope to find a matching position in the industry.

Notes on Convex Optimization (5): Newton's Method

2018-11-13 11:25 pm | Comments

For $x\in\mathbf{dom}\ f$, the vector

$\Delta x_{nt}=-\nabla^2 f(x)^{-1}\nabla f(x)$

is called the Newton step (for $f$, at $x$).

Minimizer of second-order approximation

The second-order Taylor approximation $\hat f$ of $f$ at $x$ is

\begin{equation} \hat f(x+v) = f(x) + \nabla f(x)^T v + \frac12 v^T \nabla^2 f(x) v, \tag{1} \label{eq:1} \end{equation}

which is a convex quadratic function of $v$. Setting its gradient with respect to $v$ to zero gives $\nabla f(x) + \nabla^2 f(x) v = 0$, so it is minimized when $v=\Delta x_{nt}$. Thus, the Newton step $\Delta x_{nt}$ is what should be added to the point $x$ to minimize the second-order approximation of $f$ at $x$.
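As a quick numerical sanity check, the sketch below computes the Newton step for a small two-variable function and confirms that it zeroes the gradient of the second-order approximation. The test function $f(x)=e^{x_1}+x_1^2+x_1x_2+x_2^2$, the evaluation point, and all helper names are illustrative assumptions, not part of the notes.

```python
import math

def grad(x):
    """Gradient of the assumed example f(x) = exp(x1) + x1^2 + x1*x2 + x2^2."""
    x1, x2 = x
    return [math.exp(x1) + 2 * x1 + x2, x1 + 2 * x2]

def hess(x):
    """Hessian of the same example; positive definite for every x."""
    x1, _ = x
    return [[math.exp(x1) + 2.0, 1.0], [1.0, 2.0]]

def solve_2x2(a, b):
    """Solve the 2x2 linear system a z = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def newton_step(x):
    """Newton step: the solution of hess(x) dx = -grad(x)."""
    g = grad(x)
    return solve_2x2(hess(x), [-g[0], -g[1]])

def quad_model_grad(x, v):
    """Gradient in v of the second-order approximation: grad(x) + hess(x) v."""
    g, h = grad(x), hess(x)
    return [g[0] + h[0][0] * v[0] + h[0][1] * v[1],
            g[1] + h[1][0] * v[0] + h[1][1] * v[1]]

x = [1.0, -0.5]                      # arbitrary point chosen for the check
dx_nt = newton_step(x)
residual = quad_model_grad(x, dx_nt)  # should be numerically zero
print(dx_nt, residual)
```

Solving the linear system directly, rather than forming the inverse Hessian, is the usual choice in practice; Cramer's rule is used here only because the example is 2x2.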

Notes on Convex Optimization (4): Gradient Descent Method

2018-11-05 12:29 am | Comments

Descent methods

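$x^{(k+1)}=x^{(k)} + t^{(k)}\Delta x^{(k)}$ with $f(x^{(k+1)}) < f(x^{(k)})$

- $\Delta x$ is the step or search direction; $t$ is the step size or step length
- from convexity, $f(x^{(k+1)}) < f(x^{(k)})$ implies $\nabla f(x^{(k)})^T \Delta x^{(k)} <0$, i.e., $\Delta x$ must be a descent direction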

General descent method.
given a starting point $x \in \mathbf{dom} \enspace f$.
repeat
- Determine a descent direction $\Delta x$.
- Line search. Choose a step size $t > 0$.
- Update. $x := x+ t\Delta x$.

until stopping criterion is satisfied
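The general descent method above can be sketched in a few lines of Python. This is one possible instantiation under stated assumptions, not the only one: the descent direction is the negative gradient, the line search is backtracking with parameters `alpha` and `beta`, and the stopping criterion is a small gradient norm. The quadratic test function and all names are hypothetical.

```python
def gradient_descent(f, grad, x0, alpha=0.3, beta=0.8, tol=1e-6, max_iter=10_000):
    """Descent method with direction -grad f(x) and backtracking line search."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        sq_norm = sum(gi * gi for gi in g)
        if sq_norm <= tol ** 2:          # stopping criterion: ||grad f(x)|| <= tol
            break
        dx = [-gi for gi in g]           # descent direction: grad^T dx = -||grad||^2 < 0
        fx, t = f(x), 1.0
        # Backtracking: shrink t until f(x + t dx) <= f(x) + alpha * t * grad^T dx.
        # The lower bound on t is only a safeguard against floating-point stalls.
        while t > 1e-12 and f([xi + t * di for xi, di in zip(x, dx)]) > fx - alpha * t * sq_norm:
            t *= beta
        x = [xi + t * di for xi, di in zip(x, dx)]   # update
    return x

# Assumed test problem: a convex quadratic with minimizer (1, -2).
def f(x):
    return (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2

def grad_f(x):
    return [2 * (x[0] - 1), 20 * (x[1] + 2)]

x_min = gradient_descent(f, grad_f, [5.0, 5.0])
print(x_min)
```

Swapping in a different rule for `dx` (for example the Newton step) keeps the same loop structure, which is the point of stating the method generically.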

Notes on Convex Optimization (3): Unconstrained Minimization Problems

2018-09-29 3:15 pm | Comments

Unconstrained optimization problems are defined as follows:

\begin{equation} \text{minimize}\quad f(x) \tag{1} \label{eq:1} \end{equation}

where $f: \mathbf{R}^n \rightarrow \mathbf{R}$ is convex and twice continuously differentiable (which implies that $\mathbf{dom}\enspace f$ is open). We denote the optimal value $\inf_xf(x)=f(x^\ast)$ by $p^\ast$. Since $f$ is differentiable and convex, a necessary and sufficient condition for a point $x^\ast$ to be optimal is

\begin{equation} \nabla f(x^\ast)=0. \tag{2} \label{eq:2} \end{equation}

Thus, solving the unconstrained minimization problem \eqref{eq:1} is the same as finding a solution of \eqref{eq:2}, which is a set of $n$ equations in the $n$ variables $x_1, \dots, x_n$. Usually, the problem must be solved by an iterative algorithm. By this we mean an algorithm that computes a sequence of points $x^{(0)}, x^{(1)}, \dots \in \mathbf{dom}\enspace f$ with $f(x^{(k)})\rightarrow p^\ast$ as $k\rightarrow\infty$. The algorithm is terminated when $f(x^{(k)}) - p^\ast \le \epsilon$, where $\epsilon>0$ is some specified tolerance.
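
As a simple worked example (a standard textbook case, with $P$, $q$ and $r$ introduced here purely for illustration), take the convex quadratic

$f(x)=\frac12 x^TPx+q^Tx+r,\quad P\in\mathbf{S}^n_+$

The optimality condition $\nabla f(x^\ast)=0$ becomes the linear system

$Px^\ast+q=0$

If $P\succ0$, there is a unique solution $x^\ast=-P^{-1}q$ and no iteration is needed. For most other choices of $f$ the $n$ equations are nonlinear, which is why an iterative algorithm is usually required.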

[Notes on Mathematics for ESL] Chapter 10: Boosting and Additive Trees

2017-12-14 7:07 pm | Comments

10.5 Why Exponential Loss?

Derivation of Equation (10.16)

Since $Y\in\{-1,1\}$, we can expand the expectation as follows:

$\text{E}_{Y\vert x}(e^{-Yf(x)}) = \Pr(Y=1 \vert x)e^{-f(x)} + \Pr(Y=-1\vert x)e^{f(x)}$

In order to minimize the expectation, we set its derivative with respect to $f(x)$ to zero:

$-\Pr(Y=1\vert x)e^{-f(x)}+\Pr(Y=-1\vert x)e^{f(x)}=0$

which gives:

$f^*(x)=\frac12\log\frac{\Pr(Y=1\vert x)}{\Pr(Y=-1\vert x)}$
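
A small numerical check of this result, as a sketch: for an assumed value $\Pr(Y=1\vert x)=0.8$, a brute-force grid search over $f$ should land next to half the log-odds. The grid range, step, and function names are arbitrary illustrative choices.

```python
import math

def expected_exp_loss(f, p):
    """E[exp(-Y f)] for Y in {-1, 1} with Pr(Y = 1 | x) = p."""
    return p * math.exp(-f) + (1 - p) * math.exp(f)

def population_minimizer(p):
    """Half the log-odds, the closed form derived above."""
    return 0.5 * math.log(p / (1 - p))

p = 0.8                                          # assumed class probability
f_star = population_minimizer(p)
grid = [i / 1000 for i in range(-3000, 3001)]    # f values from -3 to 3, step 0.001
f_grid = min(grid, key=lambda f: expected_exp_loss(f, p))

# Inverting the closed form recovers the class probability.
p_recovered = 1 / (1 + math.exp(-2 * f_star))
print(f_star, f_grid, p_recovered)
```

The last line reflects the fact that the minimizer is a monotone transform of $\Pr(Y=1\vert x)$, so thresholding $f^*(x)$ at zero corresponds to thresholding the probability at one half.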

[Notes on Mathematics for ESL] Chapter 6: Kernel Smoothing Methods

2017-10-27 6:57 pm | Comments

6.1 One-Dimensional Kernel Smoothers

Notes on Local Linear Regression

Locally weighted regression solves a separate weighted least squares problem at each target point $x_0$:

$\min_{\alpha(x_0),\beta(x_0)}\sum_{i=1}^NK_\lambda(x_0,x_i)[y_i-\alpha(x_0)-\beta(x_0)x_i]^2$

The estimate is $\hat f(x_0)=\hat\alpha(x_0)+\hat\beta(x_0)x_0$. Define the vector-valued function $b(x)^T=(1,x)$. Let $\mathbf{B}$ be the $N \times 2$ regression matrix with $i$th row $b(x_i)^T$, $\mathbf{W}(x_0)$ the $N\times N$ diagonal matrix with $i$th diagonal element $K_\lambda (x_0, x_i)$, and $\theta=(\alpha(x_0), \beta(x_0))^T$.

Then the above optimization problem can be rewritten as

$\min_\theta(y-\mathbf{B}\theta)^T\mathbf{W}(x_0)(y-\mathbf{B}\theta)$

Setting the derivative with respect to $\theta$ to zero, we get

$\begin{split} &\mathbf{B}^T\mathbf{W}(x_0)(y-\mathbf{B}\hat\theta)=0 \\ &\mathbf{B}^T\mathbf{W}(x_0)\mathbf{B}\hat\theta = \mathbf{B}^T\mathbf{W}(x_0)y \\ &\hat\theta= (\mathbf{B}^T\mathbf{W}(x_0)\mathbf{B})^{-1}\mathbf{B}^T\mathbf{W}(x_0)y \end{split}$
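
The closed form can be sketched in plain Python by building the $2\times2$ matrix $\mathbf{B}^T\mathbf{W}(x_0)\mathbf{B}$ and the 2-vector $\mathbf{B}^T\mathbf{W}(x_0)y$ from weighted sums. The Gaussian kernel, the bandwidth value, and the toy data below are assumptions made for illustration; any kernel $K_\lambda$ could be substituted.

```python
import math

def gaussian_kernel(x0, x, lam):
    """Assumed kernel K_lambda(x0, x); other kernels would work the same way."""
    return math.exp(-0.5 * ((x - x0) / lam) ** 2)

def local_linear_fit(x0, xs, ys, lam):
    """Return f_hat(x0) = alpha_hat + beta_hat * x0 from the weighted normal equations."""
    w = [gaussian_kernel(x0, xi, lam) for xi in xs]
    # Entries of B^T W B = [[s0, s1], [s1, s2]] with b(x) = (1, x).
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, xs))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, xs))
    # Entries of B^T W y = (t0, t1).
    t0 = sum(wi * yi for wi, yi in zip(w, ys))
    t1 = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
    # Solve the 2x2 system for theta_hat = (alpha_hat, beta_hat).
    det = s0 * s2 - s1 * s1
    alpha_hat = (s2 * t0 - s1 * t1) / det
    beta_hat = (s0 * t1 - s1 * t0) / det
    return alpha_hat + beta_hat * x0

# Toy data lying exactly on a line; a local linear fit should reproduce it.
xs = [i / 10 for i in range(11)]
ys = [2 * xi + 1 for xi in xs]
estimate = local_linear_fit(0.35, xs, ys, lam=0.3)
print(estimate)
```

Because the data are exactly linear, the weighted least squares residual is zero at the true line regardless of the weights, which makes this a convenient check of the algebra.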

[Notes on Mathematics for ESL] Chapter 5: Basis Expansions and Regularization

2017-10-24 8:34 pm | Comments

5.4 Smoothing Splines

Derivation of Equation (5.12)

Setting the derivative of Equation (5.11) with respect to $\theta$ to zero, we get

$\frac{\partial\text{RSS}(\theta,\lambda)}{\partial\theta} = -2N^T(y-N\theta)+2\lambda\Omega_N\theta = 0$

Collecting the terms involving $\theta$ on one side and the remaining terms on the other, we get

$(N^TN+\lambda\Omega_N)\theta = N^Ty$

Multiplying both sides by the inverse of $N^TN+\lambda\Omega_N$ completes the derivation of Equation (5.12):

$\hat\theta = (N^TN+\lambda\Omega_N)^{-1}N^Ty$

Notes on Convex Optimization (2): Convex Functions

2017-10-21 1:59 pm | Comments

1. Basic Properties and Examples

1.1 Definition

$f:\mathbb{R}^n \rightarrow \mathbb R$ is convex if $\mathbf{dom}\ f$ is a convex set and

$f(\theta x+(1-\theta)y)\le\theta f(x)+(1-\theta)f(y)$

for all $x,y\in \mathbf{dom}\ f, 0\le\theta\le1$

$f$ is concave if $-f$ is convex
$f$ is strictly convex if $\mathbf{dom}\ f$ is convex and $f(\theta x+(1-\theta)y)<\theta f(x)+(1-\theta)f(y)$ for $x,y\in\mathbf{dom}\ f,x\ne y, 0<\theta<1$

$f:\mathbb{R}^n \rightarrow \mathbb R$ is convex if and only if the function $g: \mathbb{R} \rightarrow \mathbb{R}$,

$g(t)=f(x+tv),\quad\mathbf{dom}\ g=\{t\mid x+tv\in\mathbf{dom}\ f\}$

is convex (in $t$) for any $x \in \mathbf{dom}\ f, v\in\mathbb R^n$
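
The restriction-to-a-line characterization suggests a cheap way to hunt for counterexamples numerically: pick random lines and test the defining inequality on the one-dimensional function $g$. The sketch below does this for log-sum-exp, a function known to be convex; the choice of function, sampling ranges, and tolerance are assumptions. A check like this can only fail to find violations; it does not prove convexity.

```python
import math
import random

def log_sum_exp(x):
    """log(sum_i exp(x_i)), computed in a numerically stable way."""
    m = max(x)
    return m + math.log(sum(math.exp(xi - m) for xi in x))

def restrict_to_line(f, x, v):
    """Return g(t) = f(x + t v)."""
    return lambda t: f([xi + t * vi for xi, vi in zip(x, v)])

def violates_convexity(g, s, t, theta, tol=1e-12):
    """True if g(theta s + (1 - theta) t) exceeds theta g(s) + (1 - theta) g(t)."""
    return g(theta * s + (1 - theta) * t) > theta * g(s) + (1 - theta) * g(t) + tol

rng = random.Random(0)   # fixed seed so the check is reproducible
violations = 0
for _ in range(1000):
    x = [rng.uniform(-2, 2) for _ in range(3)]
    v = [rng.uniform(-1, 1) for _ in range(3)]
    g = restrict_to_line(log_sum_exp, x, v)
    s, t, theta = rng.uniform(-3, 3), rng.uniform(-3, 3), rng.random()
    violations += violates_convexity(g, s, t, theta)
print(violations)
```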

Notes on Convex Optimization (1): Convex Sets

2017-10-17 2:22 am | Comments

1. Affine and Convex Sets

Suppose $x_1\ne x_2$ are two points in $\mathbb{R}^n$.

1.1 Affine sets

line through $x_1$, $x_2$: all points

$x=\theta x_1 + (1-\theta)x_2\quad(\theta \in \mathbb{R})$

affine set: contains the line through any two distinct points in the set

$x_1,x_2\in C,\ \theta\in\mathbb{R} \Longrightarrow \theta x_1 + (1-\theta)x_2 \in C$
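
For example (a standard illustration, with $A$ and $b$ introduced here as assumed notation), the solution set of a system of linear equations, $C=\{x\mid Ax=b\}$, is affine. If $Ax_1=b$ and $Ax_2=b$, then for any $\theta\in\mathbb{R}$

$A(\theta x_1+(1-\theta)x_2)=\theta Ax_1+(1-\theta)Ax_2=\theta b+(1-\theta)b=b$

so the whole line through $x_1$ and $x_2$ stays in $C$.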


About Me

A researcher and engineer passionate about Machine Learning and Natural Language Processing

A calculated speculator seeking risk and reward asymmetries, and a value investor sticking to the margin of safety.

Play piano and Nintendo Switch

Read history, investment, Sci-Fi and fantasy

Bodyweight fitness, hiking and skiing

Fan of Portland Trail Blazers, Former SNH48 member Kiku and Blackpink member Rosé

Toronto, Edmonton, Beijing and Suzhou

Opinions posted in this blog are my own!


Billy Ian's Short Leisure-time Wander

into learning, investment, intelligence and beyond