L2: RNNs, transformers and attention

In the previous lecture, we learned about the classic feed-forward network, which has taken the AI world pretty far. But feed-forward networks have a major limitation: they lack memory. Ask one of these networks to read a sentence word by word and, no matter how many layers it has, it only ever sees the current word; it retains nothing about the words that came before.

Recurrent neural networks are models that can process sequential data: sentences, multi-day weather reports, frames of a video, and so on. The ability to process sequential data, what we call sequence modeling, takes several forms:

| Type | Example |
| --- | --- |
| One-to-one (no sequential data) | Binary classification |
| Many-to-one (input is sequential) | Sentiment classification: sentence in, sentiment out |
| One-to-many (output is sequential) | Image captioning: image in, sentence out |
| Many-to-many (input and output are sequential) | Machine translation |

Revisiting the perceptron

Suppose we're tasked with predicting the motion of a ball given an input video. How could we extend the feed-forward network to process several video frames at once? One approach would be to create copies of the same model, feeding each copy its own distinct video frame in the sequence.

The issue with this is that the output of one model copy can only depend on its own input, not on the input data from other copies. The copies are completely disconnected. In our prediction example, the computer is forced to predict the ball's motion using a single static image of the ball. This turns our prediction into a random guess.

[Figure: hand-drawn sketch of the copied feed-forward networks, one per video frame]

The naive "copy" approach fails to consider inputs from separate timesteps.

To add the missing connections, we can transform the traditional hidden layers (simplified as empty rectangles above) into recurrent cells. These cells have not one set of weights but three.

[Figure: hand-drawn sketch of the same network with the hidden layers replaced by recurrent cells connecting the timesteps]

Let's go over our new model variables: $x_t$ is the input at timestep $t$; $h_t$ is the hidden state, which carries information forward from previous timesteps; $W_{xh}$, $W_{hh}$, and $W_{hy}$ are the input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices; and $\hat{y}_t$ is the predicted output at timestep $t$.

It's important to keep in mind that all three weight matrices are shared across every timestep, so the only quantities that change from step to step are $x_t$ and $h_t$; and $h_t$, as mentioned above, depends on both $x_t$ and $h_{t-1}$. Just as in lecture one, to calculate our outputs we multiply the inputs by our weights and send the result through a nonlinear function:

$$h_t = \tanh\left(W_{hh}^{T} h_{t-1} + W_{xh}^{T} x_t\right), \qquad \hat{y}_t = W_{hy}^{T} h_t \tag{1}$$
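
To make equation (1) concrete, here is a minimal sketch of a single-layer RNN forward pass in NumPy. The dimensions, the random weight initialization, and the variable names are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 8-dimensional inputs, a 16-dimensional hidden state,
# 4-dimensional outputs, and a sequence of 10 timesteps.
input_dim, hidden_dim, output_dim, T = 8, 16, 4, 10

# The three weight matrices, shared across every timestep.
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(hidden_dim, output_dim))  # hidden -> output

xs = rng.normal(size=(T, input_dim))  # one input vector per timestep
h = np.zeros(hidden_dim)              # initial hidden state

outputs = []
for t in range(T):
    # Equation (1): the new hidden state mixes the previous state with the new input.
    h = np.tanh(W_hh.T @ h + W_xh.T @ xs[t])
    # The prediction at timestep t is a linear readout of the hidden state.
    outputs.append(W_hy.T @ h)
```

Note how the loop reuses the same three matrices at every step; only the input and the hidden state change.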

Embedding and indexing

In order to properly model sequential data, we need to (1) index each input, assigning every element of the corpus (for example, every word in a vocabulary) an integer, and (2) embed each index as a fixed-size numerical vector that the network can operate on.

Given a fixed corpus of input data (for example, a word dictionary), embedding transforms each input's index into a vector of fixed size. Embeddings can be either one-hot or learned. One-hot embeddings become impractical for large corpora, since each vector is as long as the corpus and almost entirely zeros.
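
To make the indexing-then-embedding pipeline concrete, here is a minimal NumPy sketch of both options. The toy vocabulary, the embedding dimension, and the random initialization of the learned table are illustrative assumptions; in practice the learned table is trained along with the rest of the network.

```python
import numpy as np

# Step 1: indexing. A toy vocabulary mapping each word to an integer.
vocab = {"the": 0, "ball": 1, "rolls": 2, "left": 3}
sentence = ["the", "ball", "rolls"]
indices = [vocab[w] for w in sentence]          # [0, 1, 2]

# Option A: one-hot embedding. Each vector is as long as the vocabulary
# and contains a single 1, so it scales poorly to large corpora.
one_hot = np.eye(len(vocab))[indices]           # shape (3, 4)

# Option B: learned embedding. A trainable table of fixed-size vectors;
# embedding a word is just a row lookup. Initialized randomly here.
embedding_dim = 5
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), embedding_dim))
embedded = embedding_table[indices]             # shape (3, 5)
```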

Backpropagation through time

In recurrent neural nets, backpropagation is done both through each individual timestep's computations (from the loss back to the weights) and backward across timesteps (from the current timestep all the way back to the first). This is why the procedure is called backpropagation through time (BPTT).

BPTT is tricky to implement. The gradient that reaches an early timestep is a long chain of repeated multiplications, and typical networks can have millions or billions of weights. When many of the multiplied values are greater than one, the chain compounds into exploding gradients; conversely, when many values are less than one, it shrinks into vanishing gradients.
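
A toy illustration of why this happens: the gradient flowing back to timestep zero picks up one multiplicative factor per later timestep, so the product either blows up or collapses depending on whether those factors sit above or below one. The factor values (1.1 and 0.9) and the 100-step horizon below are arbitrary choices for the demo.

```python
import numpy as np

timesteps = 100

# Each backward step through time multiplies the gradient by another factor
# (in a real RNN, by the recurrent Jacobian). Factors slightly above one
# compound into an enormous gradient; factors slightly below one shrink it
# toward zero.
exploding = np.prod(np.full(timesteps, 1.1))   # ~1.4e4
vanishing = np.prod(np.full(timesteps, 0.9))   # ~2.7e-5

print(f"100 factors of 1.1 -> {exploding:.3g}")
print(f"100 factors of 0.9 -> {vanishing:.3g}")
```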

One solution to exploding gradients is simply to clip the gradient to a maximum value, as sketched below. Solutions to vanishing gradients include choosing activation functions (such as ReLU) whose derivatives don't shrink the gradient, initializing the weights carefully (for example, to the identity matrix), and, most importantly, using gated cells.
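
Here is a minimal sketch of norm-based gradient clipping in NumPy; the maximum norm of 5.0 and the example gradient are illustrative assumptions.

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# Example: a gradient that has started to explode gets scaled back down.
g = np.array([30.0, -40.0])        # norm 50
print(clip_gradient(g))            # [ 3. -4.]  (norm 5)
```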

Since gradients are propagated from the current timestep backward through every earlier one, vanishing gradients bias models toward short-term dependencies: recent timesteps receive useful updates while distant ones barely change.

The most popular solution to vanishing gradients at the moment is the gated cell, in which the RNN uses gates to control the flow of information through its recurrent units. The most popular gated-cell architecture is long short-term memory (LSTM).

In LSTMs, gated cells can choose to forget, ignore, update, or output the information they're given. Because these gates act on a separate cell state that is updated additively, LSTMs provide uninterrupted gradient flow.
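
Here is a minimal sketch of a single LSTM step in NumPy, just to show where the four gates act. The sizes and random weights are illustrative assumptions, and bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_g, W_o):
    """One LSTM timestep. Each gate looks at the previous hidden state and
    the current input (biases omitted to keep the sketch short)."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)          # forget gate: what to erase from the cell state
    i = sigmoid(W_i @ z)          # input gate: what new information to store
    g = np.tanh(W_g @ z)          # candidate values to add to the cell state
    o = sigmoid(W_o @ z)          # output gate: what part of the cell to expose
    c_t = f * c_prev + i * g      # update the cell state
    h_t = o * np.tanh(c_t)        # output the new hidden state
    return h_t, c_t

# Illustrative sizes and random weights.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_f, W_i, W_g, W_o = (rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
                      for _ in range(4))
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, c = lstm_step(rng.normal(size=input_dim), h, c, W_f, W_i, W_g, W_o)
```

The additive update to the cell state (forget-gated old contents plus input-gated new contents) is what gives the gradient an uninterrupted path backward through time.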

Limits of RNNs

RNNs have three general limitations.

First, RNNs require sequential data to be fed in one timestep at a time, with the entire history squeezed into a single hidden state that is updated at each step. This creates an encoding bottleneck.

Second, because each step depends on the previous step's hidden state, RNNs are not practically parallelizable, creating another computation bottleneck.

Finally, the biggest limitation is that RNNs lack long-term memory. They can find patterns in hundreds of words, for example, but not millions.

Attention

Self-attention allows networks to attend to the most important parts of their input, to relate any two positions in a sequence directly (addressing the long-term memory problem), and to process all timesteps in parallel rather than one at a time (removing the sequential bottleneck). The core computation is sketched below.
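
The core computation behind self-attention is scaled dot-product attention: each position in the sequence emits a query, a key, and a value, and each output is a weighted average of all the values, weighted by how well that position's query matches every key. Below is a minimal NumPy sketch with illustrative sizes and random projection matrices; real transformers add multiple attention heads, an output projection, and positional encodings.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (T, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # Attention weights: how strongly each position attends to every other one.
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # shape (T, T)
    # Each output is a weighted average of all value vectors: every position
    # can look at every other position in a single, parallel step.
    return weights @ V

rng = np.random.default_rng(0)
T, d, d_k = 10, 8, 4                      # illustrative sequence length and sizes
X = rng.normal(size=(T, d))               # one embedding per timestep
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)    # shape (T, d_k)
```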