L1: Foundations of Deep Learning

If we repeat a number of simple mathematical operations millions, billions, even trillions of times, these simple additions, multiplications, and so on begin to develop a complexity seemingly greater than the sum of their parts. Machines that perform enough calculations in just the right structure seem to develop their own personalities, forming "ghosts in the machine."

Machine learning has been around for a long time. Many of the underlying concepts, such as backpropagation and stochastic gradient descent, were established decades ago, but only recently has our technology allowed us to achieve the sheer scale required for the ghosts to emerge.

Let's begin by settling some broad terms. First, AI is a technique that enables computers to mimic human behavior. Within AI, machine learning (ML) is the ability for computers to learn without explicit programming. And within ML, deep learning is the extraction of patterns from data using neural networks.

The perceptron

Neural networks are made of individual units that are sometimes called "nodes," sometimes "neurons," and formally called perceptrons. Each perceptron solves math problems that might appear on a first grader's homework assignment.

A simple perceptron has one input $x$, one weight $w$, and an output $\hat{y}$. All are real numbers. This perceptron's only job is to find $\hat{y} = wx$, where $w$ is basically just the slope of a line. It turns out that if we were to link all of these perceptrons together, forming a network of perhaps a billion of them, our network's final output would still be linearly related to our input. If our perceptrons simply multiply our input, creating a linear operation, then a network of these perceptrons will only be able to see the world in a linear fashion.
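To see why, here's a one-line worked example with two chained perceptrons (the symbols $y_1$, $y_2$, $w_1$, $w_2$ are mine, just for illustration):

$$y_2 = w_2\, y_1 = w_2 (w_1 x) = (w_2 w_1)\, x$$

The composition collapses to multiplication by a single number, so it's just another linear perceptron.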

[Figure: handwritten sketch of the simple one-input, one-weight perceptron]

Our world is not linear. It's a lot more complicated than that. And so we need to add another piece to our perceptron, a non-linear function $g(z)$. Let's also expand our perceptron to take in several inputs at once, so that our input becomes an $m$-dimensional vector $\mathbf{x}$. Finally, we'll add a bias term $w_0$ to add flexibility, analogous to the intercept term $b$ in $y = mx + b$.

[Figure: handwritten sketch of a multi-input perceptron with bias $w_0$ and non-linearity $g$]

Our perceptron's calculation then becomes:

$$\hat{y} = g\left(w_0 + \sum_{i=1}^{m} x_i w_i\right) \tag{11}$$

In other words, our perceptron multiplies each input by its respective weight, sums all of these products together along with the bias $w_0$, and sends the result through our non-linear function.

What non-linearity should we choose for $g(z)$? There are many options, including the sigmoid, hyperbolic tangent, and a simple function called the "rectified linear unit," or ReLU.

[Figure: plots of the sigmoid, tanh, and ReLU activation functions]
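As a rough sketch of what these look like in code (NumPy only; the function names are mine, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real number into (-1, 1).
    return np.tanh(z)

def relu(z):
    # Keeps positive values, zeroes out negative ones.
    return np.maximum(0.0, z)
```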

For simplicity, we can restructure (11) into a matrix operation:

$$\hat{y} = g\left(w_0 + X^{T} W\right) \tag{12}$$

where

$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}, \qquad W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix} \tag{13}$$
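Here's a minimal NumPy sketch of (12), assuming the sigmoid as our $g$; the variable names and values are illustrative:

```python
import numpy as np

def perceptron(x, w, w0):
    # x: (m,) inputs, w: (m,) weights, w0: scalar bias.
    z = w0 + x @ w                     # w0 + X^T W, as in (12)
    return 1.0 / (1.0 + np.exp(-z))    # g(z), here the sigmoid

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.1])
print(perceptron(x, w, w0=0.2))        # a single scalar prediction
```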


We stack and chain individual perceptrons together, forming a network where inputs are passed from the beginning, through several layers of perceptrons, to a layer of outputs $\hat{y}$. We call this simple network architecture the sequential model.
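A minimal sketch of such a sequential model in NumPy, assuming two layers with ReLU in between (the layer sizes and random initialization are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacks of perceptrons: 3 inputs -> 4 hidden units -> 1 output.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def forward(x):
    h = np.maximum(0.0, b1 + x @ W1)   # first layer, ReLU activation
    return b2 + h @ W2                 # output layer y_hat

print(forward(np.array([1.0, 2.0, 3.0])))
```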

Loss

Before we talk about how these networks are trained, let's consider how their performance can be measured in the first place. Loss quantifies the cost of a network's mistakes. For a given training example $i$ (say, a picture or an audio clip), the loss takes the form

$$\mathcal{L}\left(f\left(x^{(i)}, W\right), y^{(i)}\right) \tag{14}$$

where $f(x^{(i)}, W)$ is our prediction and $y^{(i)}$ is the ground truth. Empirical loss is the average loss across an entire dataset:

$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f\left(x^{(i)}, W\right), y^{(i)}\right) \tag{15}$$

Cross entropy loss is used for models that output probabilities (values between 0 and 1):

$$J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log\left(f\left(x^{(i)}, W\right)\right) + \left(1 - y^{(i)}\right) \log\left(1 - f\left(x^{(i)}, W\right)\right) \right] \tag{16}$$

Finally, mean squared error (MSE) is used for regression models that output continuous real values:

$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - f\left(x^{(i)}, W\right)\right)^2 \tag{17}$$
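Both losses are short to write down; here's a NumPy sketch of (16) and (17), assuming `y` and `y_hat` are arrays covering the whole dataset (the `eps` clamp is my own guard against `log(0)`, not part of the formula):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # Binary cross entropy, equation (16); eps avoids log(0).
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def mse(y, y_hat):
    # Mean squared error, equation (17).
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])
print(cross_entropy(y, y_hat), mse(y, y_hat))
```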

Training neural nets

How can we minimize our loss? We simply need to find the optimal set of weights $W^*$ that minimizes $J(W)$:

$$W^{*} = \underset{W}{\operatorname{argmin}} \; J(W) \tag{18}$$

To search for a minimum of a differentiable function, we can employ gradient descent. For each weight $w_i \in W$, calculate the partial derivative $\frac{\partial J(W)}{\partial w_i}$. Collected across all weights, this forms the gradient:

$$\nabla J(w_1, w_2, \ldots, w_n) = \begin{bmatrix} \frac{\partial J}{\partial w_1}(w_1, w_2, \ldots, w_n) \\ \frac{\partial J}{\partial w_2}(w_1, w_2, \ldots, w_n) \\ \vdots \\ \frac{\partial J}{\partial w_n}(w_1, w_2, \ldots, w_n) \end{bmatrix} \tag{19}$$


To calculate each partial derivative, we use the chain rule, working our way from the output back to the desired weight. This gives us the direction in which adjusting $w_i$ would increase the cost $J(W)$ the most. We then nudge $w_i$ in the opposite direction by a small amount, dictated by the learning rate $\eta$.
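A minimal sketch of this loop on a toy one-weight problem (the loss $J(w) = (w - 3)^2$ and all constants are invented for illustration):

```python
def grad_J(w):
    # Gradient of the toy loss J(w) = (w - 3)^2, minimized at w = 3.
    return 2.0 * (w - 3.0)

w, eta = 0.0, 0.1          # initial weight and learning rate
for step in range(100):
    w -= eta * grad_J(w)   # nudge w against the gradient
print(w)                   # converges toward 3.0
```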

Since this process requires us to calculate the partial derivatives from the end of the network back to the beginning, we call it backpropagation.


Gradient descent across two variables. [Source]

Example

[Figure: handwritten sketch of a simplified network: input $x$, intermediate activation $z_1$ weighted by $w_1$, and output $\hat{y}$ weighted by $w_2$]

Consider the simplified network above. The partial derivatives via the chain rule are:

$$\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2}, \qquad \frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1} \tag{20}$$
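A hedged, by-hand sketch of (20), assuming identity activations ($z_1 = w_1 x$, $\hat{y} = w_2 z_1$) and a squared-error loss on a single example; all the numbers are made up:

```python
# Forward pass: x -> z1 = w1 * x -> y_hat = w2 * z1, with J = (y - y_hat)^2.
x, y = 2.0, 10.0
w1, w2 = 0.5, 1.5

z1 = w1 * x
y_hat = w2 * z1

# Backward pass via the chain rule, mirroring equation (20).
dJ_dyhat = -2.0 * (y - y_hat)
dJ_dw2 = dJ_dyhat * z1          # dJ/dy_hat * dy_hat/dw2
dJ_dw1 = dJ_dyhat * w2 * x      # dJ/dy_hat * dy_hat/dz1 * dz1/dw1
print(dJ_dw1, dJ_dw2)
```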

Neural networks in practice

Finding η

Finding $\eta$ can be tricky. Make it too big, and gradient descent will skip over the minimum, preventing convergence. Make it too small, and gradient descent will take a long time to converge, in some cases getting stuck in local minima.

Often, engineers find $\eta$ by brute force, starting with their best guess and adjusting it. A smarter way is to use an algorithm that adapts our learning rate during training based on the current characteristics of the network. There are many gradient descent algorithms that take this smarter approach, including SGD with momentum, Adagrad, Adadelta, RMSProp, and Adam.
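As one concrete example, here's a sketch of an Adam-style update on the same toy loss from before; the decay constants are the commonly cited defaults, and nothing here is specific to any library:

```python
import numpy as np

def grad_J(w):
    return 2.0 * (w - 3.0)   # toy loss J(w) = (w - 3)^2

w, eta = 0.0, 0.1
m, v = 0.0, 0.0
beta1, beta2, eps = 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad_J(w)
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)   # per-step adaptive update
print(w)   # converges toward 3.0
```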

Batching

Gradient descent is a very expensive calculation to perform. To speed it up, we can run it on just a selection of our training data, called a mini-batch. Since this selection should approximately represent the whole dataset, we should achieve similar results while training much faster.
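A sketch of selecting one such mini-batch in NumPy (the batch size and data are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))     # 1000 examples, 3 features each
y = rng.normal(size=1000)

batch_size = 32
idx = rng.choice(len(X), size=batch_size, replace=False)
X_batch, y_batch = X[idx], y[idx]  # compute the gradient on this subset only
```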

Overfitting

Overfitting prevents our models from generalizing to new information. We can employ regularization techniques to prevent overfitting. These constrain the optimization problem during training to discourage complex models.

The most popular regularization method is called dropout. In this approach, we randomly select some nodes in our network and set their activations to zero. This random selection, typically about 50% of a network's perceptrons, changes during each training cycle, and prevents the network from relying too heavily on any single node.
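A sketch of (inverted) dropout applied to one layer's activations, using the roughly 50% drop rate mentioned above; the scaling by `1 / (1 - p_drop)` keeps the expected activation unchanged, a common convention though not the only one:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5):
    # Zero out a random ~p_drop fraction of activations, scale the rest.
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.array([0.2, 1.3, 0.7, 2.1])
print(dropout(h))   # about half the values are zeroed on each call
```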

Early stopping, a very broad category of regularization, identifies the point at which training and test set accuracies begin to diverge during training. That divergence is a clear sign that overfitting has begun, so we stop training there and keep the weights from just before it.
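A sketch of the early stopping logic; `train_one_epoch` and `val_accuracy` are toy placeholders standing in for a real training loop and held-out evaluation:

```python
def train_one_epoch():
    pass   # placeholder: one pass of gradient descent over the training data

def val_accuracy(epoch):
    # Placeholder curve: rises, peaks at epoch 20, then degrades (overfitting).
    return 0.9 - 0.02 * abs(epoch - 20)

best_acc, best_epoch, patience, bad_epochs = 0.0, 0, 5, 0
for epoch in range(100):
    train_one_epoch()
    acc = val_accuracy(epoch)
    if acc > best_acc:
        best_acc, best_epoch, bad_epochs = acc, epoch, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break   # accuracy stopped improving: overfitting has begun
print(best_epoch, best_acc)   # stops shortly after the peak at epoch 20
```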