Lecture 7: Limitations and New Frontiers

Neural networks can be thought of as function approximators: they map data to a decision (typically a prediction or a choice) or, in the generative direction, map a representation back to data. Consider 1989’s Universal Approximation Theorem:

A feedforward network with a single hidden layer is sufficient to approximate, to an arbitrary precision, any continuous function. [1]
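
In symbols, one standard form of the result (following Cybenko’s 1989 statement for a sigmoidal activation \(\sigma\)): for any continuous \(f\) on a compact domain and any \(\varepsilon > 0\), there exist a width \(N\), weights \(w_i, b_i\), and coefficients \(\alpha_i\) such that

\(\left| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^{\top} x + b_i) \right| < \varepsilon \quad \text{for all } x.\)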

The theorem has a few caveats, however, including:

  • The number of hidden units may be infeasibly large.
  • The resulting model may not generalize.
  • The theorem gives no procedure for finding the network’s weights; it only guarantees that they exist.


AI hype

Did theorems like the Universal Approximation Theorem hurt AI research by fueling over-hype?

Deep learning is not alchemy

The idea that you can pour arbitrary data into a neural network and reliably get gold out is a fantasy.

Adversarial attacks

One common adversarial attack on neural nets is the addition of a carefully crafted perturbation to the input image: the perturbation looks like random noise to a human, but it is optimized to change the network’s prediction while leaving the image visually unchanged.

How can this happen? Recall that when training neural nets, we optimize the weights using gradient descent to minimize some loss function:

\(W \leftarrow W - \eta \frac{\partial \mathcal{L}(W, x, y)}{\partial W}\)

In other words, we ask how a small change in the weights $W$ can decrease our loss $\mathcal{L}$. When crafting an adversarial perturbation, we instead fix the weights and ask how a small change to the input $x$ can create a maximal increase in $\mathcal{L}$, ascending the gradient with respect to the input:

\(x \leftarrow x + \eta \frac{\partial \mathcal{L}(W, x, y)}{\partial x}\)
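
As a concrete sketch of this idea, here is a minimal fast-gradient-sign-style (FGSM) attack, assuming a PyTorch model and loss function; the function name adversarial_example and the step size eps are illustrative. Note that FGSM follows the sign of the input gradient rather than the raw gradient shown above.

```python
import torch

def adversarial_example(model, loss_fn, x, y, eps=0.01):
    """Gradient ascent on the input: nudge x in the direction
    that most increases the loss (a minimal FGSM-style sketch)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)   # same loss used in training
    loss.backward()                   # gradient w.r.t. the INPUT, not W
    # FGSM uses the sign of the gradient as the ascent direction.
    return (x_adv + eps * x_adv.grad.sign()).detach()
```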

New Frontiers I: Encoding structure into deep learning

Current neural nets rely on huge datasets. How can we encode structure and prior knowledge into training to reduce this dependence on data?

One solution is the CNN (see the sketch after this list), which:

  1. Applies a set of weights to extract local features.
  2. Uses multiple filters to extract various features.
  3. Spatially shares parameters of each filter.
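
A minimal sketch of these three properties, assuming PyTorch; the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

# One convolutional layer: 16 filters, each a 3x3 set of weights that is
# slid across the image, so every spatial location reuses the same
# parameters (properties 1-3 above, baked into the architecture).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 32, 32)   # one RGB image, 32x32
features = conv(x)              # shape (1, 16, 32, 32): one map per filter
```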

What about more irregular data structures, as opposed to images? Let’s consider graphs to represent this data. Graphs are used for social networks, state machines, transit maps, molecular diagrams, biological relationships, and much more. The issue: graphs have no fixed node ordering or grid structure, so they cannot be captured by standard encodings or Euclidean geometries.

To encode graph structures, we can extend the CNN to graph convolutional networks (GCNs), which still learn a shared set of weights. Instead of convolving a kernel across an image, a GCN applies its weights to the local neighborhood centered around each graph node, aggregating information from that node’s neighbors.
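
A minimal sketch of one such layer, assuming a dense adjacency matrix with self-loops; the mean aggregation used here is an illustrative simplification of the normalization in standard GCNs (Kipf & Welling):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: each node averages its neighborhood's
    features, then a weight matrix shared by all nodes is applied."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (N, N) adjacency with self-loops
        deg = adj.sum(dim=1, keepdim=True)   # node degrees
        agg = (adj @ x) / deg                # mean over each local neighborhood
        return torch.relu(self.linear(agg))  # shared weights, like a CNN kernel
```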

Applications of Graph Neural Networks

  • Molecular discovery (pharmaceuticals) [2] [3]
  • Traffic prediction [4]
  • COVID-19 forecasting, both spatial and temporal [5]

Point clouds (3D data)

Graph NNs extend naturally to point clouds: first connect neighboring points into a mesh-like graph, then feed that graph into a GNN.
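
One simple way to build such a graph, as a sketch: connect each point to its k nearest neighbors. The function knn_graph and the choice of k are illustrative, not the lecture’s specific method.

```python
import torch

def knn_graph(points, k=8):
    """Connect each 3D point to its k nearest neighbors, producing an
    adjacency matrix a GNN layer (like GCNLayer above) can consume."""
    d = torch.cdist(points, points)              # pairwise distances, (N, N)
    idx = d.topk(k + 1, largest=False).indices   # each point's self + k nearest
    adj = torch.zeros_like(d)
    adj.scatter_(1, idx, 1.0)                    # mark neighbors (incl. self-loops)
    return adj
```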

New Frontiers II: Generative AI and Diffusion Models

The landscape of generative modeling

In Lecture 4, we discussed variational autoencoders (VAEs) and generative adversarial networks (GANs). These have important limitations:

  • Mode collapse: the generative process tends to produce repetitive, unoriginal results that regress toward some average of the training data.
  • Limited diversity: they struggle to generate genuinely original, diverse information.
  • Training difficulties: they are hard to train and to scale.

We seek generative models that are stable and efficient, capable of generating high-quality, original data.

Diffusion models

In the case of VAEs and GANs, the task is to generate samples in one shot, directly from low-dimensional latent variables: in other words, to learn the mapping $z \rightarrow x$.

Rather than generate samples in one shot, diffusion models generate results iteratively by repeatedly refining and removing noise, starting from completely random data. By starting with this random “canvas,” diffusion methods can naturally generate results with much higher variability, avoiding mode collapse.

The diffusion process

In forward noising (data to noise), we progressively add noise to an input image, gradually corrupting the data until only random noise remains. This is followed by reverse denoising (noise to data), which learns the mapping from each noisy stage back to its earlier, less noisy neighbor.

Forward noising

  1. Given an image, sample a random noise pattern.
  2. Progressively blend more of that noise into the input image, starting with 0% noise and ending with 100% noise over several timesteps (see the sketch after these steps).
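
A minimal sketch of these two steps; the linear blending schedule here is an illustrative simplification (practical diffusion models such as DDPMs use a variance-preserving noise schedule instead):

```python
import torch

def forward_noise(x0, eps, t, T):
    """Step t of forward noising: blend the clean image x0 with a fixed
    noise pattern eps (t = 0: clean image, t = T: pure noise)."""
    alpha = 1.0 - t / T              # fraction of the original signal kept
    return alpha * x0 + (1.0 - alpha) * eps

x0 = torch.rand(3, 32, 32)           # a clean "image"
eps = torch.randn_like(x0)           # sample the noise pattern once (step 1)
noisy = [forward_noise(x0, eps, t, T=10) for t in range(11)]  # step 2
```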

Reverse denoising

Given the image at some timestep $t$, can we learn to estimate the image at timestep $t-1$? The loss function for reverse denoising is simply the pixel-wise difference between the model’s estimate and the true image at $t-1$.
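
A minimal sketch of that training objective, assuming a network model that takes the noisy image and the timestep (practical diffusion models often predict the added noise rather than the image itself):

```python
import torch
import torch.nn.functional as F

def denoising_loss(model, x_t, x_prev, t):
    """Score the reverse model: predict the slightly less noisy image
    x_{t-1} from x_t, and penalize the pixel-wise difference."""
    x_pred = model(x_t, t)             # estimate of the image at timestep t-1
    return F.mse_loss(x_pred, x_prev)  # pixel-wise difference (MSE)
```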