Neural networks can be thought of as function approximators: they can map data to a decision (typically a prediction or a choice), or vice versa. Consider 1989’s Universal Approximation Theorem:
A feedforward network with a single hidden layer is sufficient to approximate, to an arbitrary precision, any continuous function. [1]
The theorem has a few caveats, however: the hidden layer may need to be impractically wide, the theorem says nothing about how to actually find the right weights, and it offers no guarantees about generalization to unseen data.
AI hype
Did theorems like the Universal Approximation Theorem hurt AI research with over-hype?
In practice, this promise is a fantasy: even highly accurate networks remain surprisingly brittle.
One common adversarial attack on neural nets is the addition of a carefully tuned perturbation, which looks like random noise but is anything but, to the input image:
How can this happen? Remember that when training neural nets, we optimize the weights using gradient descent to minimize some loss function:

\[ W \leftarrow W - \eta \frac{\partial \mathcal{L}(W, x, y)}{\partial W} \]

In other words, we ask how a small change in the weights $W$ can decrease our loss $\mathcal{L}$. In contrast, when crafting adversarial perturbations, we hold the weights fixed and ask how a small change to the input $x$ can create a maximal increase in $\mathcal{L}$:

\[ x \leftarrow x + \eta \frac{\partial \mathcal{L}(W, x, y)}{\partial x} \]
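As a concrete illustration, here is a minimal sketch of that gradient-ascent step on the input in PyTorch, assuming some pretrained classifier `model` and a cross-entropy loss (both placeholders, not part of these notes). The well-known FGSM attack is a variant that keeps only the sign of the gradient.

```python
import torch
import torch.nn.functional as F

def adversarial_step(model, x, y, eta=0.01):
    """One gradient-ascent step on the input x (the second update above):
    the weights stay fixed, and x moves in the direction that increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)   # L(W, x, y), with W frozen
    loss.backward()                           # populates x_adv.grad = dL/dx
    with torch.no_grad():
        # the classic FGSM variant uses eta * x_adv.grad.sign() instead
        x_adv = x_adv + eta * x_adv.grad      # x <- x + eta * dL/dx
    return x_adv.detach()
```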
Current neural nets rely on huge datasets. How can we build structure and prior knowledge into training to reduce this dependence on data?
One solution is the CNN, which builds prior knowledge about images directly into the architecture: each unit connects only to a local neighborhood of pixels, the same kernel weights are shared across all spatial locations, and the resulting features are translation-equivariant (see the sketch below).
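As a rough sketch of that built-in structure, consider a single convolutional layer in PyTorch (the layer sizes below are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn as nn

# A single convolutional layer: one small 3x3 kernel is slid across the whole
# image, so the same weights are reused at every spatial location (weight
# sharing) and each output pixel depends only on a local neighborhood of
# input pixels (local connectivity).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 32, 32)                      # one 32x32 RGB image
print(conv(x).shape)                               # torch.Size([1, 16, 32, 32])
print(sum(p.numel() for p in conv.parameters()))   # 3*16*3*3 + 16 = 448 weights
```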
What about more irregular data structures, as opposed to images? Let’s consider graphs to represent this data. Graphs are used for social networks, state machines, transit maps, molecular diagrams, biological relationships, and much more. The issue: graphs have no fixed node ordering or regular grid structure, so they cannot be captured by standard fixed-size encodings or Euclidean (grid-based) convolutions.
To encode graph structures, we can extend the CNN to graph convolutional networks (GCNs), which still learn a shared set of kernel weights. Instead of convolving that kernel across an image, a GCN applies it to the local neighborhood centered on each graph node, aggregating features from the node’s neighbors (a sketch follows below).
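Below is a minimal sketch of one such layer, roughly in the style of a Kipf-and-Welling GCN; the dense adjacency matrix, the degree normalization, and the variable names are assumptions for illustration, not the exact formulation from these notes.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One GCN-style layer: each node aggregates features from its local
    neighborhood (itself plus adjacent nodes), then applies a shared weight
    matrix, analogous to a convolutional kernel."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)   # shared weights W

    def forward(self, X, A):
        # X: (num_nodes, in_dim) node features, A: (num_nodes, num_nodes) adjacency
        A_hat = A + torch.eye(A.size(0))           # add self-loops
        deg = A_hat.sum(dim=1, keepdim=True)       # node degrees
        H = (A_hat / deg) @ X                      # average over each neighborhood
        return torch.relu(self.linear(H))          # shared transform + nonlinearity
```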
GNNs extend naturally to point clouds: first generate a mesh over the points, then feed the resulting graph into the GNN.
In L4, we discussed variational autoencoders (VAEs) and generative adversarial networks (GANs). These have important limitations: VAEs tend to produce blurry, low-fidelity samples, while GANs are unstable to train and prone to mode collapse.
We seek generative models that are stable and efficient, capable of generating high-quality, original data.
In the case of VAEs and GANs, the task is to generate samples in one shot, directly from low-dimensional latent variables; in other words, to learn the mapping $z \rightarrow x$.
Rather than generating samples in one shot, diffusion models build their results iteratively, refining the sample by removing a little noise at each step, starting from completely random data. By starting with this random “canvas,” diffusion methods can naturally generate results with much higher variability, avoiding mode collapse.
In forward noising (data to noise), we progressively add noise to some input image, gradually corrupting the data until we arrive at pure random noise. This is followed by reverse denoising (noise to data), which learns the mapping from one noisy stage to its earlier, less noisy neighbor.
Given the image at some timestep $T$, can we learn to estimate the image at $T-1$? The loss function $\mathcal{L}(T, T-1)$ for reverse denoising is simply the pixel-wise difference between the model’s estimate of the image at $T-1$ and the true, less noisy image at $T-1$ (see the sketch below).
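A minimal sketch of this training objective, assuming a simple linear noising schedule and a hypothetical `denoiser` network that predicts the less noisy image directly (many real diffusion models predict the added noise instead, but the pixel-wise loss has the same shape):

```python
import torch
import torch.nn.functional as F

def forward_noise(x0, t, T, noise):
    """Forward noising: blend the clean image x0 with Gaussian noise.
    At t = 0 we recover x0; at t = T we are left with (almost) pure noise."""
    alpha = 1.0 - t / T                       # simple linear schedule (an assumption)
    return alpha * x0 + (1.0 - alpha) * noise

def denoising_loss(denoiser, x0, t, T):
    """Reverse-denoising objective: given the image at step t, predict the
    image at step t-1 and compare pixel-wise (the L(T, T-1) from the text)."""
    noise    = torch.randn_like(x0)           # same noise sample for both steps
    x_t      = forward_noise(x0, t, T, noise)
    x_t_prev = forward_noise(x0, t - 1, T, noise)
    pred     = denoiser(x_t, t)               # hypothetical denoising network
    return F.mse_loss(pred, x_t_prev)         # pixel-wise difference
```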