L3: Convolutional Neural Networks

Vision can be described as the ability to know what is where by looking. Humans are adept at seeing the world, processing our surroundings in just a fraction of a second, but to computers, vision is immensely complex.

Nonetheless, computer vision technology has progressed far enough to be deployed in smartphones, cars, medical devices, and robots. This progress has largely been driven by advances in machine learning, specifically in convolutional neural networks (CNNs).

New York City Street Scene Photograph by Darren Martin - Pixels

How can we classify objects within the above image? For each class of object that we wish to identify, we can use prior domain knowledge to identify high-level features for each object. For pedestrians, this might include a head, hands, and legs. For cars, this might include tires, lights, and windows.

There are several issues with manually defining high-level features, however. Objects can rarely be simplified into explicit collections of features. In designing object detection models, we frequently encounter issues like these:

[Figure: examples of recognition challenges. Li/Johnson/Yeung, CS231n.]

Rather than relying on manually-defined high-level features, we can design models that learn hierarchies of features directly from the training data.

[Figure: learned hierarchies of visual features. Lee et al., ICML 2009.]

Using spatial structure

Early machine learning applications for computer vision attempted to feed image data straight into fully connected networks. This meant that 2D images, sometimes with a third dimension for color channels, were flattened into 1D input vectors before being fed through the hidden layers.

This early approach had two major issues. First, the fully connected structure created a huge number of parameters. For example, even a tiny 100×100 image fed into a single hidden layer with five neurons results in 100×100×5 = 50,000 weights alone!
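As a quick sanity check on that arithmetic, here is a minimal Keras sketch; the 100×100 grayscale input and 5-unit layer simply mirror the numbers above:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Flatten a 100x100 grayscale image and feed it to 5 dense neurons.
model = models.Sequential([
    layers.Input(shape=(100, 100)),
    layers.Flatten(),   # 10,000-element input vector
    layers.Dense(5),    # 100*100*5 = 50,000 weights, plus 5 biases
])
model.summary()         # reports 50,005 trainable parameters
```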

Second, the flattening operation threw any spatial context from the image out the window. Relationships between objects across the 2D image plane were erased.

To address these limitations, we can replace the naïve fully-connected architecture with a "patch"-based operation. For each feature that we wish to detect in the input image, we construct a convolutional filter, often called a kernel.

2D convolution animation

Michael Plotke, CC BY-SA 3.0, via Wikimedia Commons

We then apply the filter in a moving window across each pixel in the input image:

$$\sum_{i=1}^{4}\sum_{j=1}^{4} w_{ij}\, x_{i+p,\, j+q} + b \tag{1}$$

where our 4×4 kernel has weights $w_{ij}$ and $(p, q)$ gives the window's position in the image. After the convolution has been applied once per filter (producing one feature map each), the results are sent through a pooling (downsampling) operation, and finally to a fully-connected layer.
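To make the index bookkeeping concrete, here is a minimal, unvectorized NumPy sketch of equation (1); the function name and the 10×10 test input are illustrative:

```python
import numpy as np

# Slide a kernel w over image x, computing the weighted sum plus
# bias b at each valid offset (p, q), as in equation (1).
def conv2d(x: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    kh, kw = w.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for p in range(out_h):
        for q in range(out_w):
            out[p, q] = np.sum(w * x[p:p + kh, q:q + kw]) + b
    return out

x = np.random.rand(10, 10)        # toy input "image"
w = np.random.rand(4, 4)          # 4x4 kernel, as in equation (1)
print(conv2d(x, w, b=0.1).shape)  # (7, 7) feature map
```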

Let's not forget that just as in fully connected networks, we must apply a nonlinearity (typically a ReLU) after each convolution operation.

A typical CNN classifier looks something like this:

Input Image → Convolution + ReLU → Pooling → Convolution + ReLU → Pooling → Flatten → Fully Connected → Softmax

Pooling

Pooling is an operation performed after convolution that downsamples the feature map and provides a degree of spatial invariance. One simple variant is max pooling, which selects the maximum value within each patch, given a patch size and stride.

Max pooling diagram

Max pooling of size 2x2 with stride 2. Papers With Code.

Similarly, we can take the mean within each patch (average pooling), the median, and so on.
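A quick check of the 2×2, stride-2 case with TensorFlow (the input values are arbitrary):

```python
import tensorflow as tf

# 2x2 max pooling with stride 2 over a single 4x4 feature map.
x = tf.constant([[1., 3., 2., 4.],
                 [5., 6., 7., 8.],
                 [9., 2., 1., 0.],
                 [3., 4., 6., 5.]])
x = tf.reshape(x, (1, 4, 4, 1))  # (batch, height, width, channels)
pooled = tf.nn.max_pool2d(x, ksize=2, strides=2, padding="VALID")
print(tf.reshape(pooled, (2, 2)))
# [[6. 8.]
#  [9. 6.]]
```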

Code sample

The sketch below uses TensorFlow's Keras Sequential API to implement the classifier pipeline shown above. The 32×32×3 input shape, filter counts, and 10-class output are illustrative choices (e.g. CIFAR-10), not fixed requirements.
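```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Conv + ReLU -> Pool, twice, then Flatten -> Fully Connected -> Softmax.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # one output per class
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```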

Approaches to object detection

Object detection is concerned with both identifying objects within an image and computing their bounding boxes (x, y, w, h). One naïve approach would be to draw numerous random bounding boxes within an input image, run a classifier within each box (proposal), and return the proposals with high classification scores. The issue is that this generates far too many candidate regions to evaluate, making the computation prohibitively slow.

A 2013 paper by Girshick et al. introduced R-CNN, which used a graph-based region proposal method called selective search to significantly reduce the number of model inputs. However, even with this reduced complexity, R-CNN proved too slow for real-time use in most cases.

Today, one of the most popular approaches to object detection is Faster R-CNN, which instead learns region proposals directly via a Region Proposal Network (RPN). The RPN shares convolutional features with the downstream classifier, creating a single pipeline: each image passes through the convolutional backbone only once, yielding much faster inference.

Semantic segmentation

Rather than generating bounding boxes, semantic segmentation models classify each pixel of an image using fully convolutional networks (FCNs). In this architecture, which is made up entirely of convolutional layers, input images are progressively downsampled into a bottleneck, then upsampled to form the per-pixel prediction map.
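As a rough sketch of this encoder-decoder shape (the layer sizes, strides, and 21-class output are illustrative assumptions, not taken from any particular paper):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 21  # illustrative, e.g. PASCAL VOC's 20 classes + background

# Downsample to a bottleneck with strided convolutions, then upsample
# with transposed convolutions to per-pixel class probabilities.
model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),   # 64x64
    layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),   # 32x32 bottleneck
    layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu"),  # 64x64
    layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),  # 128x128
    layers.Conv2D(NUM_CLASSES, 1, activation="softmax"),  # per-pixel class scores
])
model.summary()  # output shape: (None, 128, 128, NUM_CLASSES)
```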