L3: Convolutional Neural Networks

Using spatial structure · Pooling · Code sample · Approaches to object detection · Semantic segmentation
Vision can be described as the ability to know what is where by looking. Humans are adept at seeing the world, processing our surroundings in just a fraction of a second, but to computers, vision is immensely complex.
Luckily, computer vision technology has nonetheless progressed far enough to be deployed to smartphones, cars, medical devices, and robots. This has largely been due to advancements in machine learning, specifically in convolutional neural networks (CNNs).
New York City Street Scene Photograph by Darren Martin
How can we classify objects within the above image? For each class of object that we wish to identify, we can use prior domain knowledge to identify high-level features for each object. For pedestrians, this might include a head, hands, and legs. For cars, this might include tires, lights, and windows.
There are several issues with manually defining high-level features, however. Objects can rarely be simplified into explicit collections of features, and in designing object detection models we frequently encounter complications such as viewpoint variation, deformation, occlusion, and intraclass variation:
Li/Johnson/Yeung, CS231n.
Rather than relying on manually-defined high-level features, we can design models that learn hierarchies of features directly from the training data.
Lee et al., ICML 2009
Early machine learning applications for computer vision attempted to feed image data straight into fully connected networks. This meant that 2D images (often with a third dimension for color channels) were flattened into a 1D input vector before being fed through the hidden layers.
This early approach had two major issues. First, the fully connected structure created a huge number of parameters. For example, even a tiny 100x100 image fed into a single hidden layer with five neurons results in 50,005 parameters: 10,000 pixel inputs, each connected to all five neurons (50,000 weights), plus one bias per neuron.
Second, the flattening operation threw any spatial context from the image out the window. Relationships between objects across the 2D image plane were erased.
To fix these limitations, we can replace the naïve fully-connected architecture with a "patch" operation. For each feature that we wish to identify in the input image, we construct a convolutional filter, often called a kernel.
Michael Plotke, CC BY-SA 3.0, via Wikimedia Commons
We then apply the filter in a moving window across each pixel in the input image:

$$y_{ij} = \sum_{p=1}^{m} \sum_{q=1}^{n} w_{pq}\, x_{i+p-1,\, j+q-1} + b$$

where our kernel has weights $w_{pq}$ and bias $b$, for a patch of size $m \times n$.
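This sliding-window sum can be sketched in a few lines of NumPy (valid padding and stride 1 assumed; the function name is illustrative):

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    m, n = kernel.shape
    h, w = image.shape
    # output shrinks because we only place the kernel fully inside the image
    out = np.zeros((h - m + 1, w - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise product of the kernel with the patch beneath it
            out[i, j] = np.sum(kernel * image[i:i+m, j:j+n]) + bias
    return out
```

In practice frameworks implement this far more efficiently, but the loop makes the patch operation explicit.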
Let's not forget that just as in fully connected networks, we must apply a nonlinearity (typically a ReLU) after each convolution operation.
A typical CNN classifier looks something like this: convolutional layers (each followed by a nonlinearity) alternate with pooling layers, and a final set of fully connected layers produces the class scores.
Pooling is an operation performed after convolution that downsamples the feature map while providing a degree of spatial invariance. One simple operation is max pooling, which selects the maximum value within each patch, given a patch size and stride.
Max pooling of size 2x2 with stride 2. Papers With Code.
Similarly, we can take the mean within each patch (mean pooling), the median, and so on.
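Non-overlapping max pooling (stride equal to the pool size, dimensions assumed divisible by it) reduces to a reshape and a max in NumPy; a minimal sketch:

```python
import numpy as np

def max_pool(x, size=2):
    h, w = x.shape
    # split the array into a grid of size-by-size patches,
    # then take the maximum of each patch
    patches = x.reshape(h // size, size, w // size, size)
    return patches.max(axis=(1, 3))
```

Swapping `max` for `mean` gives mean pooling with the same structure.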
The below sample uses TensorFlow's Sequential API.
import tensorflow as tf

def generateModel():
    model = tf.keras.Sequential([
        # First convolutional layer
        tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # Second convolutional layer
        tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # Fully connected classifier
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')  # 10 output classes
    ])
    return model
Object detection is concerned with both identifying objects within an image and localizing them with bounding boxes.
A 2013 paper by Girshick et al. introduced R-CNN, which used a graph-based region proposal method called selective search to significantly reduce the number of candidate regions fed to the network. However, even with the reduced complexity, R-CNN turned out to be too slow for most real-time uses.
Today, one of the most popular approaches to object detection is Faster R-CNN, which instead learns region proposals directly: a region proposal network shares convolutional features with the downstream classifier, forming a single pipeline. Because each image passes through the shared feature extractor only once, inference is fast enough for near-real-time use.
Rather than generate bounding boxes, semantic segmentation models classify each pixel of an image using fully convolutional networks (FCNs). In this architecture, which is made up entirely of convolutional layers, input images are progressively downsampled into a bottleneck, then upsampled to form the prediction map.
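A minimal Keras sketch of this downsample-then-upsample pattern (the layer widths and the 10-class output are illustrative choices, not taken from a specific paper):

```python
import tensorflow as tf

def generateFCN(num_classes=10):
    return tf.keras.Sequential([
        # Downsampling path: strided convolutions shrink the feature map
        tf.keras.layers.Conv2D(32, 3, strides=2, padding='same', activation='relu'),
        tf.keras.layers.Conv2D(64, 3, strides=2, padding='same', activation='relu'),
        # Upsampling path: transposed convolutions restore the resolution
        tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu'),
        # Per-pixel class scores at the original resolution
        tf.keras.layers.Conv2DTranspose(num_classes, 3, strides=2, padding='same', activation='softmax'),
    ])
```

Because there are no Dense layers, the same network can be applied to inputs of any resolution, and the output has one class distribution per pixel.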