Machine Learning by Tutorials

7. Going Convolutional
Written by Matthijs Hollemans

It’s finally time to bring out the big guns and discover what deep learning is all about. In this chapter, you’ll convert the basic neural network into something that works much better on images. The secret ingredient is the convolutional layer.

Got GPU?

Having a GPU is no longer a luxury. Unfortunately, Keras and TensorFlow do not support Mac GPUs yet. Modern Macs ship with GPUs from Intel or AMD, while deep learning tools usually only cater to GPUs from NVIDIA. Older Macs may still have an NVIDIA GPU on board, but these are often too old. Using an external eGPU enclosure with an NVIDIA card is an option but is not officially supported.

Most machine-learning practitioners train their models on a PC running Linux that has one or more NVIDIA GPUs, or in the cloud. The author has built a Linux PC with a GTX 1080 Ti GPU specifically for this purpose. If you’re serious about deep learning, this is an expense worth making.

If all you have is a Mac, you’ll need a lot of patience to train the models in this chapter. Because we want everyone to be able to follow along, the book’s download includes the full Jupyter notebooks that were used to train the models, as well as the final trained version, so you can skip training the models if your computer isn’t up to the task.

Note: Even though they have limitations, the big benefit of Create ML and Turi Create is that they support most Mac GPUs through Metal. No big surprise there, as both are provided by Apple. Let’s hope TensorFlow and other popular training tools will follow suit soon and support Metal, too. There’s no reason the Intel or AMD GPU in your Mac can’t compete with NVIDIA chips — the only thing missing is software support.

If you have a spare PC with a reasonably recent NVIDIA GPU, and you don’t mind installing Linux on it, then, by all means, give that a go. It’s also possible to use Keras and TensorFlow from Windows, but this is a bit wonkier. We suggest using Ubuntu, the most popular Linux distribution for machine learning.

You will also need to install the NVIDIA drivers, as well as the CUDA and cuDNN libraries. See for more details. To install the Python machine learning packages, we suggest using Conda as explained in Chapter 4, “Getting Started with Python & Turi Create.” The process is very similar on Linux and Windows.

Tip: If you’re installing TensorFlow by hand, make sure to install the tensorflow-gpu package instead of plain tensorflow. You can change this in kerasenv.yaml or run pip install -U tensorflow-gpu. Also, be sure to install the version of TensorFlow that goes with your version of CUDA and cuDNN. If these versions don’t match up, TensorFlow won’t work. Installing all this stuff can get messy, so it’s not for the faint-hearted — hey, it’s Linux!

Your head in the clouds?

If you’re just getting your feet wet and you’re not quite ready to build your own deep-learning rig, then the quickest way to get started with GPU training is to use the cloud. You can even use some of these cloud services for free!

Convolution layers

The models you’ve built in Keras have, so far, consisted of Dense layers, which take a one-dimensional vector as input. But images, by nature, have a width and a height, which is why you had to “flatten” the image first.

Convolution, say what now?

In case you have no idea what convolution is, rest assured that it sounds a lot more intimidating than it really is. Again, what it comes down to are dot products.

The convolution window slides over the image, left to right, top to bottom

y[i,j] = w[0,0]*x[i-1,j-1] + w[0,1]*x[i-1,j] + w[0,2]*x[i-1,j+1]
       + w[1,0]*x[i,  j-1] + w[1,1]*x[i,  j] + w[1,2]*x[i,  j+1]
       + w[2,0]*x[i+1,j-1] + w[2,1]*x[i+1,j] + w[2,2]*x[i+1,j+1]
       + bias
Each step computes a single output value from the 3×3 window at the center pixel
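
The formula above can be sketched in plain NumPy. This is only an illustration, not how Keras implements convolution: the function name conv3x3 is hypothetical, the image has a single channel, and zero padding around the border is an assumption made so that edge pixels also get a full 3×3 window.

```python
import numpy as np

def conv3x3(x, w, bias=0.0):
    """Slide a 3x3 window over a 2-D image x; the output has the same size as x.

    Zero padding is an assumption of this sketch, so that border pixels
    also have a complete 3x3 neighbourhood.
    """
    height, width = x.shape
    padded = np.pad(x, 1, mode="constant")  # one ring of zeros around the image
    y = np.zeros((height, width), dtype=float)
    for i in range(height):          # top to bottom
        for j in range(width):       # left to right
            window = padded[i:i + 3, j:j + 3]      # 3x3 patch centred on x[i, j]
            y[i, j] = np.sum(w * window) + bias    # dot product of weights and patch
    return y

x = np.arange(16, dtype=float).reshape(4, 4)

identity = np.zeros((3, 3))
identity[1, 1] = 1.0                 # only the centre weight is non-zero
y_identity = conv3x3(x, identity)    # leaves the image unchanged

box = np.ones((3, 3))
y_box = conv3x3(x, box)              # each output is the sum of its 3x3 neighbourhood
```

With the identity kernel the output equals the input, which is a quick way to check that the window is centred on the right pixel.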

Multiple filters

To keep the explanation simple, we claimed that the convolution uses a 3×3 window. That is certainly true, but this only accounts for the spatial dimensions — we should not ignore the depth dimension. Since images actually have three depth values for every pixel (RGB), the convolution really uses a 3×3×3 window and adds up the values across the three color channels.

The convolution kernel is really three-dimensional

The number of filters in the convolution layer determines the depth of its output image
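
To make the depth dimension concrete, here is a rough NumPy sketch of a layer with several 3×3×3 filters. The function name conv_layer, the random weights and the tiny 8×8 stand-in image are all hypothetical, and zero padding is again an assumption; the point is only to show that every filter sums over all input channels and that the number of filters becomes the depth of the output.

```python
import numpy as np

def conv_layer(x, weights, biases):
    """x: (H, W, C_in); weights: (3, 3, C_in, C_out); biases: (C_out,)."""
    height, width, _ = x.shape
    num_filters = weights.shape[-1]
    # Pad only the two spatial axes, not the channel axis.
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="constant")
    y = np.zeros((height, width, num_filters))
    for i in range(height):
        for j in range(width):
            window = padded[i:i + 3, j:j + 3, :]   # 3 x 3 x C_in patch
            # Contract rows, columns and channels at once: one number per filter.
            y[i, j, :] = np.tensordot(window, weights, axes=3) + biases
    return y

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                  # small stand-in for an RGB image
weights = rng.standard_normal((3, 3, 3, 32))   # 32 filters of size 3x3x3
biases = np.zeros(32)

out = conv_layer(image, weights, biases)
print(out.shape)  # (8, 8, 32)
```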

Your first convnet in Keras

In a new Jupyter notebook, create the following cells. You can also follow along with ConvNet.ipynb.

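import numpy as np
from keras.models import Sequential
from keras.layers import *
from keras import optimizers

%matplotlib inline
import matplotlib.pyplot as plt

image_width = 224
image_height = 224
num_classes = 20

model = Sequential()
model.add(Conv2D(32, 3, padding="same", activation="relu",
                 input_shape=(image_height, image_width, 3)))
model.add(Conv2D(32, 3, padding="same", activation="relu"))
# The pooling and classifier layers are missing from this excerpt. They are
# reconstructed here from the model.summary() output shown further on; the
# pool size of 2 and the softmax activation are assumptions.
model.add(MaxPooling2D(2))
model.add(Conv2D(64, 3, padding="same", activation="relu"))
model.add(Conv2D(64, 3, padding="same", activation="relu"))
model.add(MaxPooling2D(2))
model.add(Conv2D(128, 3, padding="same", activation="relu"))
model.add(Conv2D(128, 3, padding="same", activation="relu"))
model.add(MaxPooling2D(2))
model.add(Conv2D(256, 3, padding="same", activation="relu"))
model.add(Conv2D(256, 3, padding="same", activation="relu"))
model.add(GlobalAveragePooling2D())
model.add(Dense(num_classes))
model.add(Activation("softmax"))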

The flow of the tensors

You can see what happens to the shape of the data in the output of model.summary(). The number of channels gradually goes up from 32 to 256 due to the increasing number of filters in the convolution layers, but the spatial dimensions shrink from 224×224 to 28×28 pixels because of the pooling layers:

Layer (type)                 Output Shape              Param #
conv2d_1 (Conv2D)            (None, 224, 224, 32)      896
conv2d_2 (Conv2D)            (None, 224, 224, 32)      9248
max_pooling2d_1 (MaxPooling2 (None, 112, 112, 32)      0
conv2d_3 (Conv2D)            (None, 112, 112, 64)      18496
conv2d_4 (Conv2D)            (None, 112, 112, 64)      36928
max_pooling2d_2 (MaxPooling2 (None, 56, 56, 64)        0
conv2d_5 (Conv2D)            (None, 56, 56, 128)       73856
conv2d_6 (Conv2D)            (None, 56, 56, 128)       147584
max_pooling2d_3 (MaxPooling2 (None, 28, 28, 128)       0
conv2d_7 (Conv2D)            (None, 28, 28, 256)       295168
conv2d_8 (Conv2D)            (None, 28, 28, 256)       590080
global_average_pooling2d_1 ( (None, 256)               0
dense_1 (Dense)              (None, 20)                5140
activation_1 (Activation)    (None, 20)                0

Total params: 1,177,396
Trainable params: 1,177,396
Non-trainable params: 0
Each filter reads all input channels and produces one output channel
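
The numbers in the Param # column can be checked by hand. The sketch below, using the hypothetical helpers conv_params and dense_params, assumes each filter holds one weight per kernel position per input channel plus a single bias, which matches the counts in the summary. It also traces the spatial size through the three pooling layers.

```python
def conv_params(kernel_size, in_channels, out_channels):
    # Per filter: kernel_size * kernel_size * in_channels weights plus one bias.
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

def dense_params(in_features, out_features):
    # One weight per input/output pair plus one bias per output.
    return in_features * out_features + out_features

# Channel depth before and after each of the eight conv layers.
channels = [3, 32, 32, 64, 64, 128, 128, 256, 256]
conv_counts = [conv_params(3, c_in, c_out)
               for c_in, c_out in zip(channels, channels[1:])]
total = sum(conv_counts) + dense_params(256, 20)

# Each of the three pooling layers halves the width and height.
spatial_size = 224
for _ in range(3):
    spatial_size //= 2

print(conv_counts)   # [896, 9248, 18496, 36928, 73856, 147584, 295168, 590080]
print(total)         # 1177396
print(spatial_size)  # 28
```

The pooling layers and the global average pooling layer contribute no parameters, which is why they show 0 in the summary.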

More about pooling

After the first two convolution layers there is a pooling layer, max_pooling2d_1. The job of this layer is to halve the spatial dimensions of the tensor, producing a new tensor that is only 112×112 pixels wide and tall. The number of channels stays the same, 32.

Max pooling reduces each 2×2 pixels to a single number
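
As a minimal sketch of what that means, the NumPy function below (a hypothetical helper named max_pool_2x2, assuming an even width and height) keeps only the largest value from every non-overlapping 2×2 block and leaves the channels untouched.

```python
import numpy as np

def max_pool_2x2(x):
    """x: (H, W, C) with even H and W; returns an array of shape (H/2, W/2, C)."""
    height, width, channels = x.shape
    # Split each spatial axis into (number of blocks, 2), then take the
    # maximum inside every 2x2 block.
    blocks = x.reshape(height // 2, 2, width // 2, 2, channels)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = max_pool_2x2(x)
print(pooled[..., 0])
# [[ 5.  7.]
#  [13. 15.]]

big = np.zeros((224, 224, 32))
print(max_pool_2x2(big).shape)  # (112, 112, 32)
```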

The detected features

Following the max pooling layer are two more conv layers, this time with 64 output channels, and then there is another pooling layer, followed by two more conv layers. The model repeats this pattern a few times. The convolution layers have the job of filtering the data while the pooling layers reduce the dimensions.

The learned weights for the first conv layer

Feeling hot hot hot

Back to that very last convolution layer that outputs a 28×28×256 tensor. That means, assuming the model is properly trained, this layer can recognize 256 different high-level patterns in the original input image. Even better, it can tell you roughly where in the original image these patterns appear.

A channel from the final tensor represented as a heatmap
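
One rough way to read such a heatmap in code: since 224 / 28 = 8, each cell of the 28×28 map lines up with an 8×8 block of input pixels. The sketch below uses a hypothetical helper, strongest_location, and a made-up feature map; it assumes a square map and ignores that the true receptive field of a cell is larger than 8×8, so treat the result as an approximate location only.

```python
import numpy as np

def strongest_location(feature_map, channel, input_size=224):
    """Return (top, left, bottom, right) of the input block that lines up
    with the strongest activation in one channel of the feature map."""
    heat = feature_map[:, :, channel]
    row, col = np.unravel_index(np.argmax(heat), heat.shape)
    scale = input_size // heat.shape[0]     # 224 // 28 = 8 input pixels per cell
    top, left = int(row) * scale, int(col) * scale
    return top, left, top + scale, left + scale

# Made-up activations: channel 10 fires at row 3, column 5 of the 28x28 map.
feature_map = np.zeros((28, 28, 256))
feature_map[3, 5, 10] = 1.0

box = strongest_location(feature_map, channel=10)
print(box)  # (24, 40, 32, 48)
```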

Honey, I shrunk the tensors!

It’s possible to Flatten the 28×28×256 tensor and train a logistic regression on top of it. That would turn the tensor into a 200,704-element vector. Recall from the last chapter that the logistic regression already had a hard enough time with just 3,072 features, let alone two-hundred thousand…

Global average pooling
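
The difference between flattening and global average pooling is easy to see in NumPy. In this sketch the feature values are random placeholders; only the shapes matter.

```python
import numpy as np

# Placeholder activations with the shape of the last conv layer's output.
features = np.random.default_rng(0).random((28, 28, 256))

flattened = features.reshape(-1)        # every value becomes its own feature
pooled = features.mean(axis=(0, 1))     # one average per channel

print(flattened.shape)  # (200704,)
print(pooled.shape)     # (256,)
```

Global average pooling throws away the position of each pattern and keeps only how strongly it is present on average, which is what shrinks the classifier's input from 200,704 values to 256.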

Training the model

The model you’ve built in the previous sections is a typical convnet design and, although not necessarily optimal, it’s a good start. Let’s see how well this model learns.

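images_dir = "snacks/"
train_data_dir = images_dir + "train/"
val_data_dir = images_dir + "val/"
test_data_dir = images_dir + "test/"

def normalize_pixels(image):
    # Map pixel values from [0, 255] to [-1, 1].
    return image / 127.5 - 1

# NOTE: the arguments of the calls below are cut off in this excerpt. Every
# line marked "assumed" is a plausible completion, not the book's exact code.
from keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(
            preprocessing_function=normalize_pixels)  # assumed

batch_size = 64

train_generator = datagen.flow_from_directory(
                    train_data_dir,                    # assumed
                    target_size=(image_height, image_width),  # Keras expects (height, width)
                    batch_size=batch_size)             # assumed

val_generator = datagen.flow_from_directory(
                    val_data_dir,                      # assumed
                    target_size=(image_height, image_width),
                    batch_size=batch_size,             # assumed
                    shuffle=False)                     # assumed

test_generator = datagen.flow_from_directory(
                    test_data_dir,                     # assumed
                    target_size=(image_height, image_width),
                    batch_size=batch_size,             # assumed
                    shuffle=False)                     # assumed

index2class = {v:k for k,v in
               train_generator.class_indices.items()}  # assumed

# The model must already be compiled; that step is not part of this excerpt.
histories = []
history = model.fit_generator(
            train_generator,                           # assumed
            steps_per_epoch=len(train_generator),      # assumed
            validation_data=val_generator,             # assumed
            validation_steps=len(val_generator),       # assumed
            epochs=5)                                  # assumed; the real count is not shown
histories.append(history)                              # assumed; combine_histories() reads this list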
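
As a quick sanity check on the preprocessing, the snippet below repeats normalize_pixels from the code above and applies it to three sample pixel values. The sample array is made up for illustration.

```python
import numpy as np

def normalize_pixels(image):
    # Same formula as in the chapter: scale [0, 255] to [-1, 1].
    return image / 127.5 - 1

pixels = np.array([0.0, 127.5, 255.0])   # darkest, middle and brightest value
normalized = normalize_pixels(pixels)
print(normalized)                         # the values -1, 0 and 1
```

Black maps to -1, mid-gray to 0 and white to 1, so the inputs to the network are centered around zero.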

Going dooooown?

To make a plot of the loss over time, do the following:

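def combine_histories():
    history = {
        "loss": [],
        "val_loss": [],
        "acc": [],
        "val_acc": []
    }
    # Concatenate the metrics from every training run into one long list.
    for h in histories:
        for k in history.keys():
            history[k] += h.history[k]
    return history

history = combine_histories()

def plot_loss(history):
    fig = plt.figure(figsize=(10, 6))
    # The plt.plot calls are cut off in this excerpt; these two are assumed.
    plt.plot(history["loss"])
    plt.plot(history["val_loss"])
    plt.legend(["Train", "Validation"])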

The training and validation loss curves

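def plot_accuracy(history):
    fig = plt.figure(figsize=(10, 6))
    # The plt.plot calls are cut off in this excerpt; these two are assumed.
    plt.plot(history["acc"])
    plt.plot(history["val_acc"])
    plt.legend(["Train", "Validation"])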

The training and validation accuracy over time

Learning rate annealing

One trick you can use to give the accuracy a little boost is to change the learning rate. It is currently 1e-3 or 0.001 (set when you compiled the model), and you can change it by doing the following:

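import keras.backend as K
# Reconstructed from a garbled line: divide the current learning rate by 10.
K.set_value(model.optimizer.lr,
            K.get_value(model.optimizer.lr) / 10)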
The loss after lowering the learning rate
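
If you repeat this trick several times, the learning rate follows a simple step schedule. The pure-Python sketch below uses a hypothetical helper, annealed_learning_rate, and assumes you divide by 10 at every step, as in the code above; how many steps to take and when to take them is something you would decide by watching the loss curves.

```python
def annealed_learning_rate(initial_lr, num_drops, factor=10.0):
    """Learning rate after dividing initial_lr by factor, num_drops times."""
    return initial_lr / (factor ** num_drops)

initial_lr = 1e-3   # the value the chapter starts with
schedule = [annealed_learning_rate(initial_lr, k) for k in range(3)]

for phase, lr in enumerate(schedule):
    print(f"training phase {phase}: learning rate {lr:.0e}")
```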

It’s better… but not good enough yet

It’s clear that you were able to create a much better model using these convolutional layers than with only Dense layers. The final test set accuracy for this model is about 40%, compared to only 15% from the last chapter. That’s a big improvement!

Key points

Where to go from here?

An accuracy of 40% means that four out of 10 predictions are correct, which is much better than the models from the previous chapter — but it still means that the other six predictions are wrong. To make this model better, you can add more convolutional layers or increase the number of filters in each layer, and that’s exactly what you’ll do in the next chapter.

© 2023 Kodeco Inc.
