Machine Learning by Tutorials

9. Beyond Classification
Written by Matthijs Hollemans

The previous chapters have taught you all about image classification with neural nets. But neural networks can be used for many other computer vision tasks. In this chapter and the next, you’ll look at two advanced examples:

  • Object detection: find multiple objects in an image.
  • Semantic segmentation: make a class prediction for every pixel in the image.

Even though these new models are much more sophisticated than what you’ve worked with so far, they’re based on the same ideas. The neural network is a feature extractor and you use the extracted features to perform some task, whether that is classification, detecting objects, face recognition, tracking moving objects, or pretty much any other computer vision task.

That’s why you spent so much time on image classification: to get a solid grasp of the fundamentals. But now it’s time to take things a few steps further…

Where is it?

Classification tells you what is in the image, but it only ever considers the image as a whole. It works best when the picture contains a single thing of interest. If your classifier is trained to tell cats from dogs and the image contains both a cat and a dog, then the answer is anyone’s guess.

An object detection model has no problem dealing with such images. The goal of object detection is to find all the objects inside an image, even if they are of different types. You can think of it as a classifier for specific image regions.

An object detector can find all your furry friends

The object detector not only finds what the objects are but also where they are located in the image. It does this by predicting one or more bounding boxes, which are simply rectangular regions in the image.

A bounding box is described by four numbers, representing either the corner points of the rectangle or the center point plus a width and height:

The two types of bounding boxes

Both types are used in practice, but this chapter uses the one with the corner points.
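
The two representations carry the same information, so converting between them is simple arithmetic. Here is a minimal sketch, assuming the coordinate order (x_min, x_max, y_min, y_max) that this chapter's annotations use. The function names corners_to_center and center_to_corners are hypothetical and are not part of the book's helpers.

```python
def corners_to_center(x_min, x_max, y_min, y_max):
    """Convert corner coordinates to (center_x, center_y, width, height)."""
    width = x_max - x_min
    height = y_max - y_min
    return (x_min + width / 2, y_min + height / 2, width, height)


def center_to_corners(center_x, center_y, width, height):
    """Convert center/size coordinates back to (x_min, x_max, y_min, y_max)."""
    return (center_x - width / 2, center_x + width / 2,
            center_y - height / 2, center_y + height / 2)


# Round trip on an example box given in normalized coordinates.
box = (0.25, 0.75, 0.5, 1.0)
center = corners_to_center(*box)
restored = center_to_corners(*center)
print(center)
print(restored)
```

Whichever form a dataset or model uses, it pays to check the coordinate order before mixing code from different sources, since a swapped x_max and y_min still produces four plausible-looking numbers.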

Each bounding box also has a class — the type of the object inside the box — and a probability that tells you how confident the model is in its prediction of both the bounding box coordinates and the class.

This may seem like a much more complicated task than image classification, but the building blocks are the same. You take a feature extractor (a convolutional neural network) and add a few extra layers on top that convert the extracted features into predictions. The difference is that this time, the model predicts not just the class but also the bounding box coordinates.

Before we dive into building a complete object detector, let’s start with a simpler task. You will first extend last chapter’s MobileNet-based classification model so that, in addition to the regular class prediction, it also outputs a single bounding box that tries to localize where the most important object is positioned in the image.

Just predict one bounding box, how hard could it be? (Answer: It’s actually easier than you might think.)

The ground-truth will set you free

First, we should revisit the dataset.

import os, sys
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

data_dir = "snacks"
train_dir = os.path.join(data_dir, "train")
val_dir = os.path.join(data_dir, "val")
test_dir = os.path.join(data_dir, "test")
path = os.path.join(data_dir, "annotations-train.csv")
train_annotations = pd.read_csv(path)
The first five lines of annotations-train.csv

val_annotations = pd.read_csv(os.path.join(data_dir,
test_annotations = pd.read_csv(os.path.join(data_dir,

Show me the data!

Now, let’s have a proper look at these bounding boxes. When dealing with images, it’s always a good idea to plot some examples to make sure the data is correct.

image_width = 224
image_height = 224

from helpers import plot_image
image_id      009218ad38ab2010
x_min                  0.19262
x_max                 0.729831
y_min                 0.127606
y_max                 0.662219
class_name                cake
folder                    cake
Name: 0, dtype: object
from keras.preprocessing import image

def plot_image_from_row(row, image_dir):
    # Load the image from "folder/image_id.jpg"
    image_path = os.path.join(image_dir, row["folder"],
                              row["image_id"] + ".jpg")
    img = image.load_img(image_path,
                         target_size=(image_height, image_width))

    # Put the box coordinates and class name into a tuple
    bbox = (row["x_min"], row["x_max"],
            row["y_min"], row["y_max"], row["class_name"])
    # Draw the bounding box on top of the image
    plot_image(img, [bbox])
annotation = train_annotations.iloc[0]
plot_image_from_row(annotation, train_dir)
The ground-truth box for row 0, cake (left) and row 2, ice cream (right)
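
The coordinates in the annotations lie between 0 and 1, which suggests they are fractions of the image width and height rather than pixels. Under that assumption, a drawing routine has to scale them to the image size first. The following sketch shows that step for row 0; to_pixels is a hypothetical name, and this is not necessarily how the book's plot_image helper does it.

```python
def to_pixels(bbox, image_width, image_height):
    """Scale a normalized (x_min, x_max, y_min, y_max) box to whole pixels."""
    x_min, x_max, y_min, y_max = bbox
    return (round(x_min * image_width), round(x_max * image_width),
            round(y_min * image_height), round(y_max * image_height))


# Coordinates of row 0 (the cake image) from the annotations above.
cake_box = (0.19262, 0.729831, 0.127606, 0.662219)
pixel_box = to_pixels(cake_box, image_width=224, image_height=224)
print(pixel_box)
```

Keeping the stored coordinates normalized means the same annotation stays valid no matter what size the image is resized to before it goes into the network.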

What about images without annotations?

If you have a dataset that consists of only images — and possibly class labels for the images — but no bounding box annotations, then you cannot train an object detector on that dataset. Not gonna happen; ain’t no two ways about it.

Your own generator

Previously, you used ImageDataGenerator and flow_from_directory() to automatically load the images and put them into batches for training. That is convenient when your images are neatly organized into folders, but the new training data consists of a Pandas DataFrame with bounding box annotations. You’ll need a way to read the rows from this dataframe into a batch. Fortunately, Keras lets you write your own custom generator.
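
At its core, a generator of this kind has to answer two questions: how many batches are there, and what does batch number index contain? The sketch below illustrates that protocol with plain Python lists so that it runs without Keras or any image files. ToyBatchGenerator and its fake rows are hypothetical stand-ins, not the BoundingBoxGenerator from the book's helpers.

```python
import random


class ToyBatchGenerator:
    """Mimics the batching protocol of a Keras Sequence with plain lists.

    Hypothetical stand-in: `rows` is a list of (image, class_id, bbox)
    tuples instead of a DataFrame plus image files on disk.
    """

    def __init__(self, rows, batch_size, shuffle=True, seed=0):
        self.rows = rows
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)
        self.on_epoch_end()

    def __len__(self):
        # Number of full batches; a trailing partial batch is dropped.
        return len(self.rows) // self.batch_size

    def __getitem__(self, index):
        # Pick the row indices that belong to batch number `index`.
        start = index * self.batch_size
        batch = [self.rows[i]
                 for i in self.order[start:start + self.batch_size]]
        X = [image for image, _, _ in batch]
        y_class = [class_id for _, class_id, _ in batch]
        y_bbox = [bbox for _, _, bbox in batch]
        return X, [y_class, y_bbox]

    def on_epoch_end(self):
        # Reset the visiting order and reshuffle it for the next epoch.
        self.order = list(range(len(self.rows)))
        if self.shuffle:
            self.rng.shuffle(self.order)


# Seven fake rows with a batch size of 3 give two full batches.
rows = [("img%d" % i, i % 2, (0.0, 1.0, 0.0, 1.0)) for i in range(7)]
generator = ToyBatchGenerator(rows, batch_size=3, shuffle=False)
X, (y_class, y_bbox) = generator[0]
print(len(generator), X, y_class)
```

Note that each batch returns two targets, one list of class labels and one list of boxes, because the model you are about to build has two outputs.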

from helpers import BoundingBoxGenerator

batch_size = 32
train_generator = BoundingBoxGenerator(
train_iter = iter(train_generator)
X, (y_class, y_bbox) = next(train_iter)
array([[ 0.348343,  0.74359 ,  0.55838 ,  0.936911],
       [ 0.102564,  0.746717,  0.062909,  0.93219 ],
       [ 0.      ,  1.      ,  0.135843,  0.98036 ],
       [ 0.448405,  0.978111,  0.288574,  0.880734],
array([ 9, 16, 12,  7,  8, 18, 10,  1, 14,  2,  7, 17, ...])
from helpers import labels
list(map(lambda x: labels[x], y_class))
class BoundingBoxGenerator(keras.utils.Sequence):
    def __len__(self):
        return len(self.df) // self.batch_size

    def __getitem__(self, index):
        # ... code omitted ...
        return X, [y_class, y_bbox]

    def on_epoch_end(self):
        self.rows = np.arange(len(self.df))
        if self.shuffle:
def plot_image_from_batch(X, y_class, y_bbox, img_idx):
    class_name = labels[y_class[img_idx]]
    bbox = y_bbox[img_idx]
    plot_image(X[img_idx], [[*bbox, class_name]])

plot_image_from_batch(X, y_class, y_bbox, 0)
The generator seems to be working!

X, (y_class, y_bbox) = next(train_iter)

A simple localization model

You’re now going to extend the existing MobileNet snacks classifier so that it has the ability to predict a bounding box as well as a class label.

import keras
from keras.models import Sequential
from keras.layers import *
from keras.models import Model, load_model
from keras import optimizers, callbacks
import keras.backend as K

checkpoint = "checkpoints/multisnacks-0.7162-0.8419.hdf5"
classifier_model = load_model(checkpoint)
num_classes = 20

# The MobileNet feature extractor is the first "layer".
base_model = classifier_model.layers[0]

# Add a global average pooling layer after MobileNet.
pool = GlobalAveragePooling2D()(base_model.outputs[0])

# Reconstruct the classifier layers.
clf = Dropout(0.7)(pool)
clf = Dense(num_classes, kernel_regularizer=regularizers.l2(0.01),
clf = Activation("softmax", name="class_prediction")(clf)
bbox = Conv2D(512, 3, padding="same")(base_model.outputs[0])
bbox = BatchNormalization()(bbox)
bbox = Activation("relu")(bbox)
bbox = GlobalAveragePooling2D()(bbox)
bbox = Dense(4, name="bbox_prediction")(bbox)
model = Model(inputs=base_model.inputs, outputs=[clf, bbox])
for layer in base_model.layers:
    layer.trainable = False
from keras.utils import plot_model
plot_model(model, to_file="bbox_model.png")
The model branches into two outputs

layer_dict = { for i, layer in enumerate(model.layers)}

# Get the weights from the checkpoint model.
weights, biases = classifier_model.layers[-2].get_weights()

# Put them into the new model.

The new loss function

With the definition of the model complete, you now can compile it:

model.compile(loss=["sparse_categorical_crossentropy", "mse"],
              loss_weights=[1.0, 10.0],
              metrics={ "class_prediction": "accuracy" })
mse_loss = sum( (truth - prediction)**2 ) / (4*batch_size)
loss = crossentropy_loss + mse_loss + L2_penalties
loss = 1.0*crossentropy_loss + 10.0*mse_loss + 0.01*L2_penalties
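
To make these formulas concrete, here is a small worked example in plain Python. It is only a sketch of the arithmetic: the cross-entropy value is made up, the L2 penalty is left out, and mse_loss here is a hypothetical helper rather than the function Keras runs internally.

```python
def mse_loss(truth, prediction):
    """Mean squared error over every coordinate of every box in the batch."""
    squared_errors = [(t - p) ** 2
                      for true_box, pred_box in zip(truth, prediction)
                      for t, p in zip(true_box, pred_box)]
    # len(squared_errors) equals 4 * batch_size, as in the formula above.
    return sum(squared_errors) / len(squared_errors)


# A batch of two boxes; each prediction is off by 0.1 in two coordinates.
truth = [(0.2, 0.7, 0.3, 0.6), (0.1, 0.9, 0.1, 0.9)]
prediction = [(0.3, 0.7, 0.3, 0.5), (0.1, 0.8, 0.2, 0.9)]

bbox_loss = mse_loss(truth, prediction)  # four errors of 0.1**2 over 8 values
crossentropy_loss = 0.5                  # made-up value for illustration
total = 1.0 * crossentropy_loss + 10.0 * bbox_loss
print(bbox_loss, total)
```

Because the box coordinates are small numbers, their squared errors are tiny compared to a typical cross-entropy value. Weighting the MSE term by 10.0 brings the two contributions closer to the same scale.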

Sanity checks

At this point, it’s a good idea to see what happens when you load an image and make a prediction. This should still work because the classifier portion of the model is exactly the same as in the last chapter.

from keras.applications.mobilenet import preprocess_input
from keras.preprocessing import image

img = image.load_img(train_dir + "/salad/2ad03070c5900aac.jpg",
                     target_size=(image_height, image_width))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

preds = model.predict(x)
plt.figure(figsize=(10, 5))
plt.bar(range(num_classes), preds[0].squeeze())
plt.xticks(range(num_classes), labels, rotation=90, fontsize=20)
The classifier portion of the model already works

preds = model.predict_generator(train_generator)

Train it!

Now that all the pieces are in place, training the model works just like before. Again, this model is best trained on a machine with a fast GPU. (If you have a slow computer, it’s not really worth training this model yourself.)

val_generator = BoundingBoxGenerator(val_annotations, val_dir,
                                     image_height, image_width,
                                     batch_size, shuffle=False)
from helpers import combine_histories, plot_loss, plot_bbox_loss
histories = []
Epoch 1/5
220/220 [==============================] - 14s 64ms/step - loss: 1.8093 - class_prediction_loss: 0.4749 - bbox_prediction_loss: 0.1187 - class_prediction_acc: 0.8709 - val_loss: 1.2640 - val_class_prediction_loss: 0.5931 - val_bbox_prediction_loss: 0.0522 - val_class_prediction_acc: 0.8168
history = combine_histories(histories)
Loss for the bounding box predictions in the first 5 epochs


Sorry, IOU doesn’t mean I owe you any money. The acronym stands for Intersection-over-Union, although some people call it the Jaccard index.

IOU is the intersection divided by the union of the two boxes

from helpers import iou

bbox1 = [0.2, 0.7, 0.3, 0.6, "bbox1"]
bbox2 = [0.4, 0.6, 0.2, 0.5, "bbox2"]
iou(bbox1, bbox2)
plot_image(img, [bbox1, bbox2])
IOU between two bounding boxes
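
If you are curious what such a computation involves, the following is a minimal sketch for the two boxes above, assuming the (x_min, x_max, y_min, y_max, label) layout used throughout this chapter. The name iou_sketch is hypothetical; the iou function in the book's helpers may be implemented differently.

```python
def iou_sketch(box_a, box_b):
    """IOU of two boxes given as (x_min, x_max, y_min, y_max, ...)."""
    ax_min, ax_max, ay_min, ay_max = box_a[:4]
    bx_min, bx_max, by_min, by_max = box_b[:4]

    # Width and height of the overlap; zero if the boxes do not overlap.
    overlap_w = max(0.0, min(ax_max, bx_max) - max(ax_min, bx_min))
    overlap_h = max(0.0, min(ay_max, by_max) - max(ay_min, by_min))
    intersection = overlap_w * overlap_h

    # The union counts the shared area only once.
    area_a = (ax_max - ax_min) * (ay_max - ay_min)
    area_b = (bx_max - bx_min) * (by_max - by_min)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


bbox1 = [0.2, 0.7, 0.3, 0.6, "bbox1"]
bbox2 = [0.4, 0.6, 0.2, 0.5, "bbox2"]
print(iou_sketch(bbox1, bbox2))
```

For these two boxes the overlap is 0.2 wide and 0.2 high, so the intersection is 0.04. The areas are 0.15 and 0.06, which makes the union 0.17 and the IOU roughly 0.24. Two identical boxes score 1, and boxes that do not overlap at all score 0.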

from helpers import iou, MeanIOU, plot_iou

model.compile(loss=["sparse_categorical_crossentropy", "mse"],
              loss_weights=[1.0, 10.0],
              metrics={ "class_prediction": "accuracy",
                        "bbox_prediction": MeanIOU().mean_iou })
The plot of the mean IOU

Trying out the localization model

Just to get a qualitative idea of how well the model works, a picture says more than a thousand loss curves. So, write a function that makes a prediction on an image and plots both the ground-truth bounding box and the predicted one:

def plot_prediction(row, image_dir):
    # Same as before:
    image_path = os.path.join(image_dir, row["folder"],
                              row["image_id"] + ".jpg")
    img = image.load_img(image_path,
                         target_size=(image_height, image_width))

    # Get the ground-truth bounding box:
    bbox_true = [row["x_min"], row["x_max"],
                 row["y_min"], row["y_max"],
                 row["class_name"]]

    # Make the prediction:
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    pred = model.predict(x)
    bbox_pred = [*pred[1][0], labels[np.argmax(pred[0])]]

    # Plot both bounding boxes and print the IOU:
    plot_image(img, [bbox_true, bbox_pred])   
    print("IOU:", iou(bbox_true, bbox_pred))
row_index = np.random.randint(len(test_annotations))
row = test_annotations.iloc[row_index]
plot_prediction(row, test_dir)
Pretty good!

Is this fair?

Not great, but not really wrong either

Conclusion: not bad, could be better

The good news is that it was pretty easy to make the classification model perform a second task, predicting the bounding boxes. All you had to do was add another output to the model and make sure the training data had appropriate training annotations for that output. Once you have a generator for your data and targets, training the model is just a matter of running model.fit_generator().

Key points

© 2023 Kodeco Inc.
