Machine Learning by Tutorials

10. YOLO & Semantic Segmentation
Written by Matthijs Hollemans

You’ve seen how easy it was to add a bounding box predictor to the model: simply add a new output layer that predicts four numbers. But it was also pretty limited — this model only predicts the location for a single object. It doesn’t work so well when there are multiple objects of interest in the image.

You might think that you could just add more of these output layers, or perhaps predict 8 numbers for two bounding boxes, or 12 for three bounding boxes, etc. Good try, but unfortunately that doesn’t work so well in practice.

Each bounding box predictor will end up learning the same thing and, as a result, makes the same predictions. Instead of finding the locations of multiple objects, such a model will predict the same bounding box multiple times. And chances are, these bounding boxes will not actually enclose any of the objects but all end up somewhere in the middle of the image as a compromise.

To make a proper object detector, you need to encourage the different bounding box predictors to learn different things.

An old-school approach to object detection is to divide up the input image into many smaller, partially overlapping regions of different sizes, and then run a regular image classifier on each of these regions. This definitely works, but it gives a lot of duplicate detections. Even worse: It’s really slow. You need to run the classifier many, many, many times for each image.

A slightly smarter approach is to first try and figure out which parts of the image are potential regions of interest. This is the approach taken by the popular R-CNN family of models. The classifier is still run on multiple image regions, but now only on regions that are at least somewhat likely to have an object in them.

To predict which regions are potentially interesting, the “Faster R-CNN” model uses a Region Proposal Network, which sounds impressive but is really just a bunch of layers on top of the feature extractor — hey, what did you expect? Unfortunately, even though it has “Faster” in its name, this model is still on the slow side and not really suitable for mobile devices.

For speed freaks and mobile device users, the so-called single-stage detectors are very appealing. As the name implies, these model types just run the classifier once on the input image and do all of the work in a single pass. Examples of single-stage object detectors are YOLO (You Only Look Once), SSD (Single Shot multi-box Detector) and DetectNet.

Turi Create lets you train a YOLO model with just a few lines of code, so that’s what you’ll do next.

Single stage detectors

The simplest form of a single stage detector, and the one you’ll be training, looks like this:

Again, there’s a feature extractor plus a few layers on top. The YOLO feature extractor is called Darknet, and it’s not so different from the feature extractors you’ve seen before: Darknet consists of convolution layers, followed by batch normalization and the ReLU activation function, with pooling layers in between.

Note: The activation function used by Darknet is actually a variation of ReLU, known as leaky ReLU. Where a regular ReLU completely removes any values that are less than zero, the leaky version makes negative values a lot smaller but still lets them “leak through.”
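To make the difference concrete, here is a minimal sketch of both activation functions in plain Python. The slope of 0.1 for negative inputs is an assumption chosen for illustration; the text above does not say which value Darknet uses.

```python
def relu(x: float) -> float:
    """Regular ReLU: any value below zero is replaced by zero."""
    return max(0.0, x)


def leaky_relu(x: float, slope: float = 0.1) -> float:
    """Leaky ReLU: negative values are scaled down instead of removed.

    The default slope of 0.1 is an assumed value for this sketch.
    """
    return x if x > 0 else slope * x


# Compare the two functions on a negative, a zero and a positive input.
for value in (-2.0, 0.0, 3.0):
    print(value, relu(value), leaky_relu(value))
```

Positive inputs pass through both functions unchanged; only the treatment of negative inputs differs.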

The extra layers are all convolutional. Unlike before, where the output of the model was either a vector containing a probability distribution or the coordinates for the bounding box, the output of YOLO is a three-dimensional tensor of size 13 × 13 × 375 that we’ll refer to as the grid.

YOLO takes a 416×416 pixel image as input. That’s larger than what you typically use for classification. This way, small details don’t get lost. There are five pooling layers in Darknet that each halve the spatial dimensions of the image, for a total reduction factor of 2⁵ = 32. Since 416/32 = 13, the final grid is 13×13 cells.

Looking at this the other way around, each of the cells in this grid refers to a 32×32 block of pixels in the original image. Each cell is therefore responsible for detecting objects in or around that particular 32×32 region of the input image.
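The relationship between pixels and grid cells is plain integer arithmetic. The following sketch shows it; the function names are hypothetical helpers made up for this example and are not part of Turi Create.

```python
INPUT_SIZE = 416                   # YOLO input width and height in pixels
NUM_POOLING_LAYERS = 5             # each pooling layer halves the spatial size
STRIDE = 2 ** NUM_POOLING_LAYERS   # total reduction factor: 32
GRID_SIZE = INPUT_SIZE // STRIDE   # 416 / 32 = 13


def cell_for_pixel(px: int, py: int) -> tuple:
    """Return the (column, row) of the grid cell that covers pixel (px, py)."""
    return px // STRIDE, py // STRIDE


def pixel_region_for_cell(col: int, row: int) -> tuple:
    """Return (xmin, ymin, xmax, ymax) of the pixel block for a grid cell.

    The max values are exclusive.
    """
    return (col * STRIDE, row * STRIDE,
            (col + 1) * STRIDE, (row + 1) * STRIDE)


print(GRID_SIZE)                     # 13
print(cell_for_pixel(100, 415))      # (3, 12)
print(pixel_region_for_cell(3, 12))  # (96, 384, 128, 416)
```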

Each cell in the grid is responsible for its own region in the original image

YOLO, therefore, has 13×13 = 169 different bounding box predictors, and each of these is assigned to look only at a specific location in the image. Actually, this isn’t entirely true: Each grid cell has not just one but 15 different predictors, for a total of 169×15 = 2,535 bounding box predictors across the entire image. That’s quite an upgrade over the simple model you made previously!

Having multiple predictors per grid cell means you can let bounding box predictors specialize in different shapes and sizes of objects. Each cell will have a predictor that looks for small objects, a different predictor that looks for large objects, one that looks for wide but flat objects, one that looks for narrow but tall objects, and so on.

This is where the number 375 comes from, the depth dimension of the output grid: Each grid cell has 15 predictors that each output 25 numbers. Why 25? This is made up of the probability distribution over our snack classes, so that’s 20 numbers. It also includes four numbers for the bounding box coordinates. Finally, YOLO also predicts a confidence score for the bounding box: how likely it thinks this bounding box actually contains an object. So there are two confidences being predicted here: one for the class, and one for the bounding box.

Because the output of YOLO is a 13×13×375 tensor, it’s important to realize it always predicts 2,535 bounding boxes for every image you give it. Even if the image doesn’t contain any recognizable objects at all, YOLO still outputs 2,535 bounding boxes — whether you want them or not.
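If you want to double-check these numbers, the arithmetic fits in a few lines of Python. The `channel_range` helper is hypothetical, and it assumes that the 25 numbers of each predictor are stored back to back along the depth dimension; the actual memory layout of the model may differ.

```python
GRID_SIZE = 13
PREDICTORS_PER_CELL = 15
NUM_CLASSES = 20        # probability distribution over the snack classes
BOX_COORDINATES = 4     # bounding box coordinates
BOX_CONFIDENCE = 1      # confidence score for the bounding box

numbers_per_predictor = NUM_CLASSES + BOX_COORDINATES + BOX_CONFIDENCE
depth = PREDICTORS_PER_CELL * numbers_per_predictor
total_boxes = GRID_SIZE * GRID_SIZE * PREDICTORS_PER_CELL


def channel_range(predictor: int) -> tuple:
    """Return the (start, end) channels for one predictor; end is exclusive.

    Assumes a back-to-back layout, which is a choice made for this sketch.
    """
    start = predictor * numbers_per_predictor
    return start, start + numbers_per_predictor


print(numbers_per_predictor)  # 25
print(depth)                  # 375
print(total_boxes)            # 2535
print(channel_range(14))      # (350, 375)
```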

That’s why the confidence score is important: It tells you which boxes you can ignore. In an image with no or just a few objects, the vast majority of predicted boxes will have low confidence scores. So at least YOLO is kind enough to tell you which of these 2,535 predictions are rubbish.

Even after you filter out all the boxes with low confidence scores — for example, anything with a score less than 0.25 — you’ll still end up with too many predictions. This kind of situation is typical:
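Filtering on the confidence score is the easy part. Here is a minimal sketch, assuming each prediction is a dictionary with a `confidence` key; the sample values are made up for illustration.

```python
def filter_by_confidence(predictions, threshold=0.25):
    """Keep only the predictions whose confidence is at least `threshold`."""
    return [p for p in predictions if p["confidence"] >= threshold]


# Made-up predictions, purely for illustration.
predictions = [
    {"label": "dog", "confidence": 0.91},
    {"label": "dog", "confidence": 0.78},
    {"label": "cat", "confidence": 0.12},
    {"label": "cat", "confidence": 0.66},
]

kept = filter_by_confidence(predictions)
print(len(kept))  # 3
```

Notice that two "dog" predictions survive the filter, which is exactly the duplicate problem described here.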

I’m only counting one dog and cat, not three!

These are all bounding boxes that the model feels good about since they have high scores, but as a consumer of an object detection model, you really want to have only a single bounding box for each object in the image. This sort of thing happens because nearby cells may all make a prediction for the same object — especially when the object is larger than 32×32 pixels.

To remove these duplicates, a post-processing technique called non-maximum suppression (NMS) is used. The NMS algorithm keeps the predictions with the highest confidence scores and removes any other boxes that overlap them by more than a certain threshold, say an IOU of 45% or more. The model created by Turi Create automatically takes care of this post-processing step for you, so you don’t have to worry about any of this.
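Even though Turi Create handles this for you, it helps to see how little code the idea needs. The following is a simplified sketch, not Turi Create's actual implementation: it assumes boxes are given as `(xmin, ymin, xmax, ymax)` tuples and it ignores class labels, whereas real detectors often run NMS per class.

```python
def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.45):
    """Return the indices of the boxes to keep, highest score first."""
    # Visit the boxes from the highest to the lowest confidence score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it doesn't overlap too much with any box
        # that was already kept (those all have higher scores).
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep


# Made-up boxes: the first two overlap heavily, the third stands alone.
boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
print(non_maximum_suppression(boxes, scores))  # [0, 2]
```

The first two boxes have an IOU of 8100/11900, or roughly 0.68, so the one with the lower score is dropped.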

Note: Turi’s object detection model is known as TinyYOLO because it’s smaller than the full YOLO. The full version of YOLO has multiple output grids of varying dimensions in order to handle different object sizes better, but this model is also larger and slower. Another popular single-stage detector is SSD. Architecturally, YOLO and SSD are very similar in design and differ only in the details. SSD does not have its own feature extractor and can be used with many different convnets. Particularly suitable for use on mobile is the combination of SSD and MobileNet.

Hello Turi, my old friend

Switch to the turienv Python environment and create a new Jupyter notebook. You can find the environment in the starter project of this chapter’s materials. Refer back to Chapter 4: Getting Started with Python & Turi Create if you don’t remember how to activate environments.

import os, sys, math
import pandas as pd
import turicreate as tc

[ {'coordinates': {'height': 129, 'width': 151, 'x': 75, 'y': 186},
   'label': 'juice'},
  {'coordinates': {'height': 130, 'width': 170, 'x': 228, 'y': 191},
   'label': 'juice'},
  {'coordinates': {'height': 129, 'width': 153, 'x': 76, 'y': 191},
   'label': 'juice'} ],

def load_images_with_annotations(images_dir, annotations_file):
    # Load the images into a Turi SFrame.
    data = tc.image_analysis.load_images(images_dir, with_path=True)

    # Load the annotations CSV file into a Pandas dataframe.
    csv = pd.read_csv(annotations_file)
    all_annotations = []
    for i, item in enumerate(data):
        # Grab image info from the SFrame.
        img_path = item["path"]
        img_width = item["image"].width
        img_height = item["image"].height

        # Find the corresponding row(s) in the CSV's dataframe.
        image_id = os.path.basename(img_path)[:-4]
        rows = csv[csv["image_id"] == image_id]
        img_annotations = []
        for row in rows.itertuples():
            xmin = int(round(row[2] * img_width))
            xmax = int(round(row[3] * img_width))
            ymin = int(round(row[4] * img_height))
            ymax = int(round(row[5] * img_height))

            # Convert to center coordinate and width/height:
            width = xmax - xmin
            height = ymax - ymin
            x = xmin + math.floor(width / 2)
            y = ymin + math.floor(height / 2)
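            class_name = row[6]

            img_annotations.append({"coordinates":
                  {"height": height, "width": width, "x": x, "y": y},
                  "label": class_name})

        # Images without annotations get None, so that dropna() below
        # removes them from the SFrame.
        if len(img_annotations) > 0:
            all_annotations.append(img_annotations)
        else:
            all_annotations.append(None)

    data["annotations"] = tc.SArray(data=all_annotations, dtype=list)
    return data.dropna()

data_dir = "snacks"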
train_dir = os.path.join(data_dir, "train")

train_data = load_images_with_annotations(train_dir,
                    data_dir + "/annotations-train.csv")
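The coordinate conversion inside load_images_with_annotations() is worth trying on one example by hand. This standalone sketch repeats the same arithmetic for a single bounding box; the function name and the sample numbers are made up for illustration.

```python
import math


def to_center_format(xmin, xmax, ymin, ymax, img_width, img_height):
    """Convert normalized corner coordinates to a center/size dictionary."""
    # Scale the normalized values (between 0 and 1) to pixel coordinates.
    xmin = int(round(xmin * img_width))
    xmax = int(round(xmax * img_width))
    ymin = int(round(ymin * img_height))
    ymax = int(round(ymax * img_height))

    # Convert to center coordinate and width/height.
    width = xmax - xmin
    height = ymax - ymin
    x = xmin + math.floor(width / 2)
    y = ymin + math.floor(height / 2)
    return {"height": height, "width": width, "x": x, "y": y}


# A box covering the middle half of a 400x300 image horizontally.
print(to_center_format(0.25, 0.75, 0.1, 0.5, img_width=400, img_height=300))
# {'height': 120, 'width': 200, 'x': 200, 'y': 90}
```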
The SFrame now contains the annotations dictionaries

util = tc.object_detector.util
train_data["image_with_ground_truth"] = util.draw_bounding_boxes(
    train_data["image"], train_data["annotations"])
Viewing the ground-truth boxes on the training images

Training the model

It just takes a single line of code and a whole lot of patience:

model = tc.object_detector.create(train_data, feature="image",
                                  annotations="annotations")
Setting 'batch_size' to 32
Using GPU to create model (GeForce GTX 1080 Ti)
Setting 'max_iterations' to 13000
| Iteration    | Loss         | Elapsed Time |
| 1            | 11.276       | 12.7         |
| 36           | 10.892       | 22.8         |
| 71           | 10.506       | 32.8         |
| 107          | 10.517       | 43.1         |
| 12999        | 2.106        | 3755.3       |

How good is it?

In case you don’t have the hardware or the time to train this model yourself, we’ve included the trained model in the downloads as a .zip file in the final folder. Unzip this model to your working directory and then load it into the notebook:

model = tc.load_model("SnackDetector.model")
test_dir = os.path.join(data_dir, "test")
test_data = load_images_with_annotations(test_dir,
                   data_dir + "/annotations-test.csv")
scores = model.evaluate(test_data)
{'average_precision_50': {
  'apple': 0.52788541232511876,
  'banana': 0.41939129680862453,
  'cake': 0.38973319479991153,
  'candy': 0.36857447872282678,
  'watermelon': 0.37970409310715819},
 'mean_average_precision_50': 0.38825907147323535}
test_data["predictions"] = model.predict(test_data)
[{'confidence': 0.7225357099539148,
  'coordinates': {'height': 73.92794444010806,
                  'width': 90.45315889211807,
                  'x': 262.2198759929745,
                  'y': 155.496952970812},
  'label': 'dog',
  'type': 'rectangle'},
test_data["image_with_predictions"] = util.draw_bounding_boxes(
    test_data["image"], test_data["predictions"])
Viewing the predicted bounding boxes

The demo app

This is a book about machine learning on iOS, and it’s been a while since we’ve seen the inside of Xcode, so let’s put the trained YOLO model into an app. The book downloads contain a demo app named ObjectDetection.

The YOLO model in Core ML

The YOLO model in action on the iPhone

if #available(iOS 13.0, *) {
  visionModel.inputImageFeatureName = "image"
  visionModel.featureProvider = try MLDictionaryFeatureProvider(
      dictionary: [
    "iouThreshold": MLFeatureValue(double: 0.45),
    "confidenceThreshold": MLFeatureValue(double: 0.25),
  ])
}

Semantic segmentation

You’ve seen how to do classification of the image as a whole, as well as classification of the contents of bounding boxes, but it’s also possible to make a separate classification prediction for each individual pixel in the image. This is called semantic segmentation. Here’s what it looks like:

Semantic segmentation makes a class prediction for every pixel

DeepLab on top of MobileNetV2

Converting the model

You’re going to be using a pre-trained version of DeepLab that is made freely available as part of the TensorFlow Models repository, at:

Part of the frozen inference graph in Netron

$ pip install -U tfcoreml
$ pip install -U git+
import tfcoreml as tf_converter

input_path = "deeplabv3_mnv2_pascal_trainval/frozen_inference_graph.pb"
output_path = "DeepLab.mlmodel"
input_tensor = "ImageTensor:0"
input_name = "ImageTensor__0"
output_tensor = "ResizeBilinear_3:0"

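tf_converter.convert(tf_model_path=input_path,
         mlmodel_path=output_path,
         output_feature_names=[output_tensor],
         input_name_shape_dict={input_tensor : [1, 513, 513, 3]},
         image_input_names=input_name)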
$ python3
The summary page for DeepLab.mlmodel

The demo app

The downloads for this chapter include an app called Segmentation. The code is very similar to that of the HealthySnacks app from a few chapters ago, except now there are two pairs of camera/photo library buttons, allowing you to select a background image and a foreground image.

The author on well-deserved fake holiday (left), the corresponding segmentation mask (right)

required init?(coder aDecoder: NSCoder) {
  let outputs = deepLab.model.modelDescription.outputDescriptionsByName
  guard let output = outputs["ResizeBilinear_3__0"],
        let constraint = output.multiArrayConstraint else {
    fatalError("Expected 'ResizeBilinear_3__0' output")
  }
  deepLabHeight = constraint.shape[1].intValue
  deepLabWidth = constraint.shape[2].intValue

  super.init(coder: aDecoder)
}
func processObservations(for request: VNRequest, error: Error?) {
  if let results = request.results as? [VNCoreMLFeatureValueObservation],
     let multiArray = results[0].featureValue.multiArrayValue {

    DispatchQueue.main.async { multiArray)
let classes = features.shape[0].intValue
let height = features.shape[1].intValue
let width = features.shape[2].intValue
var pixels = [UInt8](repeating: 255, count: width * height * 4)
let value = features[[c, y, x] as [NSNumber]].doubleValue
let featurePointer = UnsafeMutablePointer<Double>(
                       OpaquePointer(features.dataPointer))
let cStride = features.strides[0].intValue
let yStride = features.strides[1].intValue
let xStride = features.strides[2].intValue
let value = featurePointer[c*cStride + y*yStride + x*xStride]
for y in 0..<height {
  for x in 0..<width {

    // Take the argmax for this pixel, the index of the largest class.
    var largestValue: Double = 0
    var largestClass = 0
    for c in 0..<classes {
      let value = featurePointer[c*cStride + y*yStride + x*xStride]
      if value > largestValue {
        largestValue = value
        largestClass = c
      }
    }
    . . .
  }
}
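If the stride arithmetic in this loop looks mysterious, it may help to try it on a tiny array. The following Python sketch mirrors the same indexing on a flat list. It assumes the values are stored contiguously in (class, y, x) order, and, unlike the Swift loop, it seeds the comparison with the value for class 0 so that it still picks a class when all the scores are negative.

```python
def argmax_per_pixel(data, classes, height, width):
    """Return a height x width grid holding the winning class per pixel.

    `data` is a flat list in (class, y, x) order; this contiguous layout
    is an assumption made for the sketch.
    """
    c_stride = height * width   # step to the same pixel in the next class
    y_stride = width            # step to the same column in the next row
    x_stride = 1                # step to the next pixel in the same row

    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Start with class 0, then look for anything larger.
            largest_class = 0
            largest_value = data[y * y_stride + x * x_stride]
            for c in range(1, classes):
                value = data[c * c_stride + y * y_stride + x * x_stride]
                if value > largest_value:
                    largest_value = value
                    largest_class = c
            row.append(largest_class)
        result.append(row)
    return result


# Three classes on a 1x2 image, with made-up scores.
scores = [0.1, 0.9,    # class 0
          0.7, 0.2,    # class 1
          0.3, 0.95]   # class 2
print(argmax_per_pixel(scores, classes=3, height=1, width=2))  # [[1, 2]]
```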


Challenge 1: Create a dataset for object detection

If you collected your own classification dataset for one of the previous challenges, then use a tool such as RectLabel to add bounding box annotations for these images. RectLabel uses a different file format to store the annotations, but has code examples that show how to use these files with Turi Create.

Challenge 2: Train MobileNet+SSD on the snacks dataset

The size of the YOLO model you trained with Turi Create is 64.6 MB. That’s pretty hefty! This is reaching the upper limit of what is acceptable on mobile devices. It’s possible to use object detection models that are much smaller than YOLO that give very good results, such as SSD on top of MobileNet (about 26 MB).

Challenge 3: Change the semantic segmentation demo app

Change the semantic segmentation demo app to only keep pixels that belong to cats and dogs — or whatever your favorites are from the 20 Pascal VOC classes.

Key points

Where to go from here?

Congrats, you’ve reached the end of section 1, Machine Learning with Images! Of course, we hope this is really only the beginning of your journey into the wonderful world of computer vision and deep learning.

© 2023 Kodeco Inc.
