Beginning Machine Learning with Keras & Core ML

In this Keras machine learning tutorial, you’ll learn how to train a convolutional neural network model, convert it to Core ML, and integrate it into an iOS app. By Audrey Tam.

Leave a rating/review
Save for later

Apple’s Core ML and Vision frameworks have launched developers into a brave new world of machine learning, with an explosion of exciting possibilities. Vision lets you detect and track faces, and Apple’s Machine Learning page provides ready-to-use models that detect objects and scenes, as well as NSLinguisticTagger for natural language processing. If you want to build your own model, try Apple’s new Turi Create to extend one of its pre-trained models with your data.

But if what you want to do needs something even more customized? Then, it’s time to dive into machine learning (ML), using one of the many frameworks from Google, Microsoft, Amazon or Berkeley. And, to make life even more exciting, you’ll need to pick up a new programming language and a new set of development tools.

In this Keras machine learning tutorial you’ll learn how to train a deep-learning convolutional neural network model, convert it to Core ML, and integrate it into an iOS app. You’ll learn some ML terminology, use some new tools, and pick up a bit of Python along the way.

The sample project uses ML’s Hello-World example — a model that classifies hand-written digits, trained on the MNIST dataset.

Let’s get started!

Why Use Keras?

An ML model involves a lot of complex code, manipulating arrays and matrices. But ML has been around for a long time, and researchers have created libraries that make it much easier for people like us to create ML models. Many of these are written in Python, although researchers also use R, SAS, MATLAB and other software. But you’ll probably find everything you need in the Python-based tools:

  • scikit-learn provides an easy way to run many classical ML algorithms, such as linear regression and support vector machines. Our Beginning Machine Learning with scikit-learn tutorial shows you how to train these.
  • At the other end of the spectrum are PyTorch and Google’s TensorFlow, which give you greater control over the inner workings of your deep learning model.
  • Microsoft’s CNTK and Berkeley’s Caffe are similar deep learning frameworks, which have Python APIs to access their C++ engines.

So where does Keras fit in? It’s a wrapper around TensorFlow and CNTK, with Amazon’s MXNet coming soon. (It also works with Theano, but the University of Montreal stopped working on this in September 2017.) It provides an easy-to-use API for building models that you can train on one backend, and deploy on another.

Another reason to use Keras, rather than directly using TensorFlow, is that coremltools includes a Keras converter, but not a TensorFlow converter — although a TensorFlow to CoreML converter and a MXNet to CoreML converter exist. And while Keras supports CNTK as a backend, coremltools only works for Keras + TensorFlow.

Note: Do you need to learn Python before you can use these tools? Well, I didn’t ;] As you work through this tutorial, you’ll see that Python syntax is similar to Swift: a bit more streamlined, and indentation is an important part of the syntax. If you’re nervous, keep this open in a browser tab, for quick reference: Jason Brownlee’s Crash Course in Python for Machine Learning Developers.
Another Note: Researchers use both Python 2 and Python 3, but coremltools works better with Python 2.7.

Getting Started

Download and unzip the starter folder: it contains a starter iOS app, where you’ll add the ML model and code to use it. It also contains a docker-keras folder, which contains this tutorial’s Jupyter notebook.

Setting Up Docker

Docker is a container platform that lets you deploy apps in customized environments — sort of like a virtual machine, but different. Installing Docker gives you access to a large number of ML resources, mostly distributed as interactive Jupyter notebooks in Docker images.

Note: Installing Docker and building the image will take several minutes, so read the ML in a Nutshell section while you wait.

Download, install, and start Docker Community Edition for Mac. In Terminal, enter the following commands, one at a time:

cd <where you unzipped starter>/starter/docker-keras
docker build -t keras-mnist .
docker run --rm -it -p 8888:8888 -v $(pwd)/notebook:/workspace/notebook keras-mnist

This last command maps the Docker container’s notebook folder to the local notebook folder, so you’ll have access to files written by the notebook, even after you logout of the Docker server.

At the very end of the command output is a URL containing a token. It looks like this, but with a different token value:

Paste this URL into a browser to login to the Docker container’s notebook server.

Open the notebook folder, then open keras_mnist.ipynb. Tap the Not Trusted button to change it to Trusted: this allows you to save changes you make to the notebook, as well as the model files, in the notebook folder.

ML in a Nutshell

Arthur Samuel defined machine learning as “the field of study that gives computers the ability to learn without being explicitly programmed”. You have data, which has some features that can be used to classify the data, or use it to make some prediction, but you don’t have an explicit formula for computing this, so you can’t write a program to do it. If you have “enough” data samples, you can train a computer model to recognize patterns in this data, then apply its learning to new data. It’s called supervised learning when you know the correct outcomes for all the training data: then the model just checks its predictions against the known outcomes, and adjusts itself to reduce error and increase accuracy. Unsupervised learning is beyond the scope of this tutorial.

Weights & Threshold

Keras CoreML tutorial

Say you want to choose a restaurant for dinner with a group of friends. Several factors influence your decision: dietary restrictions, access to public transport, price range, type of food, child-friendliness, etc. You assign a weight to each factor, to indicate its importance for your decision. Then, for each restaurant in your list of options, you assign a value for each factor, according to how well the restaurant satisfies that factor. You multiply each factor value by the factor’s weight, and add these up to get the weighted sum. The restaurant with the highest result is the best choice. Another way to use this model is to produce binary output: yes or no. You set a threshold value, and remove from your list any restaurant whose weighted sum falls below this threshold.

Training an ML Model

Coming up with the weights isn’t an easy job. But luckily you have a lot of data from previous dinners, including which restaurant was chosen, so you can train an ML model to compute weights that produce the same results, as closely as possible. Then you apply these computed weights to future decisions.

To train an ML model, you start with random weights, apply them to the training data, then compare the computed outputs with the known outputs to calculate the error. This is a multi-dimensional function that has a minimum value, and the goal of training is to determine the weights that get very close to this minimum. The weights also need to work on new data: if the error over a large set of validation data is higher than the error over the training data, then the model is overfitted — the weights work too well on the training data, indicating training has mistakenly detected some feature that doesn’t generalize to new data.

Stochastic Gradient Descent

To compute weights that reduce the error, you calculate the gradient of the error function at the current graph location, then adjust the weights to “step down” the slope. This is called gradient descent, and happens many times during a training session. For large datasets, using all the data to calculate the gradient takes a long time. Stochastic gradient descent (SGD) estimates the gradient from randomly selected mini-batches of training data — like taking a survey of voters ahead of election day: if your sample is representative of the whole dataset, then the survey results accurately predict the final results.


The error function is lumpy: you have to be careful not to step too far, or you might miss the minimum. Your step rate also needs to have enough momentum to push you out of any false minimum. ML researchers have put a lot of effort into devising optimization algorithms to do this. The current favorite is Adam (Adaptive Moment estimation), which combines the features of previous favorites RMSprop (Root Mean Square propagation) and AdaGrad (Adaptive Gradient algorithm).