Beginning Data Science with Jupyter Notebook and Kotlin

This tutorial introduces the concepts of Data Science, using Jupyter Notebook and Kotlin. You’ll learn how to set up a Jupyter notebook, load krangl for Kotlin and use it in data science utilizing a built-in sample data. By Joey deVilla.

Leave a rating/review
Download materials
Save for later
Share

This article builds on Create Your Own Kotlin Playground (and Get a Data Science Head Start) with Jupyter Notebook, which covers the following:

  • Jupyter Notebook, an open source web application that lets you create documents that mix text, graphics, and live code. Its native programming language is Python.
  • The Kotlin kernel for Jupyter Notebook adds a Kotlin interpreter. The previous article showed how you could use Jupyter Notebook and the Kotlin Kernel as an interactive “playground” for Kotlin code.
  • krangl, a library whose name is short for Kotlin library for data wrangling.
  • Data frames, which are data structures that represent tables of data organized into rows and columns. They’re the primary data structure in data science applications.

In this article, you’ll take what you’ve learned and build on it by performing basic data science.

Getting Started

If you’re not already running Jupyter Notebook, launch it now. The simplest way to do so is to enter the following on the command line:

jupyter notebook

Your computer will open a new browser window or tab, and it will contain a page that looks similar to this:

The initial Jupyter Notebook page, which displays the files and folders in the current directory.

Create a new Kotlin notebook by clicking on the New button near the page’s upper right corner. A menu will appear, one of the options will be Kotlin; select that option to create a new Kotlin notebook.

A new browser tab or window will open, containing a page that looks like this:

A newly created Jupyter Notebook with a single blank code cell.

The page will contain a single cell, a code cell (the default). Enter the following into the cell:

%use krangl

Now run it, either by clicking the Run button or by typing Shift-Enter. This will import krangl.

The cell should look like this for a few seconds…

A Jupyter Notebook cell while code is executing. An asterisk appears beside the cell while its code is executing.

…then it will look like this:

A Jupyter Notebook cell after code has executed. The asterisk is replaced by a number.

At this point, krangl should have loaded, and you can now use it for data science.

Working with krangl’s Built-In Data Frames

People who are just getting started with data science often struggle with finding good datasets to work with. To counter this problem, krangl provides three datasets in the form of three built-in DataFrame instances.

Going from smallest to largest, these built-in data frames are:

  • sleepData: Data about the sleep patterns of 83 species of animal, including humans.
  • irisData: A set of measurements of 150 flowers from different species of iris.
  • flightsData: Over 330,000 rows of data about flights taking off from New York City airports in 2013.

Getting a Data Frame’s First and Last Rows

The datasets that you’ll often encounter will contain a lot of observations, possibly hundreds, thousands, or even millions. In order not to overwhelm you, krangl limits the amount it displays by default.

If you simply enter the name of a large data frame into a code cell and run it, you’ll see only the first six rows, followed by a count of the remaining rows and the data frame’s dimensions.

Try it with sleepData. Run the following in a new code cell:

sleepData

You’ll see this result:

A table displaying the first 6 rows of the sleepData data frame. Text at the bottom says there are 77 more rows and that the data frame's shape is 83 by 11.

Notice that krangl tells you how many more rows there are in the data frame, 77 in this case, as well as the data frame’s dimensions: 83 rows and 11 columns. This is important because it tells you that while krangl is only showing you the first 6 rows, sleepData actually has 83 rows of data.

To see only the first n rows of the data frame, you can use the head() method. Run the following in a new code cell:

sleepData.head()

The notebook will respond by returning a data frame made up of the first five rows of sleepData:

A table displaying the first 5 rows of the sleepData data frame. Text at the bottom says that the data frame's shape is 5 by 11.

head() returns a new data frame made from the first n rows of the original data frame. By default, n is 5.

Try getting the first 20 rows of sleepData by running the following in a new code cell:

sleepData.head(20)

You’ll see these results:

A table displaying the first 6 rows of the sleepData data frame. Text at the bottom says there are 14 more rows and that the data frame's shape is 20 by 11.

The result is a data frame consisting of the first 20 rows of sleepData, with the first six rows displayed onscreen. The text at the bottom of the data frame shows you how many rows you’re not seeing, as well as the data frame’s dimensions.

head() has a counterpart, tail(), which returns a new data frame made up of the last n rows of the original data frame. As with head(), the default value for n is 5.

Run the following in a new code cell:

val lastFew = sleepData.tail()
lastFew

The notebook will respond by creating and displaying lastFew, a data frame that contains the last 5 rows of sleepData:

A table displaying the last 5 rows of the sleepData data frame. Text at the bottom says that the data frame's shape is 5 by 11.

Extracting a slice() from the Data Frame

Suppose you weren’t interested in the first or last n rows of the data frame, but a section in the middle. That’s where the slice() method comes in. Given an integer range, it creates a new data frame consisting of the rows specified in the range.

Run the following in a new code cell:

val selection = sleepData.slice(30..34)
selection

This creates a new data frame, selection, and then displays its contents:

A table displaying rows 30 through 34 of the sleepData data frame. Text at the bottom says that the data frame's shape is 5 by 11.

If you looked at the row for “Human” and thought, “Who actually gets eight hours of sleep?!” keep in mind that these numbers are averages.

Learning that slice() Indexes Start at 1, not 0

Krangl borrows the DataFrame‘s slice()method from the dplyr library.

Dplyr is a library for R, where the first index of a collection type is 1. This is known as one-based array indexing.

In Kotlin, and most other programming languages, the first index of a collection type is 0.

krangl makes it easy for R programmers to make the leap to doing Kotlin sdata science. One of the ways to meet this goal would be to make krangl’s version of slice() work exactly like dplyr’s, right down to using 1 as the index for the first row in the data frame.

It’s time to do a little exploring.

Exploring the Data

Look back at the code cell where you used head() to see the first 5 rows of sleepData:

A table displaying the first 5 rows of the sleepData data frame. Text at the bottom says that the data frame's shape is 5 by 11.

The first animal in the list is the cheetah. Try using DataFrame’s rows property and rows elementAt() method to retrieve the row whose index is 0:

sleepData.rows.elementAt(0)

You should see this result:

{name=Cheetah, genus=Acinonyx, vore=carni, order=Carnivora, conservation=lc, sleep_total=12.1, sleep_rem=null, sleep_cycle=null, awake=11.9, brainwt=null, bodywt=50.0}

This makes sense as rows is a Kotlin Iterable and elementAt() is a method of Iterable. It makes sense that sleepData.rows.elementAt(0) would return an object representing the first row of sleepData.

Now, try approximating the same thing with slice(). The result will be of a different type: sleepData.rows.elementAt() returns a DataFrameRow, while slice() returns a whole DataFrame.

Run the following in a new code cell:

sleepData.slice(0..0)

The result will be an empty data frame; 11 columns, but no rows:

A table displaying the sleepData data frame with no rows - just column headings. Text at the bottom says that the data frame's shape is 0 by 11.

Now, try something else. Run the following in a new code cell:

sleepData.slice(1..1)

This time, you get a data frame with a single row; note the dimensions:

A table displaying the sleepData data frame with 1 row. Text at the bottom says that the data frame's shape is 1 by 11.

Try one more thing; run the following in a new code cell:

sleepData.slice(1..2)

This results in a data frame with two rows:

A table displaying the sleepData data frame with 2 rows. Text at the bottom says that the data frame's shape is 2 by 11.

From these little bits of code, you can deduce that slice(a..b) returns a data frame that:

  1. Starts with and includes row a from the original data frame.
  2. Ends with and includes row b from the original data frame.
  3. Counts rows starting from 1 rather than starting from 0.