Beginning Data Science with Jupyter Notebook and Kotlin

This tutorial introduces the concepts of Data Science, using Jupyter Notebook and Kotlin. You’ll learn how to set up a Jupyter notebook, load krangl for Kotlin and use it in data science utilizing a built-in sample data. By Joey deVilla.

Leave a rating/review
Download materials
Save for later
Share
You are currently viewing page 3 of 4 of this article. Click here to view the first page.

Filtering With Negation

Imagine you wanted a data frame containing the animals that aren’t herbivores? You may have noticed that there isn’t a notEqualTo or ne in the list of comparison functions.

krangl’s approach for cases like this is to extend BooleanArray to give it a not() method. When called, this method negates every element in the array: true values become false, and false values become true.

Enter the following in a new code cell to see how you can use not() to create a data frame of all the non-herbivores:

val nonHerbivores = sleepData.filter { (it["vore"] eq "herbi").not() }
nonHerbivores

This works because it["vore"] eq "herbi" creates a BooleanArray specifying which rows contain herbivores. Calling not() on that array inverts its contents, making it a BooleanArray specifying which rows do not contain herbivores.

The output shows some carnivores, insectivores, omnivores, and one animal for which the vore value was not provided, the vesper mouse.

Filtering With Multiple Criteria

In addition to not(), krangl extends BooleanArray with the AND and OR infix functions. Using AND, you could have created and displayed a dataset of heavy herbivores by running this code instead:

val alsoHeavyHerbivores = sleepData
 .filter { 
 (it["vore"] eq "herbi") AND 
 (it["bodywt"] gt 30) 
 }
alsoHeavyHerbivores

Fancier Text Filtering

In addition to seeing if the contents of a column are equal to, greater than or less than a given value, you can do some fancier text filtering with the help of DataCol’s isMatching() method.

This is another method that’s easier to demonstrate than explain, so run the following in a new code cell:

val monkeys = sleepData.filter { it["name"].isMatching<String> { contains("monkey") } }
monkeys

This will result in a data frame containing only those animals with “monkey” in their name. The lambda that you provide to isMatching() should accept a string argument and return a boolean. You’ll find functions such as startsWith(), endsWith(), contains() and matches() useful within that lambda.

Removing Columns

With filtering, you were removing rows from a data frame. You’ll often find that removing columns whose information isn’t needed at the moment is also useful. DataFrame has a couple of methods that you can use to create a new data frame from an existing one, but with fewer columns.

select() creates a new data frame from an existing one, but only with the columns whose names you specify.

Run the following code in a new cell to see select() in action:

val simplifiedSleepData = sleepData.select("name", "vore", "sleep_total", "sleep_rem")
simplifiedSleepData

You’ll see with a table that has only the columns whose names you specified: name, vore, sleep_total, and sleep_rem.

A table displaying the first 6 rows of the sleepData data frame, with only the name, vore, sleep_total and sleep_rem columns. Text at the bottom says that there are 77 more rows and that the data frame's shape is 83 by 4.

remove() creates a new data frame from an existing one, but without the columns whose names you specify.

Run the following code in a new cell to see remove() in action:

val evenSimplerSleepData = simplifiedSleepData.remove("sleep_rem")
evenSimplerSleepData

You’ll see this result:

A table displaying the first 6 rows of the sleepData data frame, with only the name, vore and sleep_total columns. Text at the bottom says that there are 77 more rows and that the data frame's shape is 83 by 3.

Performing Complex Data Operations

So far, you’ve been working with operations that work with observations on an individual basis. While useful, the real power of data science comes from aggregating data — collecting many units of data into one.

Calculating Column Statistics

DataCol provides a number of handy math and statistics methods for often-performed calculations on numerical columns.

The simplest way to demonstrate how they work is to show you these functions in action on a given column, such as sleep_total.

Run the following code in a new code cell to see these functions in action:

val sleepCol = sleepData["sleep_total"]
println("The mean sleep period is ${sleepCol.mean(removeNA=true)} hours.")
println("The median sleep period is ${sleepCol.median(removeNA=true)} hours.")
println("The standard deviation for the sleep periods is ${sleepCol.sd(removeNA=true)} hours.")
println("The shortest sleep period is ${sleepCol.min(removeNA=true)} hours.")
println("The longest sleep period is ${sleepCol.max(removeNA=true)} hours.")

These methods take a boolean argument, removeNA, which specifies whether to exclude missing values from the calculation. The default value is false, but it’s generally better to set this value to true.

Grouping

In the previous exercise, you calculated the mean, median, and other statistical information about the entire set of animals in the dataset. While those results might be useful, you’re likely to get more insights by doing the same calculations on smaller sets of animals with something in common. Grouping allows you to create these smaller sets.

Given one or more column names, DataFrame’s group() method creates a new data frame from an existing one, but organized into a set of smaller data frames. called groups. Each group consists of rows that had the same values for the specified columns.

Don’t worry if the previous paragraph confused you. This is yet another one of those situations where showing is better than telling.

Break sleepData‘s animals into groups based on the food they eat. There should be a carnivore group, an omnivore group, a herbivore group, etc. The vore column specifies what an animal eats, so you’ll provide that column name to the group() method.

Run the following in a new code cell:

val groupedData = sleepData.groupBy("vore")

This creates a new data frame named groupedData.

If you try to display groupedData‘s contents by entering its name into a new code cell and running it, the output will be confusing:

Grouped by: *[vore]
A DataFrame: 5 x 11
 name genus vore order conservation sleep_total sleep_rem
1 Cheetah Acinonyx carni Carnivora lc 12.1 
2 Northern fur seal Callorhinus carni Carnivora vu 8.7 1.4
3 Dog Canis carni Carnivora domesticated 10.1 2.9
4 Long-nosed armadillo Dasypus carni Cingulata lc 17.4 3.1
5 Domestic cat Felis carni Carnivora domesticated 12.5 3.2
and 4 more variables: awake, brainwt, bodywt

The output says that the data frame is grouped by vore and has 5 rows and 11 columns. The problem is that the default way to display the contents of a data frame works only for ungrouped data frames. You have to do a little more work to display the contents of a grouped data frame.

Run and enter the following in a new code cell:

groupedData.groups()

groups() returns the list of the data frame’s groups, and remember, groups, are just data frames.

You’ll see 5 data frames in a row. Each of these data frames contains rows with a common value in their vore column. There’s one group for “carni”, one called “omni”, and so on.

DataFrame has a count() method that’s very useful for groups; it returns a data frame containing the count of rows for each group.

Run and enter the following in a new code cell:

groupedData.count()

This is the result:

A table displaying a data frame that groups the sleepData animals by vore and displays the count for each group. Text at the bottom says that the data frame's shape is 5 by 2.

Any data frame operation performed on a grouped data frame is done on a per-group basis.

For example, sorting a grouped data frame creates a new data frame with individually sorted groups. You can see this for yourself by running the following in a new code cell:

val sortedGroupedData = groupedData.sortedBy("name")
sortedGroupedData.groups()