The Object and Categorical Data Types#

In our previous readings, we introduced the idea that not only can DataFrames and Series hold any of the numeric data types we’ve come to know and love from numpy — like float64 or int64 — but that they can also hold arbitrary Python objects in an object-type Series. Moreover, we also acknowledged (though didn’t go into) the existence of another relatively unique pandas data type — the Categorical Series.

In this reading, we will discuss the purpose of these two data types, as well as their advantages and disadvantages.

The object dtype#

The object type Series gives pandas incredible flexibility as it allows any type of data to be stored in a table. The most common use of the object data type is to store text — for example, names, addresses, written answers, etc. — but the flexibility can also be used for applications like geospatial analysis, in which a single row of a DataFrame may represent a single country, the columns may represent features of the country (name, income, population), and the last column stores geometric Python objects that describe the shape and location of the country.

But this flexibility also comes at a cost in both performance and memory efficiency.

The object Performance Penalty#

To understand why object Series are slow, it helps to first discuss why numeric Series (and numeric numpy arrays) are fast. When you work with a numeric pandas Series or numpy array, all of the entries live side by side in one contiguous block of memory (in your computer’s RAM). This is possible because all those integers (or all those floating point numbers) are written with the same number of 0s and 1s. In an int64 Series, for example, every integer is represented by 64 1s and 0s. This makes it easy for the computer to lay them out sequentially, and it makes it possible for the computer to find specific entries quickly: it knows the third integer in the Series will start 64 * 2 = 128 bits from the start of the Series and end 64 bits later.

But an object Series is a little different. Python objects vary in size — some may only take up 128 0s and 1s, while others may require thousands — and so the actual data in an object Series can’t be laid down in a nice regularly spaced sequence. Instead, every entry in an object Series gets put in a different location in your computer’s memory (RAM), and only the address of that information is placed in a nice organized Series. These addresses are all the same size, and so the addresses can be organized in a regular manner, even if the actual content you want to store is irregular.

The cost of this arrangement is that if you ask for the second entry of an object Series (e.g., my_series.iloc[1]), your computer has to first go to the second location in the array, read the address stored there, and then go to that address to find the actual content you want. Those added steps take time.
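In fact, you can measure the cost of all this indirection directly: pandas can report how much memory a Series uses, including the Python objects that the stored addresses point to, via .memory_usage(deep=True). Here is a minimal sketch (the exact numbers will vary by platform and pandas version):

import numpy as np
import pandas as pd

int_series = pd.Series(np.arange(1_000_000), dtype="int64")
obj_series = pd.Series(np.arange(1_000_000), dtype="object")

# int64: 8 bytes per entry, laid out contiguously -- about 8 MB
int_series.memory_usage(deep=True)

# object: an 8-byte address per entry PLUS a full Python int object
# living elsewhere in RAM for every entry -- several times larger
obj_series.memory_usage(deep=True)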

The other problem with object Series is that because they can store anything, Python doesn’t know before it looks up an entry whether it will find a string, an integer, or a Python set. As a result, when it sees code like:

my_array * 2

Python can’t be sure what is meant by * — it could mean “do integer multiplication” (if a given entry in my_array is an integer), but it might also mean “double up the list you find” (if the entry is a Python list)!
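To see this polymorphism in action, consider a small sketch of an object Series holding an integer, a string, and a list. The same * 2 does something different to each entry:

import pandas as pd

mixed = pd.Series([3, "ha", [1, 2]], dtype="object")
mixed * 2
# 0               6    <- integer multiplication
# 1            haha    <- string repetition
# 2    [1, 2, 1, 2]    <- list repetition
# dtype: object

Python has to check the type of every single entry before it can decide which version of * to run.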

Indeed, we can see this cost if we build a pandas Series of integers stored with the object dtype and compare it to an int64 Series. Both have the same content, but they are organized in memory differently:

import pandas as pd
import numpy as np

object_numbers = pd.Series(np.arange(1000000), dtype="object")
numbers = pd.Series(np.arange(1000000), dtype="int64")
%timeit object_numbers * 2
15.9 ms ± 70.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit numbers * 2
771 µs ± 9.62 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

See? Same operation (doubling each entry of a Series containing the integers 0 through 999,999), but the object Series version is ~20x slower.

So yes, object dtypes are absolutely wonderful and introduce unbelievable flexibility to pandas; but remember there is a cost to using them, so stick to numeric data types when possible!

Categoricals#

The category data type is a delightful little hack that allows us to avoid most of the problems with object Series in certain circumstances.

To illustrate how category Series work, suppose we have a DataFrame with information on hundreds of thousands of customers in the United States, and that one column of that DataFrame contains the name of each customer’s state of residence (substitute Province or any other sub-national administrative unit if states don’t resonate for you).

Because those state names are words, they are being stored in an object Series. That, in turn, means that Python has created hundreds of thousands of Python objects — each containing the name of a customer’s state — and a vector containing addresses for each of those objects.

But as there are only 50 states in the United States, this might strike you as absurd, since most of those hundreds of thousands of Python objects are holding the same text! Surely we can do something more efficient than that?

Enter Categoricals. The idea of a category Series is to take an object Series that contains frequently repeated strings and:

  1. Replace each unique string with a number (for example, "Colorado" could become 0 and "Tennessee" could become 1), and

  2. Create a small “lookup table” that keeps track of what string is associated with each number.

Now pandas doesn’t need to create hundreds of thousands of Python strings to record each customer’s state; it can just make a numeric Series in which each state name has been replaced by a number from 0 to 49, plus one small vector holding the fifty state names.
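If you want to see this split performed by hand, pd.factorize does exactly this: it returns the integer codes and the lookup table of unique values (the category dtype just manages this bookkeeping for you automatically):

import pandas as pd

states = pd.Series(["Colorado", "Tennessee", "Colorado", "Virginia"])
codes, lookup = pd.factorize(states)
codes
# array([0, 1, 0, 2])
lookup
# Index(['Colorado', 'Tennessee', 'Virginia'], dtype='object')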

Moreover, in addition to saving memory, this also dramatically improves the performance of pandas. Suppose, for example, you want to subset for customers living in North Carolina. When these states were in an object Series, pandas would have to go to each entry, figure out where the associated Python object is stored, get that object, and check to see if it was the string "North Carolina".

But now, pandas can just go to the lookup table, see that "North Carolina" sits at, say, position 32 (and so is represented by the code 32 in our encoded Series of numbers), and look for values of 32 in that Series. Hooray!
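You don’t have to take my word for the speedup. Here is a sketch in the spirit of the timing comparison above (exact numbers will vary by machine):

import numpy as np
import pandas as pd

# a million customers drawn from a handful of repeated state names
rng = np.random.default_rng(42)
as_object = pd.Series(rng.choice(["Colorado", "Tennessee", "Virginia"], size=1_000_000))
as_category = as_object.astype("category")

%timeit as_object == "Colorado"    # string comparison for every entry
%timeit as_category == "Colorado"  # one lookup, then integer comparisons

On most machines the category comparison is several times faster, and the memory savings (check .memory_usage(deep=True)) are even more dramatic.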

But the best part is that, in most cases, the fact your data has been split into a numeric vector and a lookup table is actually entirely hidden from you, the user. For most operations, using a Categorical Series is just like using an object Series, just faster.

Categorical Series in Practice#

To illustrate how one works with Categorical Series, let’s make a toy version of this customer dataset:

import pandas as pd
import numpy as np

customers = pd.DataFrame(
    {
        "customer": ["Bob", "Aditya", "Francisco", "Shufan"],
        "state": ["Colorado", "Tennessee", "Colorado", "Virginia"],
    }
)
customers
    customer      state
0        Bob   Colorado
1     Aditya  Tennessee
2  Francisco   Colorado
3     Shufan   Virginia
customers.dtypes
customer    object
state       object
dtype: object

As we can see, state begins its life as a standard object Series, but we can convert it to a Categorical with .astype("category"):

customers["state"] = customers["state"].astype("category")
customers
    customer      state
0        Bob   Colorado
1     Aditya  Tennessee
2  Francisco   Colorado
3     Shufan   Virginia

As you can see, at first glance nothing about this column has changed. But if we pull it out, you can see its dtype is category and that the Categories associated with the Series (the lookup table) contain three values: "Colorado", "Tennessee", and "Virginia":

customers["state"]
0     Colorado
1    Tennessee
2     Colorado
3     Virginia
Name: state, dtype: category
Categories (3, object): ['Colorado', 'Tennessee', 'Virginia']

And if you want to, you can see the two underlying pieces directly:

customers["state"].cat.categories
Index(['Colorado', 'Tennessee', 'Virginia'], dtype='object')
customers["state"].cat.codes
0    0
1    1
2    0
3    2
dtype: int8
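And to confirm these two pieces really are all pandas needs, you can rebuild the original strings yourself by using the codes as positions in the lookup table (a sketch of what pandas does under the hood):

# use each integer code as a position in the categories Index
codes = customers["state"].cat.codes.to_numpy()
customers["state"].cat.categories[codes]
# Index(['Colorado', 'Tennessee', 'Colorado', 'Virginia'], dtype='object')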

But as we said, in most cases category Series will operate just like object Series. Subsetting, for example, will work just as it would with an object Series:

customers.loc[customers["state"] == "Colorado"]
    customer     state
0        Bob  Colorado
2  Francisco  Colorado

The only place problems may arise is when editing: you cannot set a cell in a category Series to a value that isn’t already in its Categories table. If you try, you will get an error:

customers.loc[customers["state"] == "Colorado", "state"] = "Kansas"

TypeError                                 Traceback (most recent call last)
/Users/nce8/github/practicaldatascience_book/notebooks/class_3/week_2/37_object_and_categorical_dtypes.ipynb Cell 22 line 1
----> 1 customers.loc[customers["state"] == "Colorado", "state"] = "Kansas"

[...]

TypeError: Cannot setitem on a Categorical with a new category (Kansas), set the categories first

To be clear, you can add novel values; you just have to add the category to the lookup table first with .cat.add_categories():

customers["state"] = customers["state"].cat.add_categories(["Kansas"])
customers["state"]
0     Colorado
1    Tennessee
2     Colorado
3     Virginia
Name: state, dtype: category
Categories (4, object): ['Colorado', 'Tennessee', 'Virginia', 'Kansas']
customers.loc[customers["state"] == "Colorado", "state"] = "Kansas"
customers
    customer      state
0        Bob     Kansas
1     Aditya  Tennessee
2  Francisco     Kansas
3     Shufan   Virginia
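Note that "Colorado" is still in the lookup table even though no customer lives there anymore. If you want to tidy up, pandas provides .cat.remove_unused_categories():

customers["state"] = customers["state"].cat.remove_unused_categories()
customers["state"].cat.categories
# Index(['Tennessee', 'Virginia', 'Kansas'], dtype='object')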

Why Not Always Use Categoricals?#

Categoricals are great, but they are only useful when your object Series has a relatively small number of unique values. If you tried to convert an object Series with hundreds of thousands of addresses — and nearly all of them were unique — into a category Series, then pandas would have to create a lookup table that had… hundreds of thousands of unique entries (essentially, it would just be recreating your original object Series). And so there would be no real performance benefit.