Subsetting and Indexing Series

Subsetting and Indexing Series#

Subsetting Series#

Extracting a subset of elements from a Series is an extremely important task, not least because it generalizes nicely to working with ever-larger datasets (which are at the heart of data science). This process—whether applied to a Series or a dataset—is often referred to as “taking a subset”, “subsetting”, or “filtering”. If there is one skill that is key for enhancing your data science skills quickly, it’s this, because this allows you to get your data into the right format for processing as quickly as possible.

In pandas, there are three ways to filter a Series: using a separate logical Series, using row-number indexing, and using index labels. I tend to use the first method most, but all three are useful. The first and second of these you will recognize from numpy arrays, while the last one (since it uses index labels, which only exist in pandas) is unique to pandas.

Subsetting using row-number indexing#

A different way to subset a Series is to specify the row-numbers you want to keep using the iloc function. (iloc stands for “integer location,” since row numbers are always integers). This will give you the behavior you’re more familiar with from numpy. Just remember that, as in all of Python, the first row is numbered 0!

import pandas as pd

fruits = pd.Series(["apple", "banana"])
fruits.iloc[0]

'apple'

You can also subset with lists of rows, or ranges, just like in numpy:

fruits.iloc[[0, 1]]

0     apple
1    banana
dtype: object

fruits.iloc[0:2]

0     apple
1    banana
dtype: object

Subsetting using index values#

Lastly, we can subset our rows using the index values associated with each row using the loc function.

attendance = pd.Series(
    [132, 94, 112, 84, 254, 322, 472],
    index=[
        "Monday",
        "Tuesday",
        "Wednesday",
        "Thursday",
        "Friday",
        "Saturday",
        "Sunday",
    ],
)

attendance.loc["Monday"]

You can also ask for ranges of index labels. Note that unlike in integer ranges (like the 0:2 we used above to get rows 0 and 1), index label ranges include the last item in the range. So for example, if I ask for .loc["Monday":"Friday"], I will get Friday included, even if .iloc[0:2] doesn’t include 2.

attendance.loc["Monday":"Friday"]

Monday       132
Tuesday       94
Wednesday    112
Thursday      84
Friday       254
dtype: int64

Subsetting with Logical Tests#

Let’s jump right into an example, using our Zoo attendance Series:

attendance = pd.Series(
    [132, 94, 112, 84, 254, 322, 472],
    index=[
        "Monday",
        "Tuesday",
        "Wednesday",
        "Thursday",
        "Friday",
        "Saturday",
        "Sunday",
    ],
)
attendance

Monday       132
Tuesday       94
Wednesday    112
Thursday      84
Friday       254
Saturday     322
Sunday       472
dtype: int64

Suppose we want to only get days with at least 100 people attending. We can subset our Series by using a simple test to build a Series of Booleans (True and False values), then asking pandas for the rows of our Series for which the entry in our test Series is True:

was_busy = attendance > 100
was_busy

Monday        True
Tuesday      False
Wednesday     True
Thursday     False
Friday        True
Saturday      True
Sunday        True
dtype: bool

busy_days = attendance.loc[was_busy]
busy_days

Monday       132
Wednesday    112
Friday       254
Saturday     322
Sunday       472
dtype: int64

We can summarize these methods in the figure below:

There is one really important distinction between how subsetting works in pandas and most other languages though, which has to do with indices. Suppose we want to subset a Series with fruits to only get the entry “apple”. We could do the following:

fruits = pd.Series(["apple", "banana"])
apple_selector = pd.Series([True, False])
fruits.loc[apple_selector]

0    apple
dtype: object

Just like in numpy.

But…

Just as pandas uses index alignment to match entries based on index values when you do an operation between two Series (e.g., first_series * second_series), it also uses index alignment for subsetting with logical tests and .loc.

In the case above, because we did not specify indices for either fruits or apple_selector, they both got the usual default index values of their initial row numbers. But let’s see what happens if we change their indices so that they don’t match their order:

fruits  # We can leave fruits as they are

0     apple
1    banana
dtype: object

apple_selector = pd.Series([True, False], index=[1, 0])
apple_selector

1     True
0    False
dtype: bool

Note that we’ve flipped the index order for apple_selector: the first row has index value 1, and the second row has value 2. Now watch what happens when we put apple_selector in square brackets:

fruits.loc[apple_selector]

1    banana
dtype: object

We get banana! That’s because in apple_selector, the index value associated with the True entry is 1, and the row of fruit that had index value 1 was banana, even though they are in different rows.

But note this only happens if your Boolean array is a Series (and thus has an index). If you pass a numpy Boolean array or a list of Booleans (neither of which have a concept of an index), then despite using loc, alignment will be based on row numbers, not index values (because there are no index values to align).

fruits.loc[[True, False]]

0    apple
dtype: object

UGH I know. If I wrote pandas, this would not work and this would throw an exception. But that’s how it is.

While this distinction between row numbers and index values is important, though, it comes up less often than you’d think. That’s because we don’t usually subset by creating a new Series by hand for subsetting; normally, we build our Boolean Series by executing a test on the Series we’re using. And when we do that, the new Series of Booleans will have the same index as the old Series, so they align naturally.

Look back at our example of was_busy: you’ll see that it automatically ended up with the same index as our original Series, attendance. As a result, the first row of our Boolean Series will have the same index value as the first row of our original Series, the second row of our Boolean Series will have the same index value as the second row of our original Series, and so on. As a result, there’s no difference between matching on row order and matching on index value. But it does occassionally come up (like if you tried to re-sort one of these), so keep it in mind!

Summary#

Being able to select the data you need for a given analysis is a foundational skill to develop. Having the programming proficiency to be able to do this quickly will significantly reduce the time you need to prepare your data for analysis. There are three primary methods of accessing and filtering data: logical indexing, row-number indexing (e.g. iloc), and index labels, and together this toolkit can enhance your ability to access and filter data. Next, you’ll explore an exercise for trying out these skills yourself.