Subsetting DataFrame Tricks and Gotchas#

Two features of subsetting DataFrames are worth special attention: subsetting with simple square brackets ([]), and subsetting columns with dot-notation.

import pandas as pd

world = pd.read_csv(
    "https://raw.githubusercontent.com/nickeubank/"
    "practicaldatascience/master/Example_Data/world-small.csv"
)
world
country region gdppcap08 polityIV
0 Albania C&E Europe 7715 17.8
1 Algeria Africa 8033 10.0
2 Angola Africa 5899 8.0
3 Argentina S. America 14333 18.0
4 Armenia C&E Europe 6070 15.0
... ... ... ... ...
140 Venezuela S. America 12804 16.0
141 Vietnam Asia-Pacific 2785 3.0
142 Yemen Middle East 2400 8.0
143 Zambia Africa 1356 15.0
144 Zimbabwe Africa 188 6.0

145 rows × 4 columns

[] Square brackets#

As with Series, single square brackets in pandas change their behavior depending on the values you pass them. Again, it is worth emphasizing that there is nothing that one can do with square brackets that you can’t do with .loc and .iloc, so if they seem too strange, you don’t have to use them.

With that said, as summarized below, [] is actually much safer on DataFrames than on Series.

The rules of [] in DataFrames are:

  • If your entry is a single column name, or a list of column names, it will return those columns.

  • If your entry is a slice, it will work like iloc and select rows based on row order.

  • If your entry is a Boolean array, and of exactly the same length as the number of rows in your data, it will subset rows.

# Select one column
world["country"].head()
0      Albania
1      Algeria
2       Angola
3    Argentina
4      Armenia
Name: country, dtype: object
# Select multiple columns
world[["country", "gdppcap08"]].head()
country gdppcap08
0 Albania 7715
1 Algeria 8033
2 Angola 5899
3 Argentina 14333
4 Armenia 6070
# Boolean test
world[world["gdppcap08"] > 10000].head()
country region gdppcap08 polityIV
3 Argentina S. America 14333 18.0
5 Australia Asia-Pacific 35677 20.0
6 Austria W. Europe 38152 20.0
8 Bahrain Middle East 34605 3.0
10 Belarus C&E Europe 12261 3.0
# Slice of rows
world[0:3]
country region gdppcap08 polityIV
0 Albania C&E Europe 7715 17.8
1 Algeria Africa 8033 10.0
2 Angola Africa 5899 8.0

My advice on using [] on DataFrames#

[] is much safer on DataFrames because the situation we saw with Series where [] might subset on index labels (if your index labels are integers) or it might subset on row order (if your index labels are not integers) doesn’t exist. Moreover, selecting a single column is extremely common, and this is a case where I use single square brackets all the time.

In a Series, if I pass 0, it’s always unclear whether that’s going to get me the first row (row-order-based) or the row with index value 0 (if I have integer index values).

On a DataFrame, a single entry or list of entries will only attempt to match columns based on index values, and if that fails, it throws an exception rather than defaulting to acting like .iloc:


world[0]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/opt/miniconda3/lib/python3.10/site-packages/pandas/core/indexes/base.py:3652, in Index.get_loc(self, key)
   3651 try:
-> 3652     return self._engine.get_loc(casted_key)
   3653 except KeyError as err:

File ~/opt/miniconda3/lib/python3.10/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File ~/opt/miniconda3/lib/python3.10/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

Similarly, Boolean subsetting always acts like you’re using .loc (aligning on index values where it can, row order if it can’t), and slices in [] always get behavior like .iloc, making behavior much more predictable.

Getting Columns with Dot-Notation#

In addition to passing the name of a column into .loc or to [], columns can also sometimes be access using dot-notation:

world.country.head()
0      Albania
1      Algeria
2       Angola
3    Argentina
4      Armenia
Name: country, dtype: object

This method of getting columns is very easy and intuitive (given how often we use dot-notation in Python more broadly), but it has a couple significant pit-falls:

  • Only works for column names without spaces or punctuation

  • You can’t pass a variable to dot-notation, you have to write out the column explicitly (so you can’t write generalized code).

  • Only works if the column name isn’t the same as an existing method (e.g., df.count will call the count method, even if you have a column named "count")

  • Can cause problems if you try to put it on the left side of the equals sign.

Of these, the reasons for the first and second aren’t complicated, but the third and fourth concerns bear exploring.

Suppose we added a column to our data called rank that gave each country’s GDP rank (this code is a little convoluted because there is an easier way to do this, but this works):

world = world.sort_values("gdppcap08")
world["rank"] = range(0, len(world))
world.head()
country region gdppcap08 polityIV rank
144 Zimbabwe Africa 188 6.0 0
29 Congo Kinshasa Africa 321 15.0 1
76 Liberia Africa 388 10.0 2
53 Guinea-Bissau Africa 538 11.0 3
40 Eritrea Africa 632 3.0 4

Now watch what happens if we try to access the rank column with dot-notation—we don’t get the column, we get the method rank (that’s what bound method NDFrame.rank at the top of this output means — it’s returning the method, then telling us about the object—our DataFrame—the method is associated with):

world.rank
<bound method NDFrame.rank of             country        region  gdppcap08  polityIV  rank
144        Zimbabwe        Africa        188       6.0     0
29   Congo Kinshasa        Africa        321      15.0     1
76          Liberia        Africa        388      10.0     2
53    Guinea-Bissau        Africa        538      11.0     3
40          Eritrea        Africa        632       3.0     4
..              ...           ...        ...       ...   ...
62          Ireland     W. Europe      44200      20.0   140
137   United States    N. America      46716      20.0   141
114       Singapore  Asia-Pacific      49284       8.0   142
98           Norway   Scandinavia      58138      20.0   143
107           Qatar   Middle East      85868       0.0   144

[145 rows x 5 columns]>

Similar issues arise if you try to assign a column using dot-notation on the left side of the assignment operator. For example, suppose we want to shift everyone’s rank up by 1:

# We make an assignment to what we *think* is
# the `rank` column
world.rank = world["rank"] + 1
world.head()
country region gdppcap08 polityIV rank
144 Zimbabwe Africa 188 6.0 0
29 Congo Kinshasa Africa 321 15.0 1
76 Liberia Africa 388 10.0 2
53 Guinea-Bissau Africa 538 11.0 3
40 Eritrea Africa 632 3.0 4

It fails silently because what you’ve actually done is over-written the method rank with the column rank plus 1. Now now only has your rank column not changed (see it still starts with 0), but now you’ve broken the rank method:


world.rank()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 world.rank()

TypeError: 'Series' object is not callable

And finally, if you try to create a column using dot-notation on the left-hand side of the assignment operator, you will also get into trouble:

world.rank_doubled = range(0, 2 * len(world), 2)
world.head()
/var/folders/fs/h_8_rwsn5hvg9mhp0txgc_s9v6191b/T/ipykernel_22305/4041630850.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  world.rank_doubled = range(0, 2 * len(world), 2)
country region gdppcap08 polityIV rank
144 Zimbabwe Africa 188 6.0 0
29 Congo Kinshasa Africa 321 15.0 1
76 Liberia Africa 388 10.0 2
53 Guinea-Bissau Africa 538 11.0 3
40 Eritrea Africa 632 3.0 4

See now rank_doubled wasn’t added to your DataFrame? It just disappears. pandas does now raise a warning, but warnings don’t stop your code from running, so if you don’t see it, you can corrupt your data.

My advice on dot-notation:

  • Just don’t use dot-notation on the left-hand side of the assignment operator. You’re begging for trouble, and just making that a rule means you don’t have to worry about when you might be causing a conflict with a method.

  • Try not to use it on the right side of the assignment operator. It’s safer than using it on the left side of the assignment operator, but none of us will ever memorize all the names of methods in pandas, and if your column happens to have the same name as a method, you may not notice the error.