The View/Copy Headache in pandas without Copy on Write

After the last reading, you may be saying “ugh, but I don’t want to write pd.set_option("mode.copy_on_write", True) at the top of all my files!” I hear you. I don’t either. But allow me to explain what happens without Copy on Write (CoW) enabled:

Without Copy on Write, whether you get a view or a copy in pandas—and whether changes made to a view will propagate back to the original DataFrame—depends not only on the operations you execute (.loc, .iloc, etc.), but also on the structure of the data in the original DataFrame in ways that are, essentially, impossible to predict consistently.

Allow me to demonstrate.

The Problem: A Demonstration

To illustrate how weird views and copies can be in pandas without CoW, let’s look at two examples of basically identical manipulations that result in very different behavior.

First, here is some code that takes a subset of a DataFrame and then modifies the original DataFrame. As we will see, the change also shows up in the slice:

import pandas as pd

# This is the default option so I don't strictly
# need this command, but I'll add it to be explicit
pd.set_option("mode.copy_on_write", False)

df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [11, 12, 13, 14]})
df
a b
0 10 11
1 20 12
2 30 13
3 40 14
my_slice = df.iloc[1:3,]
my_slice
a b
1 20 12
2 30 13
df.iloc[1, 1] = -1
df
a b
0 10 11
1 20 -1
2 30 13
3 40 14
my_slice
a b
1 20 -1
2 30 13

Voilà, the change to the DataFrame has propagated to the slice, so clearly, the slice was a view, right? Well… kinda.

Now observe what happens if we do the same operation but, instead of changing the entry at df.iloc[1, 1] to -1, change it to 3.14. You would assume the behavior of pandas would be unchanged, but…

# Same DataFrame and subset:
df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [11, 12, 13, 14]})
my_slice = df.iloc[1:3,]
my_slice
a b
1 20 12
2 30 13
# But now we set the value to 3.14 instead of -1.
df.iloc[1, 1] = 3.14
df
/var/folders/fs/h_8_rwsn5hvg9mhp0txgc_s9v6191b/T/ipykernel_24865/2520817220.py:2: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '3.14' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
  df.iloc[1, 1] = 3.14
a b
0 10 11.00
1 20 3.14
2 30 13.00
3 40 14.00
# And as you can see, in this instance
# the data in `my_slice` is unchanged.
my_slice
a b
1 20 12
2 30 13

(Why this happens isn’t actually important to understand, but for those who are interested: in the first modification, I replaced one integer with another, so that operation could be done in the existing integer array; in the second, I tried to put a floating point number into an integer array. This can’t be done, so a new floating point array was created, and that new array replaced the old one as column b in the original DataFrame, breaking the “view” connection.)
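For the curious, here is a minimal sketch of that mechanism using a bare NumPy array. It is an analogy under stated assumptions, not a reproduction of pandas internals: the names column, view, and upcast are made up for illustration, and the explicit astype call stands in for whatever pandas does when it has to upcast a column.

```python
import numpy as np

# A plain integer array standing in for column b
# (an analogy, not the actual pandas internals).
column = np.array([11, 12, 13, 14])
view = column[1:3]  # basic slicing in NumPy returns a view

# Integer into an integer array: written in place, so the view sees it.
column[1] = -1
print(view.tolist())  # [-1, 13]

# A float needs a float array, so the data is copied into a new array first.
upcast = column.astype(float)
upcast[1] = 3.14

# The new array shares no memory with the old view, which is left untouched.
print(np.shares_memory(upcast, view))  # False
print(view.tolist())  # [-1, 13]
```

The in-place write is visible through the view, while the upcast produces a separate array that the view knows nothing about. That is the same split we saw between the -1 and 3.14 examples.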

Note that this behavior applies not just to row slices, but also column slices:

df
a b
0 10 11.00
1 20 3.14
2 30 13.00
3 40 14.00
# This initial change propagates
column_a = df["a"]
df.iloc[0, 0] = -42
column_a
0   -42
1    20
2    30
3    40
Name: a, dtype: int64
# But this does not
df.iloc[0, 0] = "a"
df
/var/folders/fs/h_8_rwsn5hvg9mhp0txgc_s9v6191b/T/ipykernel_24865/673548005.py:2: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value 'a' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
  df.iloc[0, 0] = "a"
a b
0 a 11.00
1 20 3.14
2 30 13.00
3 40 14.00
column_a
0   -42
1    20
2    30
3    40
Name: a, dtype: int64

Dealing with Views If You Don’t Want To Use CoW

I won’t mince words: I think this behavior is deeply problematic, I’ve long advocated for it to be changed, and I’m fully on the Copy on Write train. But if for some reason you think you don’t need it…

SettingWithCopyWarning

To help address this issue, pandas has a built-in alert system, called the SettingWithCopyWarning, that will sometimes warn you when you’re in a situation that may cause problems. You can see it here:

import numpy as np

df = pd.DataFrame({"a": np.arange(4), "b": ["w", "x", "y", "z"]})
my_slice = df["a"]
my_slice
0    0
1    1
2    2
3    3
Name: a, dtype: int64
my_slice.iloc[1] = 2
/var/folders/fs/h_8_rwsn5hvg9mhp0txgc_s9v6191b/T/ipykernel_24865/164737205.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  my_slice.iloc[1] = 2

Any time you see a SettingWithCopyWarning, go up to where the possible view was created (in this case, my_slice = df["a"]) and add a .copy():

my_slice = df["a"].copy()
my_slice.iloc[1] = 2

The Problem with SettingWithCopyWarning

The bad news is that the SettingWithCopyWarning only flags one pattern where the copy-view problem crops up. If you follow the link provided in the warning, you’ll see it wasn’t designed to address the copy-view problem writ large, but rather a narrower behavior where the user tries to change a subset of a DataFrame incorrectly (we’ll talk more about that in our coming readings). Indeed, you’ll notice we didn’t get a single SettingWithCopyWarning until the section where we started talking about that warning in particular (and I created an example designed to set it off).

So: if you see a SettingWithCopyWarning, do not ignore it. Find where you may have created a view and add a .copy() so the warning goes away. But just because you don’t see that warning doesn’t mean you’re in the clear!

Which leads me to what I will admit is an infuriating piece of advice to have to offer: if you take a subset for any purpose other than immediately analyzing it, you should add .copy() to that subsetting operation. Seriously. When in doubt, .copy().
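To make that advice concrete, here is a small sketch that reuses the DataFrame from the first demonstration but adds .copy() when the subset is taken. Because the subset owns its own data, changes should not travel in either direction, whether or not Copy on Write is enabled.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [11, 12, 13, 14]})

# An explicit copy: the subset no longer depends on df's data.
my_slice = df.iloc[1:3].copy()

# Changing the original does not touch the copy...
df.iloc[1, 1] = -1
print(my_slice["b"].tolist())  # [12, 13]

# ...and changing the copy does not touch the original.
my_slice.iloc[0, 0] = 999
print(df["a"].tolist())  # [10, 20, 30, 40]
```

The cost is an extra copy of the subset in memory, which is usually a fair price for behavior you can reason about.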