Views and Copies#

Before we dive into matrices, there’s one nuance to how vectors (and indeed all numpy arrays) work that we need to cover: how numpy manages memory allocation when we take subsets of arrays.

Because this reading relates to the nuances of how numpy is actually managing how it writes 1s and 0s in memory, it may seem a little intimidating at first—don’t worry! This topic is definitely a little mind-bending, and many learners may need to read this more than once to develop a good understanding of what’s going on, and most learners won’t develop an intuitive sense of what’s going on until they have been working with numpy for a while. But this isn’t the start of a big transition of the course into abstract computer science theory—we will get back to working with data quickly. This is just one kinda esoteric topic that numpy users have to wrestle with that we can’t avoid introducing around here.

Memory Allocation & Subsetting#

In our previous reading, we talked about how we could not just look at subsets of vectors, but also store those subsets in a new variable. For example, we could pull the middle entry of one vector and assign it to a new vector:

import numpy as np

a = np.array([42, 47, -1])
a
array([42, 47, -1])
new = a[1]
new
47

At the time, we illustrated what was happening with this set of pictures:

vector_subsetting1 vector_subsetting2 vector_subsetting3

And that was close to the truth about what was going on, but it wasn’t quite the full truth.

The reality is that when we create a subset in numpy and assign it to a new variable, what is actually happening is not that the variable is being assigned a copy of the values being the subset, but rather the variable is being assigned a reference to the subset, something that looks more like this:

Views and copies reference

When numpy creates a reference to a subset of an existing array, that reference is called a view, because it’s not a copy of the data in the original array, but an easy way to refer back to the original array—it provides a view onto a subset of the original array.

Why is this distinction important? It’s important because it means that both variables—a and new are actually both referencing the same data, and so changes made to one variable propagate to the other.

To illustrate in more detail, let’s create two new vectors: my_vector and my_subset, where my_subset (as the name implies) is just a subset of my_vector:

my_vector = np.array([1, 2, 3, 4])
my_vector
array([1, 2, 3, 4])

Views and copies reference

my_subset = my_vector[1:3]
my_subset
array([2, 3])

Now suppose we change the first entry of my_subset to -99:

my_subset[0] = -99

Since the first entry in my_subset is just a reference to the second entry in my_vector, the change I made to my_subset will also propagate to my_vector:

my_vector
array([  1, -99,   3,   4])

Views and copies reference

Note that because my_subset is a view of the second and third elements of my_vector, our change to the first element of my_subset resulted in a change to the second element of my_vector.

And just as edits to my_subset propagate to my_vector, edits to my_vector will also propagate to my_subset. Suppose we changed the second entry of my_vector to 42. What would happen to my_subset? Which entry would change?

Well if we return to our illustration, we see:

Views and copies reference

And indeed we can confirm the change occurs to the second element of my_subset:

my_vector[2] = 42
my_subset
array([-99,  42])

Language and Symmetry#

It’s worth pausing for a moment to point out a bit of a problem with the language of views and copies. It is common, in numpy circles, to look at the example above and talk about my_vector being the original data and my_subset as a view. And it is true that, because my_vector came first, there is a difference between my_vector and my_subset in terms of how numpy is creating and managing these objects.

But from your perspective as a user, it is important to recognize that there is a symmetric dependency between my_vector and my_subset in the example above. Yes, one may be “the original,” but once a view has been created, changes to either array have the potential to propagate to the other: changes to the my_subset may resultant changes to my_vector, and changes to my_vector can impact the my_subset (if they impact the portion of the array referenced by the subset).

So when you think about views, always remember that what we’re talking about is multiple objects sharing the same data, even if we tend to only talk about one of our arrays as “a view.”

Why? Why Would numpy Do This?!#

It is not uncommon for students to feel a little betrayed by numpy when they are first introduced to this behavior. “Why,” they ask, “why would numpy do something that makes it so much harder to keep track of the consequences of changes I make to my data?”

The short answer, as with most things in numpy, is that it’s all about speed. Creating a new copy of the data contained in the subset of a vector takes time (it literally requires your computer to write lots of 1s and 0s to memory), and so creating views instead of copies makes numpy faster.

How much faster? The short answer is a lot faster. The longer answer? Well, let’s talk a little more about how views and copies work, then in our next reading, we can do an experiment to measure the speed difference below.