{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas Indices (Index Labels)\n", "\n", "One of the defining features of pandas data structures is that all rows and columns come with *labels*. Index labels (the labels assigned to rows of the data) are always present in both Series and DataFrames and can be accessed through the `.index` attribute (e.g., `my_dataframe.index`), and column labels are always present in DataFrames and can be accessed through the `.columns` attribute (e.g., `my_dataframe.columns`).\n", "\n", "The existence of column names, I think most people would agree, is an unequivocal strength of pandas — it makes it easy to remember the *substantive* meaning of the different variables in your data, and to write code that very explicitly states what variables are being manipulated (as opposed to working with column numbers, as one does with numpy). \n", "\n", "The value of all pandas rows having index labels, however, is less clear. Actually, strike that. We're comfortable saying that **the fact all pandas rows come with index labels is one of the most confusing and least useful features of pandas.** And so in this reading, while we will review the basics of how these index labels work, we our primary focus will actually be to help you learn how to prevent them from getting in the way.\n", "\n", "This is, admittedly, a very opinionated take on index labels, but I think that the fact that pandas is the *only* tabular data library we know of that has this feature — including many libraries developed since pandas that could have adopted this approach if they'd wanted — is illustrative of the fact that most people don't see this as a particularly valuable addition to tabular data structures like Series and DataFrames.\n", "\n", "## Vocabulary, A Review\n", "\n", "As noted in our readings on Series and DataFrames, **in this course** we will try to always make use of two distinct terms for the different ways to identify rows in a pandas data structure:\n", "\n", "- *Index labels*: these are labels for rows that can be accessed through the `.index` attribute (e.g., `my_series.index`). To access a row using index labels, one uses the `.loc[]` operator. Note that what *we* call index labels are usually just referred to as the \"index\" in most documentation about pandas.\n", "- *Row numbers*: these are the numbers one would naturally assign to the rows of a Series or DataFrame based on the relative order in which they appear, starting with row 0 (since this is Python), then moving to row 1, row 2, etc. These row numbers operate the same way row numbers operate in numpy (or in other languages, like R, Julia, Java, etc.), and you can access rows by row number using `.iloc[]`. \n", "\n", "While these seem like obviously distinct concepts in the abstract, things quickly become confusing because the *default* index labels pandas assigns to rows is... their row numbers. 🤦‍♂️ As a result, when you load a new dataset, if you don't set the index explicitly, each row's index labels and row numbers will be the same:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryregiongdppcap08polityIV
0BrazilS. America1029618
1GermanyW. Europe3561320
2MexicoN. America1449518
3MozambiqueAfrica85516
4RussiaC&E Europe1613917
5UkraineC&E Europe727116
\n", "
" ], "text/plain": [ " country region gdppcap08 polityIV\n", "0 Brazil S. America 10296 18\n", "1 Germany W. Europe 35613 20\n", "2 Mexico N. America 14495 18\n", "3 Mozambique Africa 855 16\n", "4 Russia C&E Europe 16139 17\n", "5 Ukraine C&E Europe 7271 16" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "small_world = pd.read_csv(\"data/world-very-small.csv\")\n", "small_world" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, for example, Mexico's index label is `2`:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "country Mexico\n", "region N. America\n", "gdppcap08 14495\n", "polityIV 18\n", "Name: 2, dtype: object" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "small_world.loc[2, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*And* its row number is `2`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "country Mexico\n", "region N. America\n", "gdppcap08 14495\n", "polityIV 18\n", "Name: 2, dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "small_world.iloc[2, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But the fact that index labels and row numbers *start off* as being the same does not mean they are guaranteed to stay that way. That's because when one sorts a Series or DataFrame, index labels *stay with the observation* with which they were originally associated, while row numbers always depend on the relative order in which observations appear.\n", "\n", "Thus if I sort our DataFrame by GDP per capita (thus shuffling the order of rows), you will see that index label `2` has followed Mexico to the fourth row (row number 3, since we count from 0):" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryregiongdppcap08polityIV
3MozambiqueAfrica85516
5UkraineC&E Europe727116
0BrazilS. America1029618
2MexicoN. America1449518
4RussiaC&E Europe1613917
1GermanyW. Europe3561320
\n", "
" ], "text/plain": [ " country region gdppcap08 polityIV\n", "3 Mozambique Africa 855 16\n", "5 Ukraine C&E Europe 7271 16\n", "0 Brazil S. America 10296 18\n", "2 Mexico N. America 14495 18\n", "4 Russia C&E Europe 16139 17\n", "1 Germany W. Europe 35613 20" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted_small_world = small_world.sort_values(\"gdppcap08\")\n", "sorted_small_world" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This illustrates a key feature of index labels: in that they stay with a given observation regardless of how the data is sorted, they are like any other column of the DataFrame. The only difference is that the *syntax* for manipulating DataFrames using the index labels is different from the syntax for other columns. Want to sort by any column other than the index? Use `.sort_values(\"name_of_column\")`. Want to sort by the index labels? `.sort_index()`. Want to get rows where the column `polityIV == 16`? `small_world[small_world.polityIV == 16]`. Want to get the rows where the index is equal to 16? `small_world.loc[16]`. \n", "\n", "And that, in a nutshell, is the problem: index labels make one column of data special, and in the process, it forces users to write code to access that column in a manner that's different from any other column. And for nearly all use cases, for no good reason (we'll touch on the \"why\" argument defenders of index labels would make below). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Minimalist Strategy for Working with Index Labels\n", "\n", "**The best advice we can give you for working with index labels is to ignore them whenever you can, and if you do happen to need the data in index labels, just copy the data in your index labels over into a normal column and work with it as you would any other variable.**\n", "\n", "To illustrate, suppose you have a dataset with data you actually care about stored as index labels. This can happen — even if you don't want it to! — as a side effect of some functions we will learn about next week (like `.groupby()` or `.pivot()`). To illustrate, though, we will just force pandas to use the GDP per capita column as the index when we load our data using the `index_col` argument:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryregionpolityIV
gdppcap08
10296BrazilS. America18
35613GermanyW. Europe20
14495MexicoN. America18
855MozambiqueAfrica16
16139RussiaC&E Europe17
7271UkraineC&E Europe16
\n", "
" ], "text/plain": [ " country region polityIV\n", "gdppcap08 \n", "10296 Brazil S. America 18\n", "35613 Germany W. Europe 20\n", "14495 Mexico N. America 18\n", "855 Mozambique Africa 16\n", "16139 Russia C&E Europe 17\n", "7271 Ukraine C&E Europe 16" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data_in_index = pd.read_csv(\"data/world-very-small.csv\", index_col=\"gdppcap08\")\n", "world_data_in_index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Obviously, GDP per capita is data we will probably want to use, so we want to move the data into a normal column. We can do this with the `.reset_index()` method:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gdppcap08countryregionpolityIV
010296BrazilS. America18
135613GermanyW. Europe20
214495MexicoN. America18
3855MozambiqueAfrica16
416139RussiaC&E Europe17
57271UkraineC&E Europe16
\n", "
" ], "text/plain": [ " gdppcap08 country region polityIV\n", "0 10296 Brazil S. America 18\n", "1 35613 Germany W. Europe 20\n", "2 14495 Mexico N. America 18\n", "3 855 Mozambique Africa 16\n", "4 16139 Russia C&E Europe 17\n", "5 7271 Ukraine C&E Europe 16" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_data = world_data_in_index.reset_index()\n", "world_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And that's it! The data has just been moved over into a normal column, and we can work with it the way we normally would and once again ignore our index labels. \n", "\n", "(Note that index labels don't always have a name, in which case they will become a column called `\"index\"` and you may have to rename the column)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Index Alignment\n", "\n", "As we said above, in our view the best way to deal with index labels is to ignore them, and if you somehow end up with data stored as index labels, move that data back to a normal column with `.reset_index()`.\n", "\n", "But there is one pandas behavior related to index labels all users must be aware of, even those of us who try and avoid indexes as much as possible: **index alignment.**\n", "\n", "In almost any library that supports arrays, when you try to interact two vectors or arrays (like adding them together), observations are matched up by the order they appear. For example, in numpy, if I add two vectors of length three, the first entry of the first vector will be added to the first entry of the second vector, the second entry will be added to the second entry, etc.: " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([5, 7, 9])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "my_vector1 = np.array([1, 2, 3])\n", "my_vector2 = np.array([4, 5, 6])\n", "\n", "my_vector1 + my_vector2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But if you attempt to interact two pandas Series, observations will be matched up not by their row numbers, but by index labels, as shown in this figure:\n", "\n", "![Indices Figure](img/3.3.7-indices.png)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 6\n", "b 6\n", "c 9\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.Series([1, 2, 3], index=[\"a\", \"b\", \"c\"])\n", "df2 = pd.Series([4, 5, 6], index=[\"b\", \"a\", \"c\"])\n", "\n", "df1 + df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is... bananas. We hate this behavior. If you know what a `join` or a `merge` are, index labels are doing an implicit outer join (if you don't know about those, don't worry — we'll cover them next week!). But this behavior exists, so you need to know about it.\n", "\n", "Moreover, if the Series' index labels don't fully overlap, the result — say the second Series doesn't have an index label of `\"c\"`, pandas will generate a result that has all possible index label values (the union of the two index labels) with data just missing (the value will be `NaN`, or Not a Number, which we'll discuss more later this week):" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 6.0\n", "b 6.0\n", "c NaN\n", "dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.Series([1, 2, 3], index=[\"a\", \"b\", \"c\"])\n", "df2 = pd.Series([4, 5], index=[\"b\", \"a\"])\n", "\n", "df1 + df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Does This Come Up a Lot?\n", "\n", "OK, now the good news: it is pretty rare this comes up. *Most* of the time when users are manipulating a pair of Series, they are different variables from the same DataFrame. For example, dividing GDP by population to get GDP per capita:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryGDPpopulation
0United States26949643334233854
1Argentina62183347327407
\n", "
" ], "text/plain": [ " country GDP population\n", "0 United States 26949643 334233854\n", "1 Argentina 621833 47327407" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world = pd.DataFrame(\n", " {\n", " \"country\": [\"United States\", \"Argentina\"],\n", " \"GDP\": [26_949_643, 621_833],\n", " \"population\": [334_233_854, 47_327_407],\n", " }\n", ")\n", "world" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryGDPpopulationgdp_per_cap
0United States269496433342338540.080631
1Argentina621833473274070.013139
\n", "
" ], "text/plain": [ " country GDP population gdp_per_cap\n", "0 United States 26949643 334233854 0.080631\n", "1 Argentina 621833 47327407 0.013139" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world[\"gdp_per_cap\"] = world[\"GDP\"] / world[\"population\"]\n", "world" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And since all Series in a DataFrame share a common index, observations will line up the way you would expect when doing this." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 26949643\n", "1 621833\n", "Name: GDP, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world[\"GDP\"]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 334233854\n", "1 47327407\n", "Name: population, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world[\"population\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But it's an issue that *can* arise, so it's important you know about it!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## So... If They're Such A Pain, Why Do Index Labels Exist?\n", "\n", "I think index labels came into existence out of a well-meaning desire to make life easier for users by offering two special features for free. The first is index alignment. By *implicitly* doing a merge/join of pandas Series, I think the idea was that it would make life easier by not requiring users to do their merges explicitly. But... sometimes it's better to have simple, predictable behaviors (matching up row 1 to row 1, etc.), and to let people do merges explicitly when they want.\n", "\n", "The second is that pulling out rows using `.loc[]` and an index label (e.g., `small_world.loc[16]`) is faster than subsetting from a regular column (e.g., `small_world.loc[small_world[\"polityIV\"] == 16]`). That's because index labels are basically like a Python dictionary — getting values using an index label is an O(1) operation (if you recall that discussion from previous weeks), while subsetting based on regular column values is O(N).\n", "\n", "The problem, however, is that grabbing an arbitrary row really fast isn't a common data science use case, while index labels are present in all pandas data structures. Having a way to get an arbitrary row quickly is a nice feature, and one other libraries have implemented as an *optional* feature. But baking it into all data structures just complicates everything to accommodate something most people never do with pandas." ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }