{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas Indices (Index Labels)\n",
    "\n",
    "One of the defining features of pandas data structures is that all rows and columns come with *labels*. Index labels (the labels assigned to rows of the data) are always present in both Series and DataFrames and can be accessed through the `.index` attribute (e.g., `my_dataframe.index`), and column labels are always present in DataFrames and can be accessed through the `.columns` attribute (e.g., `my_dataframe.columns`).\n",
    "\n",
    "The existence of column names, I think most people would agree, is an unequivocal strength of pandas — it makes it easy to remember the *substantive* meaning of the different variables in your data, and to write code that very explicitly states what variables are being manipulated (as opposed to working with column numbers, as one does with numpy). \n",
    "\n",
    "The value of all pandas rows having index labels, however, is less clear. Actually, strike that. We're comfortable saying that **the fact all pandas rows come with index labels is one of the most confusing and least useful features of pandas.** And so in this reading, while we will review the basics of how these index labels work, we our primary focus will actually be to help you learn how to prevent them from getting in the way.\n",
    "\n",
    "This is, admittedly, a very opinionated take on index labels, but I think that the fact that pandas is the *only* tabular data library we know of that has this feature — including many libraries developed since pandas that could have adopted this approach if they'd wanted — is illustrative of the fact that most people don't see this as a particularly valuable addition to tabular data structures like Series and DataFrames.\n",
    "\n",
    "## Vocabulary, A Review\n",
    "\n",
    "As noted in our readings on Series and DataFrames, **in this course** we will try to always make use of two distinct terms for the different ways to identify rows in a pandas data structure:\n",
    "\n",
    "- *Index labels*: these are labels for rows that can be accessed through the `.index` attribute (e.g., `my_series.index`). To access a row using index labels, one uses the `.loc[]` operator. Note that what *we* call index labels are usually just referred to as the \"index\" in most documentation about pandas.\n",
    "- *Row numbers*: these are the numbers one would naturally assign to the rows of a Series or DataFrame based on the relative order in which they appear, starting with row 0 (since this is Python), then moving to row 1, row 2, etc. These row numbers operate the same way row numbers operate in numpy (or in other languages, like R, Julia, Java, etc.), and you can access rows by row number using `.iloc[]`. \n",
    "\n",
    "While these seem like obviously distinct concepts in the abstract, things quickly become confusing because the *default* index labels pandas assigns to rows is... their row numbers. 🤦‍♂️ As a result, when you load a new dataset, if you don't set the index explicitly, each row's index labels and row numbers will be the same:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Brazil</td>\n",
       "      <td>S. America</td>\n",
       "      <td>10296</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Germany</td>\n",
       "      <td>W. Europe</td>\n",
       "      <td>35613</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Mexico</td>\n",
       "      <td>N. America</td>\n",
       "      <td>14495</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Mozambique</td>\n",
       "      <td>Africa</td>\n",
       "      <td>855</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Russia</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>16139</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Ukraine</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>7271</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      country      region  gdppcap08  polityIV\n",
       "0      Brazil  S. America      10296        18\n",
       "1     Germany   W. Europe      35613        20\n",
       "2      Mexico  N. America      14495        18\n",
       "3  Mozambique      Africa        855        16\n",
       "4      Russia  C&E Europe      16139        17\n",
       "5     Ukraine  C&E Europe       7271        16"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "small_world = pd.read_csv(\"data/world-very-small.csv\")\n",
    "small_world"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, for example, Mexico's index label is `2`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "country          Mexico\n",
       "region       N. America\n",
       "gdppcap08         14495\n",
       "polityIV             18\n",
       "Name: 2, dtype: object"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "small_world.loc[2, :]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*And* its row number is `2`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "country          Mexico\n",
       "region       N. America\n",
       "gdppcap08         14495\n",
       "polityIV             18\n",
       "Name: 2, dtype: object"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "small_world.iloc[2, :]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But the fact that index labels and row numbers *start off* as being the same does not mean they are guaranteed to stay that way. That's because when one sorts a Series or DataFrame, index labels *stay with the observation* with which they were originally associated, while row numbers always depend on the relative order in which observations appear.\n",
    "\n",
    "Thus if I sort our DataFrame by GDP per capita (thus shuffling the order of rows), you will see that index label `2` has followed Mexico to the fourth row (row number 3, since we count from 0):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Mozambique</td>\n",
       "      <td>Africa</td>\n",
       "      <td>855</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Ukraine</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>7271</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Brazil</td>\n",
       "      <td>S. America</td>\n",
       "      <td>10296</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Mexico</td>\n",
       "      <td>N. America</td>\n",
       "      <td>14495</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Russia</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>16139</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Germany</td>\n",
       "      <td>W. Europe</td>\n",
       "      <td>35613</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      country      region  gdppcap08  polityIV\n",
       "3  Mozambique      Africa        855        16\n",
       "5     Ukraine  C&E Europe       7271        16\n",
       "0      Brazil  S. America      10296        18\n",
       "2      Mexico  N. America      14495        18\n",
       "4      Russia  C&E Europe      16139        17\n",
       "1     Germany   W. Europe      35613        20"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted_small_world = small_world.sort_values(\"gdppcap08\")\n",
    "sorted_small_world"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This illustrates a key feature of index labels: in that they stay with a given observation regardless of how the data is sorted, they are like any other column of the DataFrame. The only difference is that the *syntax* for manipulating DataFrames using the index labels is different from the syntax for other columns. Want to sort by any column other than the index? Use `.sort_values(\"name_of_column\")`. Want to sort by the index labels? `.sort_index()`. Want to get rows where the column `polityIV == 16`? `small_world[small_world.polityIV == 16]`. Want to get the rows where the index is equal to 16? `small_world.loc[16]`. \n",
    "\n",
    "And that, in a nutshell, is the problem: index labels make one column of data special, and in the process, it forces users to write code to access that column in a manner that's different from any other column. And for nearly all use cases, for no good reason (we'll touch on the \"why\" argument defenders of index labels would make below). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A Minimalist Strategy for Working with Index Labels\n",
    "\n",
    "**The best advice we can give you for working with index labels is to ignore them whenever you can, and if you do happen to need the data in index labels, just copy the data in your index labels over into a normal column and work with it as you would any other variable.**\n",
    "\n",
    "To illustrate, suppose you have a dataset with data you actually care about stored as index labels. This can happen — even if you don't want it to! — as a side effect of some functions we will learn about next week (like `.groupby()` or `.pivot()`). To illustrate, though, we will just force pandas to use the GDP per capita column as the index when we load our data using the `index_col` argument:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gdppcap08</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10296</th>\n",
       "      <td>Brazil</td>\n",
       "      <td>S. America</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35613</th>\n",
       "      <td>Germany</td>\n",
       "      <td>W. Europe</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14495</th>\n",
       "      <td>Mexico</td>\n",
       "      <td>N. America</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>855</th>\n",
       "      <td>Mozambique</td>\n",
       "      <td>Africa</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16139</th>\n",
       "      <td>Russia</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7271</th>\n",
       "      <td>Ukraine</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              country      region  polityIV\n",
       "gdppcap08                                  \n",
       "10296          Brazil  S. America        18\n",
       "35613         Germany   W. Europe        20\n",
       "14495          Mexico  N. America        18\n",
       "855        Mozambique      Africa        16\n",
       "16139          Russia  C&E Europe        17\n",
       "7271          Ukraine  C&E Europe        16"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data_in_index = pd.read_csv(\"data/world-very-small.csv\", index_col=\"gdppcap08\")\n",
    "world_data_in_index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obviously, GDP per capita is data we will probably want to use, so we want to move the data into a normal column. We can do this with the `.reset_index()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10296</td>\n",
       "      <td>Brazil</td>\n",
       "      <td>S. America</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>35613</td>\n",
       "      <td>Germany</td>\n",
       "      <td>W. Europe</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>14495</td>\n",
       "      <td>Mexico</td>\n",
       "      <td>N. America</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>855</td>\n",
       "      <td>Mozambique</td>\n",
       "      <td>Africa</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>16139</td>\n",
       "      <td>Russia</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>7271</td>\n",
       "      <td>Ukraine</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   gdppcap08     country      region  polityIV\n",
       "0      10296      Brazil  S. America        18\n",
       "1      35613     Germany   W. Europe        20\n",
       "2      14495      Mexico  N. America        18\n",
       "3        855  Mozambique      Africa        16\n",
       "4      16139      Russia  C&E Europe        17\n",
       "5       7271     Ukraine  C&E Europe        16"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world_data = world_data_in_index.reset_index()\n",
    "world_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And that's it! The data has just been moved over into a normal column, and we can work with it the way we normally would and once again ignore our index labels. \n",
    "\n",
    "(Note that index labels don't always have a name, in which case they will become a column called `\"index\"` and you may have to rename the column)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Index Alignment\n",
    "\n",
    "As we said above, in our view the best way to deal with index labels is to ignore them, and if you somehow end up with data stored as index labels, move that data back to a normal column with `.reset_index()`.\n",
    "\n",
    "But there is one pandas behavior related to index labels all users must be aware of, even those of us who try and avoid indexes as much as possible: **index alignment.**\n",
    "\n",
    "In almost any library that supports arrays, when you try to interact two vectors or arrays (like adding them together), observations are matched up by the order they appear. For example, in numpy, if I add two vectors of length three, the first entry of the first vector will be added to the first entry of the second vector, the second entry will be added to the second entry, etc.:  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([5, 7, 9])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "my_vector1 = np.array([1, 2, 3])\n",
    "my_vector2 = np.array([4, 5, 6])\n",
    "\n",
    "my_vector1 + my_vector2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But if you attempt to interact two pandas Series, observations will be matched up not by their row numbers, but by index labels, as shown in this figure:\n",
    "\n",
    "![Indices Figure](img/3.3.7-indices.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    6\n",
       "b    6\n",
       "c    9\n",
       "dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1 = pd.Series([1, 2, 3], index=[\"a\", \"b\", \"c\"])\n",
    "df2 = pd.Series([4, 5, 6], index=[\"b\", \"a\", \"c\"])\n",
    "\n",
    "df1 + df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is... bananas. We hate this behavior. If you know what a `join` or a `merge` are, index labels are doing an implicit outer join (if you don't know about those, don't worry — we'll cover them next week!). But this behavior exists, so you need to know about it.\n",
    "\n",
    "Moreover, if the Series' index labels don't fully overlap, the result — say the second Series doesn't have an index label of `\"c\"`, pandas will generate a result that has all possible index label values (the union of the two index labels) with data just missing (the value will be `NaN`, or Not a Number, which we'll discuss more later this week):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    6.0\n",
       "b    6.0\n",
       "c    NaN\n",
       "dtype: float64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1 = pd.Series([1, 2, 3], index=[\"a\", \"b\", \"c\"])\n",
    "df2 = pd.Series([4, 5], index=[\"b\", \"a\"])\n",
    "\n",
    "df1 + df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Does This Come Up a Lot?\n",
    "\n",
    "OK, now the good news: it is pretty rare this comes up. *Most* of the time when users are manipulating a pair of Series, they are different variables from the same DataFrame. For example, dividing GDP by population to get GDP per capita:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>GDP</th>\n",
       "      <th>population</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>United States</td>\n",
       "      <td>26949643</td>\n",
       "      <td>334233854</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Argentina</td>\n",
       "      <td>621833</td>\n",
       "      <td>47327407</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         country       GDP  population\n",
       "0  United States  26949643   334233854\n",
       "1      Argentina    621833    47327407"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world = pd.DataFrame(\n",
    "    {\n",
    "        \"country\": [\"United States\", \"Argentina\"],\n",
    "        \"GDP\": [26_949_643, 621_833],\n",
    "        \"population\": [334_233_854, 47_327_407],\n",
    "    }\n",
    ")\n",
    "world"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>GDP</th>\n",
       "      <th>population</th>\n",
       "      <th>gdp_per_cap</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>United States</td>\n",
       "      <td>26949643</td>\n",
       "      <td>334233854</td>\n",
       "      <td>0.080631</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Argentina</td>\n",
       "      <td>621833</td>\n",
       "      <td>47327407</td>\n",
       "      <td>0.013139</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         country       GDP  population  gdp_per_cap\n",
       "0  United States  26949643   334233854     0.080631\n",
       "1      Argentina    621833    47327407     0.013139"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world[\"gdp_per_cap\"] = world[\"GDP\"] / world[\"population\"]\n",
    "world"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And since all Series in a DataFrame share a common index, observations will line up the way you would expect when doing this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    26949643\n",
       "1      621833\n",
       "Name: GDP, dtype: int64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world[\"GDP\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    334233854\n",
       "1     47327407\n",
       "Name: population, dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world[\"population\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But it's an issue that *can* arise, so it's important you know about it!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## So... If They're Such A Pain, Why Do Index Labels Exist?\n",
    "\n",
    "I think index labels came into existence out of a well-meaning desire to make life easier for users by offering two special features for free. The first is index alignment. By *implicitly* doing a merge/join of pandas Series, I think the idea was that it would make life easier by not requiring users to do their merges explicitly. But... sometimes it's better to have simple, predictable behaviors (matching up row 1 to row 1, etc.), and to let people do merges explicitly when they want.\n",
    "\n",
    "The second is that pulling out rows using `.loc[]` and an index label (e.g., `small_world.loc[16]`) is faster than subsetting from a regular column (e.g., `small_world.loc[small_world[\"polityIV\"] == 16]`). That's because index labels are basically like a Python dictionary — getting values using an index label is an O(1) operation (if you recall that discussion from previous weeks), while subsetting based on regular column values is O(N).\n",
    "\n",
    "The problem, however, is that grabbing an arbitrary row really fast isn't a common data science use case, while index labels are present in all pandas data structures. Having a way to get an arbitrary row quickly is a nice feature, and one other libraries have implemented as an *optional* feature. But baking it into all data structures just complicates everything to accommodate something most people never do with pandas."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}