{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working with tabular data through Dataframes\n",
    "\n",
    "Previously, we learned about Series: an ordered collection of observations, analogous to a `numpy` vector but with superpowers. \n",
    "\n",
    "In this tutorial, we'll learn about DataFrames, a method of holding *tabular* data in which each row is an observation, and each column is a variable. (OK, there are some different forms of tabular data, but that's the most common format you'll encounter). \n",
    "\n",
    "To illustrate, here's a small `pandas` DataFrame (created by importing data from a spreadsheet you can find [here](https://github.com/nickeubank/practicaldatascience/blob/master/Example_Data/world-very-small.csv)):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Brazil</td>\n",
       "      <td>S. America</td>\n",
       "      <td>10296</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Germany</td>\n",
       "      <td>W. Europe</td>\n",
       "      <td>35613</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Mexico</td>\n",
       "      <td>N. America</td>\n",
       "      <td>14495</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Mozambique</td>\n",
       "      <td>Africa</td>\n",
       "      <td>855</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Russia</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>16139</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Ukraine</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>7271</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      country      region  gdppcap08  polityIV\n",
       "0      Brazil  S. America      10296        18\n",
       "1     Germany   W. Europe      35613        20\n",
       "2      Mexico  N. America      14495        18\n",
       "3  Mozambique      Africa        855        16\n",
       "4      Russia  C&E Europe      16139        17\n",
       "5     Ukraine  C&E Europe       7271        16"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "smallworld = pd.read_csv(\n",
    "    \"https://raw.githubusercontent.com/nickeubank/\"\n",
    "    \"practicaldatascience/master/Example_Data/world-very-small.csv\"\n",
    ")\n",
    "smallworld"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, each of the 6 rows in the DataFrame `world` is a different country, and each column contains different information about that country—the country's name, its region, its income level (GDP per Capita in 2008), and how close it was to an idealized liberal democracy in 2008 (it's polity IV score). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is a DataFrame?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Where a Series was a one-dimensional collection of data, a DataFrame is fundamentally two-dimensional. As a result, it has many of the same types of features as a Series, just generalized to two dimensions. Here we show the breakdown of key aspects of a DataFrame that we'll discuss throughout this lesson."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "<img src=\"img/3.2.30-pandas_dataframe_anatomy.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Index and Columns\n",
    "\n",
    "For example, like a Series, a DataFrame has an index that labels every row: in this case, it's the usual default index that labels each row with its initial row number. Unlike a Series, however, DataFrames have a second set of labels: column names!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RangeIndex(start=0, stop=6, step=1)"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Here are the row labels\n",
    "# (Note that a \"range index\" is just\n",
    "# another way of labeling each row with its row number)\n",
    "smallworld.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['country', 'region', 'gdppcap08', 'polityIV'], dtype='object')"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# And here is our column index.\n",
    "# Note that while we don't call it \"index\",\n",
    "# the column names are of type Index.\n",
    "# They really are the same as row indices,\n",
    "# just for columns\n",
    "\n",
    "smallworld.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Constructing DataFrames\n",
    "\n",
    "As with Series, there are many ways to construct a DataFrame. Honestly, by far the most common is that you'll read in a dataset from a file. Pandas offers lots of tools for doing this depending on the format of the data you're importing. We'll discuss this more in future lessons, but here are just a few methods to know about: \n",
    "\n",
    "- `pd.read_csv`: Read in a comma-separated-value file\n",
    "- `pd.read_excel`: Read in an Excel (`.xls` and `.xlsx`) spreadsheet\n",
    "- `pd.read_stata`: Read Stata (`.dta`) datasets\n",
    "- `pd.read_hdf`: Read HDF (`.hdf`) datasets\n",
    "- `pd.read_sql`: Read from a SQL database\n",
    "\n",
    "Similarly, if we have an existing DataFrame (let's call it `df`) we want to output to a file or database, we can use complimentary methods of `df` to do so such as:\n",
    "\n",
    "- `df.to_csv`: Write to a comma-separated-value file\n",
    "- `df.to_excel`: Write to an Excel (`.xls` and `.xlsx`) spreadsheet\n",
    "- `df.to_stata`: Write to a stata (`.dta`) dataset\n",
    "- `df.to_hdf`: Write to an HDF (`.hdf`) dataset\n",
    "- `df.to_sql`: Write to a SQL database\n",
    "\n",
    "You can find a full list of [IO methods here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But you can also construct DataFrames by hand. The easiest (and most common) way is by passing in a Dictionary, where the keys will become column names and the values are column values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>animals</th>\n",
       "      <th>can_swim</th>\n",
       "      <th>has_fur</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>dog</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>cat</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>bird</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>fish</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  animals  can_swim  has_fur\n",
       "0     dog      True     True\n",
       "1     cat     False     True\n",
       "2    bird     False    False\n",
       "3    fish      True    False"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"animals\": [\"dog\", \"cat\", \"bird\", \"fish\"],\n",
    "        \"can_swim\": [True, False, False, True],\n",
    "        \"has_fur\": [True, True, False, False],\n",
    "    }\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting To Know Your DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While our toy `smallworld` dataset is small enough to easy print out and visualize, most datasets worth working with are too big to just look at. In those situations, we need tools to summarize the contents of our DataFrame. \n",
    "\n",
    "Let's load up a version of the `smallworld` dataset we looked at above that actually has all the countries in the world (instead of just 6). You can find the original dataset [here](https://github.com/nickeubank/practicaldatascience/blob/master/Example_Data/world-small.csv)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Albania</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>7715</td>\n",
       "      <td>17.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Algeria</td>\n",
       "      <td>Africa</td>\n",
       "      <td>8033</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Angola</td>\n",
       "      <td>Africa</td>\n",
       "      <td>5899</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Argentina</td>\n",
       "      <td>S. America</td>\n",
       "      <td>14333</td>\n",
       "      <td>18.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Armenia</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>6070</td>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>140</th>\n",
       "      <td>Venezuela</td>\n",
       "      <td>S. America</td>\n",
       "      <td>12804</td>\n",
       "      <td>16.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>141</th>\n",
       "      <td>Vietnam</td>\n",
       "      <td>Asia-Pacific</td>\n",
       "      <td>2785</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>142</th>\n",
       "      <td>Yemen</td>\n",
       "      <td>Middle East</td>\n",
       "      <td>2400</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>143</th>\n",
       "      <td>Zambia</td>\n",
       "      <td>Africa</td>\n",
       "      <td>1356</td>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>144</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>Africa</td>\n",
       "      <td>188</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>145 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       country        region  gdppcap08  polityIV\n",
       "0      Albania    C&E Europe       7715      17.8\n",
       "1      Algeria        Africa       8033      10.0\n",
       "2       Angola        Africa       5899       8.0\n",
       "3    Argentina    S. America      14333      18.0\n",
       "4      Armenia    C&E Europe       6070      15.0\n",
       "..         ...           ...        ...       ...\n",
       "140  Venezuela    S. America      12804      16.0\n",
       "141    Vietnam  Asia-Pacific       2785       3.0\n",
       "142      Yemen   Middle East       2400       8.0\n",
       "143     Zambia        Africa       1356      15.0\n",
       "144   Zimbabwe        Africa        188       6.0\n",
       "\n",
       "[145 rows x 4 columns]"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world = pd.read_csv(\n",
    "    \"https://raw.githubusercontent.com/nickeubank/\"\n",
    "    \"practicaldatascience/master/Example_Data/world-small.csv\"\n",
    ")\n",
    "world"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, `pandas` prints out a bunch of the rows, but not all the rows (note the `...` in the middle) in an effort to not take over your monitor. This DataFrame could theoretically be printed out in its entirety (as noted at the bottom of the output, it only has 145 rows), but in the real world, we often work with datasets with hundreds of thousands or millions of rows where printing just isn't possible. So here are some methods for \"getting to know your data\":"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the first 5 rows using `.head()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Albania</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>7715</td>\n",
       "      <td>17.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Algeria</td>\n",
       "      <td>Africa</td>\n",
       "      <td>8033</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Angola</td>\n",
       "      <td>Africa</td>\n",
       "      <td>5899</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Argentina</td>\n",
       "      <td>S. America</td>\n",
       "      <td>14333</td>\n",
       "      <td>18.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Armenia</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>6070</td>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     country      region  gdppcap08  polityIV\n",
       "0    Albania  C&E Europe       7715      17.8\n",
       "1    Algeria      Africa       8033      10.0\n",
       "2     Angola      Africa       5899       8.0\n",
       "3  Argentina  S. America      14333      18.0\n",
       "4    Armenia  C&E Europe       6070      15.0"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And look at the last 5 rows with `.tail()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>140</th>\n",
       "      <td>Venezuela</td>\n",
       "      <td>S. America</td>\n",
       "      <td>12804</td>\n",
       "      <td>16.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>141</th>\n",
       "      <td>Vietnam</td>\n",
       "      <td>Asia-Pacific</td>\n",
       "      <td>2785</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>142</th>\n",
       "      <td>Yemen</td>\n",
       "      <td>Middle East</td>\n",
       "      <td>2400</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>143</th>\n",
       "      <td>Zambia</td>\n",
       "      <td>Africa</td>\n",
       "      <td>1356</td>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>144</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>Africa</td>\n",
       "      <td>188</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country        region  gdppcap08  polityIV\n",
       "140  Venezuela    S. America      12804      16.0\n",
       "141    Vietnam  Asia-Pacific       2785       3.0\n",
       "142      Yemen   Middle East       2400       8.0\n",
       "143     Zambia        Africa       1356      15.0\n",
       "144   Zimbabwe        Africa        188       6.0"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.tail(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "View a random subset of rows (here, 5). This is valuable because the first rows of a dataset aren't always representative of the dataset. Often datasets are ordered (this dataset is sorted alphabetically by country), and seeing the first or last few entries can sometimes be misleading. Random sampling can reduce this effect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>Oman</td>\n",
       "      <td>Middle East</td>\n",
       "      <td>22478</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>Chad</td>\n",
       "      <td>Africa</td>\n",
       "      <td>1455</td>\n",
       "      <td>8.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>Norway</td>\n",
       "      <td>Scandinavia</td>\n",
       "      <td>58138</td>\n",
       "      <td>20.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Central African Republic</td>\n",
       "      <td>Africa</td>\n",
       "      <td>736</td>\n",
       "      <td>10.200000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>108</th>\n",
       "      <td>Romania</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>14065</td>\n",
       "      <td>18.333333</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      country       region  gdppcap08   polityIV\n",
       "99                       Oman  Middle East      22478   2.000000\n",
       "23                       Chad       Africa       1455   8.000000\n",
       "98                     Norway  Scandinavia      58138  20.000000\n",
       "22   Central African Republic       Africa        736  10.200000\n",
       "108                   Romania   C&E Europe      14065  18.333333"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use `len()` to get the number of rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "145"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(world)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or `len()` and `.columns` to get the number of columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(world.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also check the data type of each column with `.dtypes` (note the `s`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "country       object\n",
       "region        object\n",
       "gdppcap08      int64\n",
       "polityIV     float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or a slightly more nicely formatted summary with `.info()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 145 entries, 0 to 144\n",
      "Data columns (total 4 columns):\n",
      " #   Column     Non-Null Count  Dtype  \n",
      "---  ------     --------------  -----  \n",
      " 0   country    145 non-null    object \n",
      " 1   region     145 non-null    object \n",
      " 2   gdppcap08  145 non-null    int64  \n",
      " 3   polityIV   145 non-null    float64\n",
      "dtypes: float64(1), int64(1), object(2)\n",
      "memory usage: 4.7+ KB\n"
     ]
    }
   ],
   "source": [
    "world.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also summary statistics for each numeric column (objects are ignored) with `.describe()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>145.000000</td>\n",
       "      <td>145.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>13251.993103</td>\n",
       "      <td>13.407816</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>14802.581676</td>\n",
       "      <td>6.587626</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>188.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>2153.000000</td>\n",
       "      <td>7.666667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>7271.000000</td>\n",
       "      <td>16.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>19330.000000</td>\n",
       "      <td>19.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>85868.000000</td>\n",
       "      <td>20.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          gdppcap08    polityIV\n",
       "count    145.000000  145.000000\n",
       "mean   13251.993103   13.407816\n",
       "std    14802.581676    6.587626\n",
       "min      188.000000    0.000000\n",
       "25%     2153.000000    7.666667\n",
       "50%     7271.000000   16.000000\n",
       "75%    19330.000000   19.000000\n",
       "max    85868.000000   20.000000"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can list out all the column names. Note that when a table has a lot of columns, the output of `.columns` will be truncated. When that happens, this little hack will ensure you get to see all the columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\n",
      "region\n",
      "gdppcap08\n",
      "polityIV\n"
     ]
    }
   ],
   "source": [
    "for c in world.columns:\n",
    "    print(c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Subsetting a DataFrame\n",
    "\n",
    "As with Series, one of the most important skills for working with DataFrames is knowing how to subset them. Thankfully, subsetting DataFrames is just like subsetting Series but in two dimensions.\n",
    "\n",
    "### .iloc\n",
    "\n",
    "To subset a DataFrame using `iloc`, we now have to pass two arguments into `iloc` separated by a comma (the first entry subsets rows, the second columns, just like in `numpy`). \n",
    "\n",
    "For example, if we wanted the entry in the fourth row of the first column, we would use: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Argentina'"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.iloc[3, 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And like with Series and `numpy`, `iloc` supports slices. Here are the first two rows of the first three columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Albania</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>7715</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Algeria</td>\n",
       "      <td>Africa</td>\n",
       "      <td>8033</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   country      region  gdppcap08\n",
       "0  Albania  C&E Europe       7715\n",
       "1  Algeria      Africa       8033"
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.iloc[0:2, 0:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to get a subset on one dimension, but *all* the entries on the other, just pass `:` for the dimension on which you want all the data (just like in `numpy`). Here are the first two rows and all the columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Albania</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>7715</td>\n",
       "      <td>17.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Algeria</td>\n",
       "      <td>Africa</td>\n",
       "      <td>8033</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   country      region  gdppcap08  polityIV\n",
       "0  Albania  C&E Europe       7715      17.8\n",
       "1  Algeria      Africa       8033      10.0"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.iloc[0:2, :]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you ONLY pass one set of arguments, though, those will be applied to the first dimension (rows), just like in `numpy`. Thus `.iloc[0:2]` is the same as `.iloc[0:2, :]`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Albania</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>7715</td>\n",
       "      <td>17.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Algeria</td>\n",
       "      <td>Africa</td>\n",
       "      <td>8033</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   country      region  gdppcap08  polityIV\n",
       "0  Albania  C&E Europe       7715      17.8\n",
       "1  Algeria      Africa       8033      10.0"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.iloc[0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### loc\n",
    "\n",
    "`.loc` generalized from Series to DataFrames using the same tricks as `.iloc`: if you pass two arguments, the first will subset rows (though for `.loc`, the subsetting is on index values, not row numbers), and the second will subset columns (again, on column names, not column order). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Algeria'"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Index value 1, column country\n",
    "world.loc[1, \"country\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And just like in Series, if you pass a range to `.loc`, the end points will be included (unlike with most Python functions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    Albania\n",
       "1    Algeria\n",
       "Name: country, dtype: object"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.loc[0:1, \"country\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, as with `.iloc`, if you pass a single argument to `.loc`, it will subset on the first dimension (rows):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Albania</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>7715</td>\n",
       "      <td>17.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Algeria</td>\n",
       "      <td>Africa</td>\n",
       "      <td>8033</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Angola</td>\n",
       "      <td>Africa</td>\n",
       "      <td>5899</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Argentina</td>\n",
       "      <td>S. America</td>\n",
       "      <td>14333</td>\n",
       "      <td>18.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     country      region  gdppcap08  polityIV\n",
       "0    Albania  C&E Europe       7715      17.8\n",
       "1    Algeria      Africa       8033      10.0\n",
       "2     Angola      Africa       5899       8.0\n",
       "3  Argentina  S. America      14333      18.0"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.loc[0:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logical Tests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Subsetting with logical tests also works in a familiar manner for DataFrames: \n",
    "\n",
    "- If you pass a single boolean array to `.loc`, it will subset on rows. \n",
    "- If the Boolean array has an Index (i.e. if it's a Series), then alignment will take place on index values.\n",
    "- If the Boolean array does NOT have an index (i.e. it's a list of Booleans), then alignment will take place on row order. \n",
    "- To subset columns based on a test, you have to use `.loc[:, YOUR_TEST_HERE]`. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate, let's start by shuffling our DataFrame so that index values and row numbers aren't the same:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>144</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>Africa</td>\n",
       "      <td>188</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>Congo Kinshasa</td>\n",
       "      <td>Africa</td>\n",
       "      <td>321</td>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>76</th>\n",
       "      <td>Liberia</td>\n",
       "      <td>Africa</td>\n",
       "      <td>388</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>53</th>\n",
       "      <td>Guinea-Bissau</td>\n",
       "      <td>Africa</td>\n",
       "      <td>538</td>\n",
       "      <td>11.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <td>Eritrea</td>\n",
       "      <td>Africa</td>\n",
       "      <td>632</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            country  region  gdppcap08  polityIV\n",
       "144        Zimbabwe  Africa        188       6.0\n",
       "29   Congo Kinshasa  Africa        321      15.0\n",
       "76          Liberia  Africa        388      10.0\n",
       "53    Guinea-Bissau  Africa        538      11.0\n",
       "40          Eritrea  Africa        632       3.0"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world = world.sort_values(\"gdppcap08\")\n",
    "world.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>Congo Kinshasa</td>\n",
       "      <td>Africa</td>\n",
       "      <td>321</td>\n",
       "      <td>15.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>53</th>\n",
       "      <td>Guinea-Bissau</td>\n",
       "      <td>Africa</td>\n",
       "      <td>538</td>\n",
       "      <td>11.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>Niger</td>\n",
       "      <td>Africa</td>\n",
       "      <td>684</td>\n",
       "      <td>15.333333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Central African Republic</td>\n",
       "      <td>Africa</td>\n",
       "      <td>736</td>\n",
       "      <td>10.200000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>113</th>\n",
       "      <td>Sierra Leone</td>\n",
       "      <td>Africa</td>\n",
       "      <td>766</td>\n",
       "      <td>15.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      country  region  gdppcap08   polityIV\n",
       "29             Congo Kinshasa  Africa        321  15.000000\n",
       "53              Guinea-Bissau  Africa        538  11.000000\n",
       "96                      Niger  Africa        684  15.333333\n",
       "22   Central African Republic  Africa        736  10.200000\n",
       "113              Sierra Leone  Africa        766  15.000000"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Test with an index -> subset rows, align on index\n",
    "relatively_democratic = world.loc[world[\"polityIV\"] > 10]\n",
    "relatively_democratic.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And if we want to subset columns on a Boolean (admittedly a silly example, but you get the idea):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>gdppcap08</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>Congo Kinshasa</td>\n",
       "      <td>321</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>53</th>\n",
       "      <td>Guinea-Bissau</td>\n",
       "      <td>538</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>Niger</td>\n",
       "      <td>684</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Central African Republic</td>\n",
       "      <td>736</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>113</th>\n",
       "      <td>Sierra Leone</td>\n",
       "      <td>766</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>93</th>\n",
       "      <td>Netherlands</td>\n",
       "      <td>40849</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>124</th>\n",
       "      <td>Switzerland</td>\n",
       "      <td>42536</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>62</th>\n",
       "      <td>Ireland</td>\n",
       "      <td>44200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>137</th>\n",
       "      <td>United States</td>\n",
       "      <td>46716</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>Norway</td>\n",
       "      <td>58138</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>96 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                      country  gdppcap08\n",
       "29             Congo Kinshasa        321\n",
       "53              Guinea-Bissau        538\n",
       "96                      Niger        684\n",
       "22   Central African Republic        736\n",
       "113              Sierra Leone        766\n",
       "..                        ...        ...\n",
       "93                Netherlands      40849\n",
       "124               Switzerland      42536\n",
       "62                    Ireland      44200\n",
       "137             United States      46716\n",
       "98                     Norway      58138\n",
       "\n",
       "[96 rows x 2 columns]"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "relatively_democratic = relatively_democratic.loc[\n",
    "    :, (world.columns == \"country\") | (world.columns == \"gdppcap08\")\n",
    "]\n",
    "relatively_democratic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A Reminder on Combining Logical Tests\n",
    "\n",
    "[As with numpy](https://www.practicaldatascience.org/html/30_subsetting_vectors.html#Subsetting-With-Logical-Operations), when combining logical tests, you have to wrap each test in parentheses. If you don't, and just pass `world.columns == \"country\" | world.columns == \"gdppcap08\"`, you will get a *very* confusing Error:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "\n",
    "relatively_democratic.loc[\n",
    "    :, world.columns == \"country\" | world.columns == \"gdppcap08\"\n",
    "]\n",
    "\n",
    "---------------------------------------------------------------------------\n",
    "TypeError                                 Traceback (most recent call last)\n",
    "File ~/opt/miniconda3/lib/python3.10/site-packages/pandas/core/ops/array_ops.py:311, in na_logical_op(x, y, op)\n",
    "    302 try:\n",
    "    303     # For exposition, write:\n",
    "    304     #  yarr = isinstance(y, np.ndarray)\n",
    "   (...)\n",
    "    309     # Then Cases where this goes through without raising include:\n",
    "    310     #  (xint or xbool) and (yint or bool)\n",
    "--> 311     result = op(x, y)\n",
    "    312 except TypeError:\n",
    "\n",
    "File ~/opt/miniconda3/lib/python3.10/site-packages/pandas/core/roperator.py:58, in ror_(left, right)\n",
    "     57 def ror_(left, right):\n",
    "---> 58     return operator.or_(right, left)\n",
    "\n",
    "TypeError: unsupported operand type(s) for |: 'str' and 'str'\n",
    "\n",
    "...\n",
    "\n",
    "    339 return result.reshape(x.shape)\n",
    "\n",
    "TypeError: Cannot perform 'ror_' with a dtyped [object] array and scalar of type [bool]\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `[]` Square brackets\n",
    "\n",
    "As with Series, single square brackets in `pandas` change their behavior depending on the values you pass them. Again, it is worth emphasizing that there is *nothing* that one can do with square brackets that you can't do with `.loc` and `.iloc`, so if they seem to strange, you don't have to use them. \n",
    "\n",
    "With that said, as summarized below, `[]` is actually much safer on DataFrames than on Series. \n",
    "\n",
    "The rules of `[]` in DataFrames are:\n",
    "\n",
    "- If your entry is a *single* column name, or a list of column names, it will return those columns. \n",
    "- If your entry is a slice, it will work like `iloc` and select rows based on row order. \n",
    "- If your entry is a Boolean array, *and* of exactly the same length as the number of rows in your data, it will subset rows.\n",
    "    - Note this means that `[]` does not do the same thing we saw `.loc` do above where, if passed a short Boolean array, it will assume any row without an entry in the Boolean array should be dropped."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "144          Zimbabwe\n",
       "29     Congo Kinshasa\n",
       "76            Liberia\n",
       "53      Guinea-Bissau\n",
       "40            Eritrea\n",
       "Name: country, dtype: object"
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select one column\n",
    "world[\"country\"].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>gdppcap08</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>144</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>188</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>Congo Kinshasa</td>\n",
       "      <td>321</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>76</th>\n",
       "      <td>Liberia</td>\n",
       "      <td>388</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>53</th>\n",
       "      <td>Guinea-Bissau</td>\n",
       "      <td>538</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <td>Eritrea</td>\n",
       "      <td>632</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            country  gdppcap08\n",
       "144        Zimbabwe        188\n",
       "29   Congo Kinshasa        321\n",
       "76          Liberia        388\n",
       "53    Guinea-Bissau        538\n",
       "40          Eritrea        632"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select multiple columns\n",
    "world[[\"country\", \"gdppcap08\"]].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>79</th>\n",
       "      <td>Macedonia</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>10041</td>\n",
       "      <td>19.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>118</th>\n",
       "      <td>South Africa</td>\n",
       "      <td>Africa</td>\n",
       "      <td>10109</td>\n",
       "      <td>19.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Brazil</td>\n",
       "      <td>S. America</td>\n",
       "      <td>10296</td>\n",
       "      <td>18.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>Costa Rica</td>\n",
       "      <td>S. America</td>\n",
       "      <td>11241</td>\n",
       "      <td>20.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>68</th>\n",
       "      <td>Kazakhstan</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>11315</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          country      region  gdppcap08  polityIV\n",
       "79      Macedonia  C&E Europe      10041      19.0\n",
       "118  South Africa      Africa      10109      19.0\n",
       "16         Brazil  S. America      10296      18.0\n",
       "30     Costa Rica  S. America      11241      20.0\n",
       "68     Kazakhstan  C&E Europe      11315       4.0"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Boolean test\n",
    "world[world[\"gdppcap08\"] > 10000].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>144</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>Africa</td>\n",
       "      <td>188</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>Congo Kinshasa</td>\n",
       "      <td>Africa</td>\n",
       "      <td>321</td>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>76</th>\n",
       "      <td>Liberia</td>\n",
       "      <td>Africa</td>\n",
       "      <td>388</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            country  region  gdppcap08  polityIV\n",
       "144        Zimbabwe  Africa        188       6.0\n",
       "29   Congo Kinshasa  Africa        321      15.0\n",
       "76          Liberia  Africa        388      10.0"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Slice of rows\n",
    "world[0:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DataFrames: Collections of Series"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While it is natural to think of a DataFrame as a single table (like a numpy matrix), in reality, **a DataFrame** is just a collection of Series.  \n",
    "\n",
    "To see this, let's pull out an individual column using square bracket notation, and check its type:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(world[\"country\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tada! \n",
    "\n",
    "And that means that you can always pull out a column from a DataFrame and manipulate it using the tools you've already learned from the Series tutorial. And because you know how to extract the `numpy` array that underlies a Series, that means you also always know how to move from DataFrames to `numpy` arrays if you need to:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "numpy.ndarray"
      ]
     },
     "execution_count": 104,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get numpy array under one column\n",
    "type(world[\"country\"].values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Selecting Series versus Selecting DataFrames\n",
    "\n",
    "There is one point of nuance worth exploring: when you extract a single column from a DataFrame, you have the choice of either extracting a Series, or extracting a DataFrame with a single column. What determines this is whether you use one pair of square brackets, or two. \n",
    "\n",
    "If you use a single set of square brackets (or pass just the name of a column to `loc`, you get back a Series. But if you pass a *list* with the column name, you get back a DataFrame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(world[\"country\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(world[[\"country\"]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This also holds for rows, by the way. If you ask for a single row, you will actually get back a (newly construted) Series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(world.iloc[3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(Obviously, if you ask for more than one row, or more than one column, you will always get back a DataFrame, since the object you're requesting is intrinsically 2-dimensional and can't be represented as a Series. )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "We have explored how DataFrames are versatile tools for loading in (and writing out) data of diverse file types and for selecting subsets of those data (filtering) for further analysis. These techniques are at the core of how we typically work with tabular datasets in data science. Int he exercises that follow, you'll have a chance to get experience using those tools. And in the next week, we will dive deeply into how to work with DataFrames and use them to ready data for further analysis."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.10.2 ('ds')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  },
  "vscode": {
   "interpreter": {
    "hash": "b9e56a7b23b1fac2eea1a993b805ed5c611aea1439c1f46315b23590ab6d3ba0"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}