{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cleaning Data Types\n",
    "\n",
    "In our past two readings, we learned about methods for cleaning our data by modifying the *values* of certain variables. But a second and extremely common data cleanliness problem pertains to managing data *types*. For example, it is very common to load data that you think *should* be numeric (say, `float64`), only to discover that pandas has loaded it as `Categorical` or `Object` type, which means you can't do numeric computations on the data (e.g., `mean`, `median`, etc.). \n",
    "\n",
    "There are two main reasons that this can happen:\n",
    "\n",
    "- There are observations that are *accidentally* being parsed as non-numeric, or\n",
    "- There are observations that are *deliberately* non-numeric you didn't realize were there.\n",
    "\n",
    "Because this is so common, let's look at a quick example of this type of problem with our old friend `world-very-small`! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Brazil</td>\n",
       "      <td>S. America</td>\n",
       "      <td>10296</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Germany</td>\n",
       "      <td>W. Europe</td>\n",
       "      <td>35613</td>\n",
       "      <td>20 or higher</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Mexico</td>\n",
       "      <td>N. America</td>\n",
       "      <td>na</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Mozambique</td>\n",
       "      <td>Africa</td>\n",
       "      <td>855</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Russia</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>16139</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Ukraine</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>7271</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      country      region gdppcap08      polityIV\n",
       "0      Brazil  S. America     10296            18\n",
       "1     Germany   W. Europe     35613  20 or higher\n",
       "2      Mexico  N. America        na            18\n",
       "3  Mozambique      Africa       855            16\n",
       "4      Russia  C&E Europe     16139            17\n",
       "5     Ukraine  C&E Europe      7271            16"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "pd.set_option(\"mode.copy_on_write\", True)\n",
    "\n",
    "world = pd.read_csv(\"data/world-very-small-type-errors.csv\")\n",
    "world"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obviously we can see there are going to be issues with `gdppcap08` and `polityIV`, but this were a real (big) dataset where this wasn't evident and you tried to calculate the average GDP of countries in the data. You would get a *very* weird `ValueError` / `TypeError` that might look something like this (I'm removing most of the stuff in the middle for sanity):\n",
    "\n",
    "```python\n",
    "world[\"gdppcap08\"].mean()\n",
    "\n",
    "---------------------------------------------------------------------------\n",
    "ValueError                                Traceback (most recent call last)\n",
    "File ~/opt/miniconda3/lib/python3.9/site-packages/pandas/core/nanops.py:1622, in _ensure_numeric(x)\n",
    "   1621 try:\n",
    "-> 1622     x = float(x)\n",
    "   1623 except (TypeError, ValueError):\n",
    "   1624     # e.g. \"1+1j\" or \"foo\"\n",
    "\n",
    "ValueError: could not convert string to float: '1029635613na855161397271'\n",
    "\n",
    "[...]\n",
    "\n",
    "File ~/opt/miniconda3/lib/python3.9/site-packages/pandas/core/nanops.py:1629, in _ensure_numeric(x)\n",
    "   1626             x = complex(x)\n",
    "   1627         except ValueError as err:\n",
    "   1628             # e.g. \"foo\"\n",
    "-> 1629             raise TypeError(f\"Could not convert {x} to numeric\") from err\n",
    "   1630 return x\n",
    "\n",
    "TypeError: Could not convert 1029635613na855161397271 to numeric\n",
    "```\n",
    "\n",
    "What does all this mean? In short, it means that when pandas tried to run `mean()` it realized the method would only work if the data were numeric, so it tried to convert everything in the Series to a numeric type, but it found something in the Series it couldn't convert. Why is the error `TypeError: Could not convert 1029635613na855161397271 to numeric` and not `TypeError: Could not convert na to numeric`? Honestly, I don't have the foggiest notion, but that's what I got with my current version of pandas so I'm gonna leave it in this lesson so you know how weird these things can get!\n",
    "\n",
    "But the key take-away is this: what we *can* see is that there was a failure to convert a value to numeric. And that means there's something in our Series that can't be parsed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Finding Problematic Observations\n",
    "\n",
    "So the first thing we can do, as a sanity check, is check the `dtypes` of our dataframe to confirm that what we *thought* was a numeric column is in fact an `object` or `Categorical` column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "country      object\n",
       "region       object\n",
       "gdppcap08    object\n",
       "polityIV     object\n",
       "dtype: object"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Oops! Yup, there it is! Indeed, we can also see that `polityIV` is also an `object`, something we'll deal with later. \n",
    "\n",
    "So this kind of phenomenon is such a common problem that pandas has a tool to deal with it: `.str.isnumeric()`. This method returns `True` if an observation can be converted to a number without issues, and `False` otherwise, so by using its logical negation, we can find all observations that aren't convertible (which we may then need to fix):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "gdppcap08\n",
       "na    1\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.loc[~world[\"gdppcap08\"].str.isnumeric(), \"gdppcap08\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(Recall we have to use `~` instead of `not` to invert `True` values to `False` in a numpy array or pandas Series)\n",
    "\n",
    "There we are! One observation of `\"na\"` is the source of our problems. Note that you could also use `.unique()` instead of `.value_counts()` if you don't care about the number of observations that are causing problems. \n",
    "\n",
    "OK, so how do we fix this? In this case, it seems like `na` is probably just short for `np.nan`, so we can replace it with `np.nan`, which *is* a number:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "world.loc[world[\"gdppcap08\"] == \"na\", \"gdppcap08\"] = np.nan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we can either cast `gdppcap08` to a numeric type directly with `world[\"gdppcap08\"] = world[\"gdppcap08\"].astype(\"float\")`, or let pandas do it implicitly when we run `world[\"gdppcap08\"].mean()`. In general the former is probably better practice, so let's do that:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "14034.8"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world[\"gdppcap08\"] = world[\"gdppcap08\"].astype(\"float\")\n",
    "world[\"gdppcap08\"].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, there's one other option in these situations: if `na` is appearing in lots of places in your data as a representation for missing data, we can also communicate this fact to `read_csv`. Remember when we said that `read_csv` has more options than you could ever imagine? Well, one is `na_values`, where you can specify how missing data is represented in a dataset. If we use that to tell `read_csv` that `\"na\"` means data is missing, it'll make this correction on the fly!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Brazil</td>\n",
       "      <td>S. America</td>\n",
       "      <td>10296.0</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Germany</td>\n",
       "      <td>W. Europe</td>\n",
       "      <td>35613.0</td>\n",
       "      <td>20 or higher</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Mexico</td>\n",
       "      <td>N. America</td>\n",
       "      <td>NaN</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Mozambique</td>\n",
       "      <td>Africa</td>\n",
       "      <td>855.0</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Russia</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>16139.0</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Ukraine</td>\n",
       "      <td>C&amp;E Europe</td>\n",
       "      <td>7271.0</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      country      region  gdppcap08      polityIV\n",
       "0      Brazil  S. America    10296.0            18\n",
       "1     Germany   W. Europe    35613.0  20 or higher\n",
       "2      Mexico  N. America        NaN            18\n",
       "3  Mozambique      Africa      855.0            16\n",
       "4      Russia  C&E Europe    16139.0            17\n",
       "5     Ukraine  C&E Europe     7271.0            16"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world = pd.read_csv(\"data/world-very-small-type-errors.csv\", na_values=\"na\")\n",
    "world"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "country       object\n",
       "region        object\n",
       "gdppcap08    float64\n",
       "polityIV      object\n",
       "dtype: object"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "14034.8"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world[\"gdppcap08\"].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Magic!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deliberate Non-Numerics\n",
    "\n",
    "While in the example above `\"na\"` wasn't really meant to be interpreted as a string, sometimes non-numeric values are inserted into datasets to communicate something important to the user. For example, in this toy example, `polityIV` has a non-numeric value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "      <th>gdppcap08</th>\n",
       "      <th>polityIV</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Germany</td>\n",
       "      <td>W. Europe</td>\n",
       "      <td>35613.0</td>\n",
       "      <td>20 or higher</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   country     region  gdppcap08      polityIV\n",
       "1  Germany  W. Europe    35613.0  20 or higher"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "world.loc[~world[\"polityIV\"].str.isnumeric()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, the value is to indicate that the value of `polityIV` could be 20 *or any value above 20*. Why might this occur? In this case, I've made it up, but in many surveys, there will be a maximum value allowed (e.g., some surveys have a top income value they can record). These are called \"top-codes\", and they are different from the case above because it's not immediately clear what value you should put into the field. You *could* put in 20, but... the value might not be 20, it could be 21!\n",
    "\n",
    "So when you find a top-code like this, you need to think carefully about what to do with it, as the answer will depend on the type of analysis you want to do. But the answer will almost always lie not in the data, but in the documentation for the data. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Review\n",
    "\n",
    "- Sometimes data you think should be numeric won't be recognized as numeric by pandas.\n",
    "- If you try and do a numeric operation on these columns, you'll get a `TypeError` and/or `ValueError`, and usually a note about an inability to convert some value to numeric.\n",
    "- You can confirm that pandas sees a column as non-numeric with `.dtypes`.\n",
    "- You can find the non-numeric values with `df[~ df[\"column\"].str.isnumeric(), \"column\"].value_counts()`. \n",
    "- You can then replace these values with numeric values if it's clear how to do so and recast the column to numeric with `df[\"column\"] = df[\"column\"].astype(\"float\")`."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.7 ('base')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  },
  "vscode": {
   "interpreter": {
    "hash": "3e0a5228cb9726a24d36227c69ed0d3aac98cecda769d1c9adb080711d57f90d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}