{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cleaning Data Types\n", "\n", "In our past two readings, we learned about methods for cleaning our data by modifying the *values* of certain variables. But a second and extremely common data cleanliness problem pertains to managing data *types*. For example, it is very common to load data that you think *should* be numeric (say, `float64`), only to discover that pandas has loaded it as `Categorical` or `Object` type, which means you can't do numeric computations on the data (e.g., `mean`, `median`, etc.). \n", "\n", "There are two main reasons that this can happen:\n", "\n", "- There are observations that are *accidentally* being parsed as non-numeric, or\n", "- There are observations that are *deliberately* non-numeric you didn't realize were there.\n", "\n", "Because this is so common, let's look at a quick example of this type of problem with our old friend `world-very-small`! " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryregiongdppcap08polityIV
0BrazilS. America1029618
1GermanyW. Europe3561320 or higher
2MexicoN. Americana18
3MozambiqueAfrica85516
4RussiaC&E Europe1613917
5UkraineC&E Europe727116
\n", "
" ], "text/plain": [ " country region gdppcap08 polityIV\n", "0 Brazil S. America 10296 18\n", "1 Germany W. Europe 35613 20 or higher\n", "2 Mexico N. America na 18\n", "3 Mozambique Africa 855 16\n", "4 Russia C&E Europe 16139 17\n", "5 Ukraine C&E Europe 7271 16" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "pd.set_option(\"mode.copy_on_write\", True)\n", "\n", "world = pd.read_csv(\"data/world-very-small-type-errors.csv\")\n", "world" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Obviously we can see there are going to be issues with `gdppcap08` and `polityIV`, but this were a real (big) dataset where this wasn't evident and you tried to calculate the average GDP of countries in the data. You would get a *very* weird `ValueError` / `TypeError` that might look something like this (I'm removing most of the stuff in the middle for sanity):\n", "\n", "```python\n", "world[\"gdppcap08\"].mean()\n", "\n", "---------------------------------------------------------------------------\n", "ValueError Traceback (most recent call last)\n", "File ~/opt/miniconda3/lib/python3.9/site-packages/pandas/core/nanops.py:1622, in _ensure_numeric(x)\n", " 1621 try:\n", "-> 1622 x = float(x)\n", " 1623 except (TypeError, ValueError):\n", " 1624 # e.g. \"1+1j\" or \"foo\"\n", "\n", "ValueError: could not convert string to float: '1029635613na855161397271'\n", "\n", "[...]\n", "\n", "File ~/opt/miniconda3/lib/python3.9/site-packages/pandas/core/nanops.py:1629, in _ensure_numeric(x)\n", " 1626 x = complex(x)\n", " 1627 except ValueError as err:\n", " 1628 # e.g. \"foo\"\n", "-> 1629 raise TypeError(f\"Could not convert {x} to numeric\") from err\n", " 1630 return x\n", "\n", "TypeError: Could not convert 1029635613na855161397271 to numeric\n", "```\n", "\n", "What does all this mean? In short, it means that when pandas tried to run `mean()` it realized the method would only work if the data were numeric, so it tried to convert everything in the Series to a numeric type, but it found something in the Series it couldn't convert. Why is the error `TypeError: Could not convert 1029635613na855161397271 to numeric` and not `TypeError: Could not convert na to numeric`? Honestly, I don't have the foggiest notion, but that's what I got with my current version of pandas so I'm gonna leave it in this lesson so you know how weird these things can get!\n", "\n", "But the key take-away is this: what we *can* see is that there was a failure to convert a value to numeric. And that means there's something in our Series that can't be parsed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Finding Problematic Observations\n", "\n", "So the first thing we can do, as a sanity check, is check the `dtypes` of our dataframe to confirm that what we *thought* was a numeric column is in fact an `object` or `Categorical` column:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "country object\n", "region object\n", "gdppcap08 object\n", "polityIV object\n", "dtype: object" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oops! Yup, there it is! Indeed, we can also see that `polityIV` is also an `object`, something we'll deal with later. \n", "\n", "So this kind of phenomenon is such a common problem that pandas has a tool to deal with it: `.str.isnumeric()`. This method returns `True` if an observation can be converted to a number without issues, and `False` otherwise, so by using its logical negation, we can find all observations that aren't convertible (which we may then need to fix):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gdppcap08\n", "na 1\n", "Name: count, dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world.loc[~world[\"gdppcap08\"].str.isnumeric(), \"gdppcap08\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Recall we have to use `~` instead of `not` to invert `True` values to `False` in a numpy array or pandas Series)\n", "\n", "There we are! One observation of `\"na\"` is the source of our problems. Note that you could also use `.unique()` instead of `.value_counts()` if you don't care about the number of observations that are causing problems. \n", "\n", "OK, so how do we fix this? In this case, it seems like `na` is probably just short for `np.nan`, so we can replace it with `np.nan`, which *is* a number:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "world.loc[world[\"gdppcap08\"] == \"na\", \"gdppcap08\"] = np.nan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we can either cast `gdppcap08` to a numeric type directly with `world[\"gdppcap08\"] = world[\"gdppcap08\"].astype(\"float\")`, or let pandas do it implicitly when we run `world[\"gdppcap08\"].mean()`. In general the former is probably better practice, so let's do that:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "14034.8" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world[\"gdppcap08\"] = world[\"gdppcap08\"].astype(\"float\")\n", "world[\"gdppcap08\"].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, there's one other option in these situations: if `na` is appearing in lots of places in your data as a representation for missing data, we can also communicate this fact to `read_csv`. Remember when we said that `read_csv` has more options than you could ever imagine? Well, one is `na_values`, where you can specify how missing data is represented in a dataset. If we use that to tell `read_csv` that `\"na\"` means data is missing, it'll make this correction on the fly!" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryregiongdppcap08polityIV
0BrazilS. America10296.018
1GermanyW. Europe35613.020 or higher
2MexicoN. AmericaNaN18
3MozambiqueAfrica855.016
4RussiaC&E Europe16139.017
5UkraineC&E Europe7271.016
\n", "
" ], "text/plain": [ " country region gdppcap08 polityIV\n", "0 Brazil S. America 10296.0 18\n", "1 Germany W. Europe 35613.0 20 or higher\n", "2 Mexico N. America NaN 18\n", "3 Mozambique Africa 855.0 16\n", "4 Russia C&E Europe 16139.0 17\n", "5 Ukraine C&E Europe 7271.0 16" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world = pd.read_csv(\"data/world-very-small-type-errors.csv\", na_values=\"na\")\n", "world" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "country object\n", "region object\n", "gdppcap08 float64\n", "polityIV object\n", "dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world.dtypes" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "14034.8" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world[\"gdppcap08\"].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Magic!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deliberate Non-Numerics\n", "\n", "While in the example above `\"na\"` wasn't really meant to be interpreted as a string, sometimes non-numeric values are inserted into datasets to communicate something important to the user. For example, in this toy example, `polityIV` has a non-numeric value:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryregiongdppcap08polityIV
1GermanyW. Europe35613.020 or higher
\n", "
" ], "text/plain": [ " country region gdppcap08 polityIV\n", "1 Germany W. Europe 35613.0 20 or higher" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world.loc[~world[\"polityIV\"].str.isnumeric()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, the value is to indicate that the value of `polityIV` could be 20 *or any value above 20*. Why might this occur? In this case, I've made it up, but in many surveys, there will be a maximum value allowed (e.g., some surveys have a top income value they can record). These are called \"top-codes\", and they are different from the case above because it's not immediately clear what value you should put into the field. You *could* put in 20, but... the value might not be 20, it could be 21!\n", "\n", "So when you find a top-code like this, you need to think carefully about what to do with it, as the answer will depend on the type of analysis you want to do. But the answer will almost always lie not in the data, but in the documentation for the data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review\n", "\n", "- Sometimes data you think should be numeric won't be recognized as numeric by pandas.\n", "- If you try and do a numeric operation on these columns, you'll get a `TypeError` and/or `ValueError`, and usually a note about an inability to convert some value to numeric.\n", "- You can confirm that pandas sees a column as non-numeric with `.dtypes`.\n", "- You can find the non-numeric values with `df[~ df[\"column\"].str.isnumeric(), \"column\"].value_counts()`. \n", "- You can then replace these values with numeric values if it's clear how to do so and recast the column to numeric with `df[\"column\"] = df[\"column\"].astype(\"float\")`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.7 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "vscode": { "interpreter": { "hash": "3e0a5228cb9726a24d36227c69ed0d3aac98cecda769d1c9adb080711d57f90d" } } }, "nbformat": 4, "nbformat_minor": 4 }