{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Queries reprised: asking your data questions\n", "\n", "Much of this week has focused on data preparation: combining datasets together, reconfiguring and aggregating them into groups, all for the purpose of preparing the data to help us develop insights and ask questions. In this lesson, we'll be returning to how we can programmatically ask questions of our datasets using queries. We have seen all of this before when we discussed subsetting our data and filtering based on different logical expressions. Here we will introduce a compact way to execute such queries to add another tool into your programmatic toolbox for working with data. A few of you may have encountered structured query language (SQL) before, and if so, there will be many connections to what we discuss today.\n", "\n", "Let's start with our sales data from a previous lesson:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeesalesyear
0Katrina142018
1Guanyu172019
2Jan62020
3Roman122018
4Jacqueline82020
5Paola32019
6Esperanza72019
7Alaina152020
8Egweyn52020
\n", "
" ], "text/plain": [ " employee sales year\n", "0 Katrina 14 2018\n", "1 Guanyu 17 2019\n", "2 Jan 6 2020\n", "3 Roman 12 2018\n", "4 Jacqueline 8 2020\n", "5 Paola 3 2019\n", "6 Esperanza 7 2019\n", "7 Alaina 15 2020\n", "8 Egweyn 5 2020" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "pd.set_option(\"mode.copy_on_write\", True)\n", "\n", "sales = pd.DataFrame(\n", " data={\n", " \"employee\": [\n", " \"Katrina\",\n", " \"Guanyu\",\n", " \"Jan\",\n", " \"Roman\",\n", " \"Jacqueline\",\n", " \"Paola\",\n", " \"Esperanza\",\n", " \"Alaina\",\n", " \"Egweyn\",\n", " ],\n", " \"sales\": [14, 17, 6, 12, 8, 3, 7, 15, 5],\n", " \"year\": [2018, 2019, 2020, 2018, 2020, 2019, 2019, 2020, 2020],\n", " }\n", ")\n", "sales" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To query a `pandas` DataFrame, we need to use the `query` method which takes a query string. This string allows you to ask a world of questions of your data for example, see the table below for a set of examples of quieries and their corresponding query strings for our `sales` dataset:\n", "\n", "| Query | Query string |\n", "| ----------- | ----------- |\n", "| Show me sales greater than 10 | \"sales > 10\" |\n", "| Show me data from 2018 | \"year == 2018\" |\n", "| Show me sales are greater than 13 and the year is 2018| \"sales > 13 and year == 2018\" |\n", "| Show me everything EXCEPT for when sales are greater than 13 and the year is 2018 | \"not (sales > 13 and year == 2018)\" |\n", "| Show me data where sales divided by 3 are greater than 3 | \"sales/3 > 3\" |\n", "| Show me employees whose names are alphabetically after J | \"employee > 'J'\" |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's show each of these queries in action below:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeesalesyear
0Katrina142018
1Guanyu172019
3Roman122018
7Alaina152020
\n", "
" ], "text/plain": [ " employee sales year\n", "0 Katrina 14 2018\n", "1 Guanyu 17 2019\n", "3 Roman 12 2018\n", "7 Alaina 15 2020" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.query(\"sales > 10\")" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeesalesyear
0Katrina142018
3Roman122018
\n", "
" ], "text/plain": [ " employee sales year\n", "0 Katrina 14 2018\n", "3 Roman 12 2018" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.query(\"year == 2018\")" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeesalesyear
0Katrina142018
\n", "
" ], "text/plain": [ " employee sales year\n", "0 Katrina 14 2018" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.query(\"sales > 13 and year == 2018\")" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeesalesyear
1Guanyu172019
2Jan62020
3Roman122018
4Jacqueline82020
5Paola32019
6Esperanza72019
7Alaina152020
8Egweyn52020
\n", "
" ], "text/plain": [ " employee sales year\n", "1 Guanyu 17 2019\n", "2 Jan 6 2020\n", "3 Roman 12 2018\n", "4 Jacqueline 8 2020\n", "5 Paola 3 2019\n", "6 Esperanza 7 2019\n", "7 Alaina 15 2020\n", "8 Egweyn 5 2020" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.query(\"not (sales > 13 and year == 2018)\")" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeesalesyear
0Katrina142018
1Guanyu172019
3Roman122018
7Alaina152020
\n", "
" ], "text/plain": [ " employee sales year\n", "0 Katrina 14 2018\n", "1 Guanyu 17 2019\n", "3 Roman 12 2018\n", "7 Alaina 15 2020" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.query(\"sales/3 > 3\")" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeesalesyear
0Katrina142018
2Jan62020
3Roman122018
4Jacqueline82020
5Paola32019
\n", "
" ], "text/plain": [ " employee sales year\n", "0 Katrina 14 2018\n", "2 Jan 6 2020\n", "3 Roman 12 2018\n", "4 Jacqueline 8 2020\n", "5 Paola 3 2019" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.query('employee > \"J\"')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, these same queries could have been accomplished using selection and filtering concept we discussed earlier, as shown below, but requiring more sytax (and the `query` method is often more computationally efficient). Let's try rewriting `sales.query('sales > 13 and year == 2018')` using the techniques we described earlier:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeesalesyear
0Katrina142018
\n", "
" ], "text/plain": [ " employee sales year\n", "0 Katrina 14 2018" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales[(sales[\"sales\"] > 13) & (sales[\"year\"] == 2018)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `query` method makes our code far easier to follow.\n", "\n", "Another nice feature of `query` is that we can bring in variables from our local workspace, but outside of the DataFrame by prefixing the `@` symbol to the variable name. Let's say we knew that each sale was worth $1,000. Then we could query based on the total dollar amount of sales as follows:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeesalesyear
0Katrina142018
1Guanyu172019
3Roman122018
7Alaina152020
\n", "
" ], "text/plain": [ " employee sales year\n", "0 Katrina 14 2018\n", "1 Guanyu 17 2019\n", "3 Roman 12 2018\n", "7 Alaina 15 2020" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "revenue_per_sale = 1000\n", "sales.query(\"sales * @revenue_per_sale > 10000\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computation with columns\n", "\n", "In a similar way that we were able to create more readable and efficient queries with the `query` method, `pandas` also has a faster, more readable method for performing a number of computations on columns. let's create a DataFrame with some data to demonstrate:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speed_initialspeed_finaltime_initialtime_final
00.2710650.7208020.7962690.433519
10.3120850.3492550.2581580.833784
20.9863990.1296300.0591270.246308
30.5424440.2955940.2500110.102992
40.2264100.6708190.3883600.600059
\n", "
" ], "text/plain": [ " speed_initial speed_final time_initial time_final\n", "0 0.271065 0.720802 0.796269 0.433519\n", "1 0.312085 0.349255 0.258158 0.833784\n", "2 0.986399 0.129630 0.059127 0.246308\n", "3 0.542444 0.295594 0.250011 0.102992\n", "4 0.226410 0.670819 0.388360 0.600059" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "rand_matrix = np.random.rand(5, 4)\n", "data = pd.DataFrame(\n", " rand_matrix, columns=[\"speed_initial\", \"speed_final\", \"time_initial\", \"time_final\"]\n", ")\n", "data_copy = data.copy() # Make a copy to use later\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In physics, we can calculate acceleration of an object by it's change in velocity divided by the time it took to change that velocity. Let's calculate that using the tools we've discussed alread:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speed_initialspeed_finaltime_initialtime_finalacceleration
00.2710650.7208020.7962690.433519-1.239797
10.3120850.3492550.2581580.8337840.064573
20.9863990.1296300.0591270.246308-4.577205
30.5424440.2955940.2500110.1029921.679045
40.2264100.6708190.3883600.6000592.099245
\n", "
" ], "text/plain": [ " speed_initial speed_final time_initial time_final acceleration\n", "0 0.271065 0.720802 0.796269 0.433519 -1.239797\n", "1 0.312085 0.349255 0.258158 0.833784 0.064573\n", "2 0.986399 0.129630 0.059127 0.246308 -4.577205\n", "3 0.542444 0.295594 0.250011 0.102992 1.679045\n", "4 0.226410 0.670819 0.388360 0.600059 2.099245" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[\"acceleration\"] = (data[\"speed_final\"] - data[\"speed_initial\"]) / (\n", " data[\"time_final\"] - data[\"time_initial\"]\n", ")\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead we can use the `eval` function, which has a syntax similar to that of `query`, but lets us do a number of common computations more clearly (and more quickly, since the code will run more quickly as well):" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speed_initialspeed_finaltime_initialtime_finalacceleration
00.2710650.7208020.7962690.433519-1.239797
10.3120850.3492550.2581580.8337840.064573
20.9863990.1296300.0591270.246308-4.577205
30.5424440.2955940.2500110.1029921.679045
40.2264100.6708190.3883600.6000592.099245
\n", "
" ], "text/plain": [ " speed_initial speed_final time_initial time_final acceleration\n", "0 0.271065 0.720802 0.796269 0.433519 -1.239797\n", "1 0.312085 0.349255 0.258158 0.833784 0.064573\n", "2 0.986399 0.129630 0.059127 0.246308 -4.577205\n", "3 0.542444 0.295594 0.250011 0.102992 1.679045\n", "4 0.226410 0.670819 0.388360 0.600059 2.099245" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_copy.eval(\n", " \"acceleration = (speed_final - speed_initial) / (time_final - time_initial)\",\n", " inplace=True,\n", ")\n", "data_copy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we could prefix a local variable name with the `@` symbol to incorporate the variable into the `eval` expression. While not appropriate for cases which use complex functions, `eval` is a helpful tool for simple expressions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "While none of the ends that were achieved by the tools in this section were unable to be reached using the tools from earlier parts of this course, the `query` and `eval` tools discussed here help to make that process easier to understand, easier to implement, and more efficient for processing." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.2 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "c97f18ea1c0f4969cf594a5df9f14ba2a838cf106fc5300ddecce29d9d6f0c71" } } }, "nbformat": 4, "nbformat_minor": 2 }