Queries reprised: asking your data questions

Queries reprised: asking your data questions#

Much of this week has focused on data preparation: combining datasets together, reconfiguring and aggregating them into groups, all for the purpose of preparing the data to help us develop insights and ask questions. In this lesson, we’ll be returning to how we can programmatically ask questions of our datasets using queries. We have seen all of this before when we discussed subsetting our data and filtering based on different logical expressions. Here we will introduce a compact way to execute such queries to add another tool into your programmatic toolbox for working with data. A few of you may have encountered structured query language (SQL) before, and if so, there will be many connections to what we discuss today.

Let’s start with our sales data from a previous lesson:

import pandas as pd

pd.set_option("mode.copy_on_write", True)

sales = pd.DataFrame(
    data={
        "employee": [
            "Katrina",
            "Guanyu",
            "Jan",
            "Roman",
            "Jacqueline",
            "Paola",
            "Esperanza",
            "Alaina",
            "Egweyn",
        ],
        "sales": [14, 17, 6, 12, 8, 3, 7, 15, 5],
        "year": [2018, 2019, 2020, 2018, 2020, 2019, 2019, 2020, 2020],
    }
)
sales

	employee	sales	year
0	Katrina	14	2018
1	Guanyu	17	2019
2	Jan	6	2020
3	Roman	12	2018
4	Jacqueline	8	2020
5	Paola	3	2019
6	Esperanza	7	2019
7	Alaina	15	2020
8	Egweyn	5	2020

To query a pandas DataFrame, we need to use the query method which takes a query string. This string allows you to ask a world of questions of your data for example, see the table below for a set of examples of quieries and their corresponding query strings for our sales dataset:

Query	Query string
Show me sales greater than 10	“sales > 10”
Show me data from 2018	“year == 2018”
Show me sales are greater than 13 and the year is 2018	“sales > 13 and year == 2018”
Show me everything EXCEPT for when sales are greater than 13 and the year is 2018	“not (sales > 13 and year == 2018)”
Show me data where sales divided by 3 are greater than 3	“sales/3 > 3”
Show me employees whose names are alphabetically after J	“employee > ‘J’”

Let’s show each of these queries in action below:

sales.query("sales > 10")

	employee	sales	year
0	Katrina	14	2018
1	Guanyu	17	2019
3	Roman	12	2018
7	Alaina	15	2020

sales.query("year == 2018")

	employee	sales	year
0	Katrina	14	2018
3	Roman	12	2018

sales.query("sales > 13 and year == 2018")

	employee	sales	year
0	Katrina	14	2018

sales.query("not (sales > 13 and year == 2018)")

	employee	sales	year
1	Guanyu	17	2019
2	Jan	6	2020
3	Roman	12	2018
4	Jacqueline	8	2020
5	Paola	3	2019
6	Esperanza	7	2019
7	Alaina	15	2020
8	Egweyn	5	2020

sales.query("sales/3 > 3")

	employee	sales	year
0	Katrina	14	2018
1	Guanyu	17	2019
3	Roman	12	2018
7	Alaina	15	2020

sales.query('employee > "J"')

	employee	sales	year
0	Katrina	14	2018
2	Jan	6	2020
3	Roman	12	2018
4	Jacqueline	8	2020
5	Paola	3	2019

Now, these same queries could have been accomplished using selection and filtering concept we discussed earlier, as shown below, but requiring more sytax (and the query method is often more computationally efficient). Let’s try rewriting sales.query('sales > 13 and year == 2018') using the techniques we described earlier:

sales[(sales["sales"] > 13) & (sales["year"] == 2018)]

	employee	sales	year
0	Katrina	14	2018

The query method makes our code far easier to follow.

Another nice feature of query is that we can bring in variables from our local workspace, but outside of the DataFrame by prefixing the @ symbol to the variable name. Let’s say we knew that each sale was worth $1,000. Then we could query based on the total dollar amount of sales as follows:

revenue_per_sale = 1000
sales.query("sales * @revenue_per_sale > 10000")

	employee	sales	year
0	Katrina	14	2018
1	Guanyu	17	2019
3	Roman	12	2018
7	Alaina	15	2020

Computation with columns#

In a similar way that we were able to create more readable and efficient queries with the query method, pandas also has a faster, more readable method for performing a number of computations on columns. let’s create a DataFrame with some data to demonstrate:

import numpy as np

rand_matrix = np.random.rand(5, 4)
data = pd.DataFrame(
    rand_matrix, columns=["speed_initial", "speed_final", "time_initial", "time_final"]
)
data_copy = data.copy()  # Make a copy to use later
data

	speed_initial	speed_final	time_initial	time_final
0	0.271065	0.720802	0.796269	0.433519
1	0.312085	0.349255	0.258158	0.833784
2	0.986399	0.129630	0.059127	0.246308
3	0.542444	0.295594	0.250011	0.102992
4	0.226410	0.670819	0.388360	0.600059

In physics, we can calculate acceleration of an object by it’s change in velocity divided by the time it took to change that velocity. Let’s calculate that using the tools we’ve discussed alread:

data["acceleration"] = (data["speed_final"] - data["speed_initial"]) / (
    data["time_final"] - data["time_initial"]
)
data

	speed_initial	speed_final	time_initial	time_final	acceleration
0	0.271065	0.720802	0.796269	0.433519	-1.239797
1	0.312085	0.349255	0.258158	0.833784	0.064573
2	0.986399	0.129630	0.059127	0.246308	-4.577205
3	0.542444	0.295594	0.250011	0.102992	1.679045
4	0.226410	0.670819	0.388360	0.600059	2.099245

Instead we can use the eval function, which has a syntax similar to that of query, but lets us do a number of common computations more clearly (and more quickly, since the code will run more quickly as well):

data_copy.eval(
    "acceleration = (speed_final - speed_initial) / (time_final - time_initial)",
    inplace=True,
)
data_copy

	speed_initial	speed_final	time_initial	time_final	acceleration
0	0.271065	0.720802	0.796269	0.433519	-1.239797
1	0.312085	0.349255	0.258158	0.833784	0.064573
2	0.986399	0.129630	0.059127	0.246308	-4.577205
3	0.542444	0.295594	0.250011	0.102992	1.679045
4	0.226410	0.670819	0.388360	0.600059	2.099245

Similarly, we could prefix a local variable name with the @ symbol to incorporate the variable into the eval expression. While not appropriate for cases which use complex functions, eval is a helpful tool for simple expressions.

Summary#

While none of the ends that were achieved by the tools in this section were unable to be reached using the tools from earlier parts of this course, the query and eval tools discussed here help to make that process easier to understand, easier to implement, and more efficient for processing.

Queries reprised: asking your data questions

Contents

Queries reprised: asking your data questions#

Computation with columns#

Summary#