Combining datasets: joins

Combining datasets: joins#

In the last lesson, we saw how concatenate can be used to append rows or columns to a DataFrame. However, we also saw the potential pitfalls where there are data mismatches between the two datasets. In this lesson, we’ll explore merge and how to bring datasets together in ways the factor in the existing content of the DataFrame.

Suppose…

  • You are analyzing the accounts of a firm and were given one dataset with that contained transactions with the corresponding dollar amounts and vendor ID number and a second dataset that contained the vendor ID and the name and industry of the vendor. How could you determine which vendor corresponds to each transaction?

  • You are developing a system to compute the credits that each student received from taking classes. You have one dataset that has the students and the name of the class they’re taking, and a separate dataset that list the classes and the number of credits each class provides. How do you determine the number of credits each student is taking?

  • You are on the marketing team where each employee is signed to one of the clients of the firm. In most cases, multiple employees are signed to a single client (forming the client team) and those employees may, themselves work with multiple clients (they are part of multiple teams). Each client has one or more products. How do we get a list of each product that each employee is working on?

We can address each of the above issues by combining datasets using merge. In fact, we’ll see that each of the above examples represents a different type of merge (one-to-one, one-to-many, many-to-many, respectively).

To help us with this discussion, let’s again create two DataFrames that we want to combine; these will be a little different this time:

import pandas as pd

pd.set_option("mode.copy_on_write", True)

dfa = pd.DataFrame(
    {
        "animals": ["dog", "cat", "bird", "fish"],
        "location": ["land", "land", "air", "water"],
        "has_fur": [True, True, False, False],
    }
)
dfa
animals location has_fur
0 dog land True
1 cat land True
2 bird air False
3 fish water False
dfb = pd.DataFrame(
    {"animals": ["fish", "elephant", "blue whale"], "avg_weight": [4, 12000, 300000]}
)
dfb
animals avg_weight
0 fish 4
1 elephant 12000
2 blue whale 300000

There are a few differences to note here. First, there is one shared column in each DataFrame and that is ‘animals’; the rest are unique. Also, while most of the animals are distinct between these two datasets. To that end, there are 3 ways in which we could merge these DataFrames together referring to the uniqueness of the matches between one dataset and the other:

  1. One-to-one. Each entry in a column of one dataset matches to at most one entry in the column of the second dataset (also, the merge keys are unique in both datasets). This is the simplest case to encounter.

  2. One-to-many. For the column to merge on, there may only be one row with each unique value in that column in one of the datasets, while there may be multiple values in the corresponding column in the other dataset.

  3. Many-to-many. There may be multiple rows containing the same value in the merge column for each of the datasets.

Each of these above situations can be handled with the merge method. The types of joins are illustrated graphically below:

Types of Joins

The above examples each work exactly as written. You’ll notice two parameters, “how” and “on”. The parameter “on” specifies which column to compare in each of the two DataFrames, assuming they are named the same in both (if not, you can use “left_on” and “right_on” to specify them separately). The parameter “how” specifies how the merge is to be performed, but we will discuss that more later in this lesson; for now, we use the “left” join, and just know this means we’re always using the left DataFrame as out point of reference for comparing columns. Let’s verify that each of the above merges work as expected. First, the one-to-one:

a = pd.DataFrame(
    data={
        "name": ["Mia", "Lucas", "Ang", "Jia"],
        "course": ["math", "chem", "econ", "chem"],
    }
)
a
name course
0 Mia math
1 Lucas chem
2 Ang econ
3 Jia chem
b = pd.DataFrame(
    data={"name": ["Lucas", "Ang", "Jia", "Mia"], "grade": ["B-", "B", "A-", "B+"]}
)
b
name grade
0 Lucas B-
1 Ang B
2 Jia A-
3 Mia B+
merged = pd.merge(a, b, how="left", on="name")
merged
name course grade
0 Mia math B+
1 Lucas chem B-
2 Ang econ B
3 Jia chem A-

Next the one-to-many:

c = pd.DataFrame(data={"course": ["chem", "econ", "math"], "credits": [4, 3, 1.5]})
c
course credits
0 chem 4.0
1 econ 3.0
2 math 1.5
merged = pd.merge(c, a, how="left", on="course")
merged
course credits name
0 chem 4.0 Lucas
1 chem 4.0 Jia
2 econ 3.0 Ang
3 math 1.5 Mia

And finally, the many-to-many:

d = pd.DataFrame(
    data={
        "course": ["chem", "chem", "econ", "math"],
        "topic": ["atoms", "lab", "trade", "calc."],
    }
)
d
course topic
0 chem atoms
1 chem lab
2 econ trade
3 math calc.
merged = pd.merge(a, d, how="left", on="course")
merged
name course topic
0 Mia math calc.
1 Lucas chem atoms
2 Lucas chem lab
3 Ang econ trade
4 Jia chem atoms
5 Jia chem lab

Each of these are very clean examples of the three types of joins: one-to-one, one-to-many, and many-to-many. Incomplete data are common, however, when either a or b are missing some of the content that would otherwise allow for a match. We will explore how to handle these situations with merge in the next lesson.