Merging Data#

In this section we investigate multiple ways to merge two DataFrames depending on the information we want to combine.

The merge function is the most versatile pandas feature for combining DataFrames. The general syntax to merge two DataFrames is:

pd.merge(df_left, df_right)

or alternately, dependent on preference, the below format can be used.

df_left.merge(df_right),

where

  • df_left and df_right are the two dataframes of interest.

By default, this joins common column names and takes all common rows to make up the combined version, preserving the order of the left DataFrame. Returning a combined DataFrame that contains only the matching rows is called an inner join; this is also the intersection of the two DataFrames.

To illustrate, we consider the following two DataFrames containing nutritional information and pricing on a subset of McDonald’s menu items. To easily analyze this information, it is useful to have it contained in one DataFrame.

The information used to create these DataFrames comes from a subset of the nutritional information, found on Kaggle and from McDonalds, in addition to price list data containing the average McDonald’s menu prices in the U.S. from 4338 locations.

nutritional_info = pd.DataFrame({
    'Item': ['Big Mac', 'Medium French Fries', 'Cheeseburger', 'McChicken', 'Hot Fudge Sundae', 'Sausage Burrito', 'Baked Apple Pie'],
    'Calories': [530, 340, 290, 360, 330, 300, 250],
    'Protein': [24, 4, 15, 14, 8, 12, 2],
})

nutritional_info
Item Calories Protein
0 Big Mac 530 24
1 Medium French Fries 340 4
2 Cheeseburger 290 15
3 McChicken 360 14
4 Hot Fudge Sundae 330 8
5 Sausage Burrito 300 12
6 Baked Apple Pie 250 2
menu = pd.DataFrame({
    'Item': ['Big Mac', 'Cheeseburger', 'McChicken', 'Hot Fudge Sundae', 'Egg McMuffin', 'Medium French Fries', 'Sausage Burrito'],
    'Price': [5.47, 1.81, 2.01, 2.68, 4.43, 2.77, 2.10],
    'Category': ['Lunch/Dinner', 'Lunch/Dinner', 'Lunch/Dinner', 'Dessert', 'Breakfast', 'Lunch/Dinner', 'Breakfast'],
})

menu
Item Price Category
0 Big Mac 5.47 Lunch/Dinner
1 Cheeseburger 1.81 Lunch/Dinner
2 McChicken 2.01 Lunch/Dinner
3 Hot Fudge Sundae 2.68 Dessert
4 Egg McMuffin 4.43 Breakfast
5 Medium French Fries 2.77 Lunch/Dinner
6 Sausage Burrito 2.10 Breakfast

We can combine the nutritional information with the menu using the merge function.

This automatically merges on the shared column name Item. Even though the row order of the two DataFrames is not the same, the merge recognizes this and pairs the rows in the correct order. In fact, by default the ordering of the left DataFrame input is kept – nutritional_info.

combined_menu = pd.merge(nutritional_info, menu)

combined_menu
Item Calories Protein Price Category
0 Big Mac 530 24 5.47 Lunch/Dinner
1 Medium French Fries 340 4 2.77 Lunch/Dinner
2 Cheeseburger 290 15 1.81 Lunch/Dinner
3 McChicken 360 14 2.01 Lunch/Dinner
4 Hot Fudge Sundae 330 8 2.68 Dessert
5 Sausage Burrito 300 12 2.10 Breakfast

Merge Options#

If we want to combine DataFrames in a different way, other than the default intersection of the given DataFrames, we can change the how argument. We specify how = left or how = right to include information contained only in one DataFrame. The left or right option will preserve all rows of the left or right DataFrame, respectively. Any information not present will be labeled as NaN – “not a number.”

Different options to merge include: left, right, inner, and outer.

We summarize the merge options below.

../../_images/Merge_options.png

We see below that merging on the right DataFrame includes all rows in the menu DataFrame including the item “Egg McMuffin”. Since this item does not have calorie or protein information from the nutritional_info DataFrame, its entry is NaN in the merged DataFrame.

menu_right = pd.merge(nutritional_info, menu, how='right')

menu_right
Item Calories Protein Price Category
0 Big Mac 530.0 24.0 5.47 Lunch/Dinner
1 Cheeseburger 290.0 15.0 1.81 Lunch/Dinner
2 McChicken 360.0 14.0 2.01 Lunch/Dinner
3 Hot Fudge Sundae 330.0 8.0 2.68 Dessert
4 Egg McMuffin NaN NaN 4.43 Breakfast
5 Medium French Fries 340.0 4.0 2.77 Lunch/Dinner
6 Sausage Burrito 300.0 12.0 2.10 Breakfast

While the default inner merge, as mentioned above, contains the intersection of all rows of both DataFrames, the outer merge contains all the row entries from both DataFrames. The inner merge ensures all entries are meaningful in the combined DataFrame, whereas the outer merge may contain NaN entries, which relay that no information is present.

menu_outer = pd.merge(nutritional_info, menu, how='outer')

menu_outer
Item Calories Protein Price Category
0 Big Mac 530.0 24.0 5.47 Lunch/Dinner
1 Medium French Fries 340.0 4.0 2.77 Lunch/Dinner
2 Cheeseburger 290.0 15.0 1.81 Lunch/Dinner
3 McChicken 360.0 14.0 2.01 Lunch/Dinner
4 Hot Fudge Sundae 330.0 8.0 2.68 Dessert
5 Sausage Burrito 300.0 12.0 2.10 Breakfast
6 Baked Apple Pie 250.0 2.0 NaN NaN
7 Egg McMuffin NaN NaN 4.43 Breakfast

Merging One to Many#

Notice that in merging nutritional information with menu information in the example above, each row entry is paired with at most one row entry in the resulting combined DataFrame. That is, the Cheeseburger’s nutritional information is paired with the Cheeseburger’s price and no other pair is made with the Cheeseburger. It is also common to merge columns that contain multiple or repeated entries.

For example, suppose we introduce an additional DataFrame containing information about when each type of meal is served:

times = pd.DataFrame({
    'Meal': ['Breakfast', 'Lunch/Dinner', 'Dessert'],
    'Time Served': ['5am-11am', '11am-5am', 'Ice Cream Machine Broken'],
})

times
Meal Time Served
0 Breakfast 5am-11am
1 Lunch/Dinner 11am-5am
2 Dessert Ice Cream Machine Broken

We see that the “Lunch/Dinner” option as well as the “Breakfast” option are repeated in menu_right, so the resulting merge of times and menu_right will repeat these in multiple rows as necessary.

Differing Column Names#

We also encounter that the column names are not the same between the two DataFrames. We can still merge times with menu_right by specifying which column is our merge column from the left and right DataFrames. From the left DataFrame – menu_right – we’ll choose the column “Category” and from the right – times – we’ll choose the column “Meal”.

The left_on and right_on arguments in the merge call are used to specify which column name we want to merge on from the corresponding DataFrames.

menu_times = pd.merge(menu_right, times, left_on='Category', right_on='Meal')

menu_times
Item Calories Protein Price Category Meal Time Served
0 Big Mac 530.0 24.0 5.47 Lunch/Dinner Lunch/Dinner 11am-5am
1 Cheeseburger 290.0 15.0 1.81 Lunch/Dinner Lunch/Dinner 11am-5am
2 McChicken 360.0 14.0 2.01 Lunch/Dinner Lunch/Dinner 11am-5am
3 Medium French Fries 340.0 4.0 2.77 Lunch/Dinner Lunch/Dinner 11am-5am
4 Hot Fudge Sundae 330.0 8.0 2.68 Dessert Dessert Ice Cream Machine Broken
5 Egg McMuffin NaN NaN 4.43 Breakfast Breakfast 5am-11am
6 Sausage Burrito 300.0 12.0 2.10 Breakfast Breakfast 5am-11am

Notice that this merge resulted in the columns “Category” and “Meal”, which contain the same information.

We’ll revise this merged menu_times to get rid of the duplicate columns with the pandas drop function. This allows us to specify the column, or row, we want to remove from the DataFrame. To remove the column “Meal” we specify this argument, and set the keyword argument axis to either the value 'columns' or 1. (To remove a row we would set axis=0, which is the default.)

menu_times_revised = menu_times.drop('Meal', axis='columns')

menu_times_revised
Item Calories Protein Price Category Time Served
0 Big Mac 530.0 24.0 5.47 Lunch/Dinner 11am-5am
1 Cheeseburger 290.0 15.0 1.81 Lunch/Dinner 11am-5am
2 McChicken 360.0 14.0 2.01 Lunch/Dinner 11am-5am
3 Medium French Fries 340.0 4.0 2.77 Lunch/Dinner 11am-5am
4 Hot Fudge Sundae 330.0 8.0 2.68 Dessert Ice Cream Machine Broken
5 Egg McMuffin NaN NaN 4.43 Breakfast 5am-11am
6 Sausage Burrito 300.0 12.0 2.10 Breakfast 5am-11am