Merging Data
Contents
Merging Data#
In this section we investigate multiple ways to merge two DataFrames depending on the information we want to combine.
The merge function is the most versatile pandas feature for combining DataFrames. The general syntax to merge two DataFrames is:
pd.merge(df_left, df_right)
or alternately, dependent on preference, the below format can be used.
df_left.merge(df_right),
where
df_left and df_right are the two dataframes of interest.
By default, this joins common column names and takes all common rows to make up the combined version, preserving the order of the left DataFrame. Returning a combined DataFrame that contains only the matching rows is called an inner join; this is also the intersection of the two DataFrames.
To illustrate, we consider the following two DataFrames containing nutritional information and pricing on a subset of McDonald’s menu items. To easily analyze this information, it is useful to have it contained in one DataFrame.
The information used to create these DataFrames comes from a subset of the nutritional information, found on Kaggle and from McDonalds, in addition to price list data containing the average McDonald’s menu prices in the U.S. from 4338 locations.
nutritional_info = pd.DataFrame({
'Item': ['Big Mac', 'Medium French Fries', 'Cheeseburger', 'McChicken', 'Hot Fudge Sundae', 'Sausage Burrito', 'Baked Apple Pie'],
'Calories': [530, 340, 290, 360, 330, 300, 250],
'Protein': [24, 4, 15, 14, 8, 12, 2],
})
nutritional_info
Item | Calories | Protein | |
---|---|---|---|
0 | Big Mac | 530 | 24 |
1 | Medium French Fries | 340 | 4 |
2 | Cheeseburger | 290 | 15 |
3 | McChicken | 360 | 14 |
4 | Hot Fudge Sundae | 330 | 8 |
5 | Sausage Burrito | 300 | 12 |
6 | Baked Apple Pie | 250 | 2 |
menu = pd.DataFrame({
'Item': ['Big Mac', 'Cheeseburger', 'McChicken', 'Hot Fudge Sundae', 'Egg McMuffin', 'Medium French Fries', 'Sausage Burrito'],
'Price': [5.47, 1.81, 2.01, 2.68, 4.43, 2.77, 2.10],
'Category': ['Lunch/Dinner', 'Lunch/Dinner', 'Lunch/Dinner', 'Dessert', 'Breakfast', 'Lunch/Dinner', 'Breakfast'],
})
menu
Item | Price | Category | |
---|---|---|---|
0 | Big Mac | 5.47 | Lunch/Dinner |
1 | Cheeseburger | 1.81 | Lunch/Dinner |
2 | McChicken | 2.01 | Lunch/Dinner |
3 | Hot Fudge Sundae | 2.68 | Dessert |
4 | Egg McMuffin | 4.43 | Breakfast |
5 | Medium French Fries | 2.77 | Lunch/Dinner |
6 | Sausage Burrito | 2.10 | Breakfast |
We can combine the nutritional information with the menu using the merge function.
This automatically merges on the shared column name Item
. Even though the row order of the two DataFrames is not the same, the merge recognizes this and pairs the rows in the correct order. In fact, by default the ordering of the left DataFrame input is kept – nutritional_info
.
combined_menu = pd.merge(nutritional_info, menu)
combined_menu
Item | Calories | Protein | Price | Category | |
---|---|---|---|---|---|
0 | Big Mac | 530 | 24 | 5.47 | Lunch/Dinner |
1 | Medium French Fries | 340 | 4 | 2.77 | Lunch/Dinner |
2 | Cheeseburger | 290 | 15 | 1.81 | Lunch/Dinner |
3 | McChicken | 360 | 14 | 2.01 | Lunch/Dinner |
4 | Hot Fudge Sundae | 330 | 8 | 2.68 | Dessert |
5 | Sausage Burrito | 300 | 12 | 2.10 | Breakfast |
Merge Options#
If we want to combine DataFrames in a different way, other than the default intersection of the given DataFrames, we can change the how
argument. We specify how = left or how = right to include information contained only in one DataFrame. The left
or right
option will preserve all rows of the left or right DataFrame, respectively. Any information not present will be labeled as NaN
– “not a number.”
Different options to merge include: left
, right
, inner
, and outer
.
We summarize the merge options below.
We see below that merging on the right DataFrame includes all rows in the menu
DataFrame including the item “Egg McMuffin”. Since this item does not have calorie or protein information from the nutritional_info
DataFrame, its entry is NaN
in the merged DataFrame.
menu_right = pd.merge(nutritional_info, menu, how='right')
menu_right
Item | Calories | Protein | Price | Category | |
---|---|---|---|---|---|
0 | Big Mac | 530.0 | 24.0 | 5.47 | Lunch/Dinner |
1 | Cheeseburger | 290.0 | 15.0 | 1.81 | Lunch/Dinner |
2 | McChicken | 360.0 | 14.0 | 2.01 | Lunch/Dinner |
3 | Hot Fudge Sundae | 330.0 | 8.0 | 2.68 | Dessert |
4 | Egg McMuffin | NaN | NaN | 4.43 | Breakfast |
5 | Medium French Fries | 340.0 | 4.0 | 2.77 | Lunch/Dinner |
6 | Sausage Burrito | 300.0 | 12.0 | 2.10 | Breakfast |
While the default inner merge, as mentioned above, contains the intersection of all rows of both DataFrames, the outer merge contains all the row entries from both DataFrames. The inner merge ensures all entries are meaningful in the combined DataFrame, whereas the outer merge may contain NaN
entries, which relay that no information is present.
menu_outer = pd.merge(nutritional_info, menu, how='outer')
menu_outer
Item | Calories | Protein | Price | Category | |
---|---|---|---|---|---|
0 | Big Mac | 530.0 | 24.0 | 5.47 | Lunch/Dinner |
1 | Medium French Fries | 340.0 | 4.0 | 2.77 | Lunch/Dinner |
2 | Cheeseburger | 290.0 | 15.0 | 1.81 | Lunch/Dinner |
3 | McChicken | 360.0 | 14.0 | 2.01 | Lunch/Dinner |
4 | Hot Fudge Sundae | 330.0 | 8.0 | 2.68 | Dessert |
5 | Sausage Burrito | 300.0 | 12.0 | 2.10 | Breakfast |
6 | Baked Apple Pie | 250.0 | 2.0 | NaN | NaN |
7 | Egg McMuffin | NaN | NaN | 4.43 | Breakfast |
Merging One to Many#
Notice that in merging nutritional information with menu information in the example above, each row entry is paired with at most one row entry in the resulting combined DataFrame. That is, the Cheeseburger’s nutritional information is paired with the Cheeseburger’s price and no other pair is made with the Cheeseburger. It is also common to merge columns that contain multiple or repeated entries.
For example, suppose we introduce an additional DataFrame containing information about when each type of meal is served:
times = pd.DataFrame({
'Meal': ['Breakfast', 'Lunch/Dinner', 'Dessert'],
'Time Served': ['5am-11am', '11am-5am', 'Ice Cream Machine Broken'],
})
times
Meal | Time Served | |
---|---|---|
0 | Breakfast | 5am-11am |
1 | Lunch/Dinner | 11am-5am |
2 | Dessert | Ice Cream Machine Broken |
We see that the “Lunch/Dinner” option as well as the “Breakfast” option are repeated in menu_right
, so the resulting merge of times
and menu_right
will repeat these in multiple rows as necessary.
Differing Column Names#
We also encounter that the column names are not the same between the two DataFrames. We can still merge times
with menu_right
by specifying which column is our merge column from the left and right DataFrames. From the left DataFrame – menu_right
– we’ll choose the column “Category” and from the right – times
– we’ll choose the column “Meal”.
The left_on and right_on arguments in the merge
call are used to specify which column name we want to merge on from the corresponding DataFrames.
menu_times = pd.merge(menu_right, times, left_on='Category', right_on='Meal')
menu_times
Item | Calories | Protein | Price | Category | Meal | Time Served | |
---|---|---|---|---|---|---|---|
0 | Big Mac | 530.0 | 24.0 | 5.47 | Lunch/Dinner | Lunch/Dinner | 11am-5am |
1 | Cheeseburger | 290.0 | 15.0 | 1.81 | Lunch/Dinner | Lunch/Dinner | 11am-5am |
2 | McChicken | 360.0 | 14.0 | 2.01 | Lunch/Dinner | Lunch/Dinner | 11am-5am |
3 | Medium French Fries | 340.0 | 4.0 | 2.77 | Lunch/Dinner | Lunch/Dinner | 11am-5am |
4 | Hot Fudge Sundae | 330.0 | 8.0 | 2.68 | Dessert | Dessert | Ice Cream Machine Broken |
5 | Egg McMuffin | NaN | NaN | 4.43 | Breakfast | Breakfast | 5am-11am |
6 | Sausage Burrito | 300.0 | 12.0 | 2.10 | Breakfast | Breakfast | 5am-11am |
Notice that this merge resulted in the columns “Category” and “Meal”, which contain the same information.
We’ll revise this merged menu_times
to get rid of the duplicate columns with the pandas drop
function. This allows us to specify the column, or row, we want to remove from the DataFrame. To remove the column “Meal” we specify this argument, and set the keyword argument axis
to either the value 'columns'
or 1
. (To remove a row we would set axis=0
, which is the default.)
menu_times_revised = menu_times.drop('Meal', axis='columns')
menu_times_revised
Item | Calories | Protein | Price | Category | Time Served | |
---|---|---|---|---|---|---|
0 | Big Mac | 530.0 | 24.0 | 5.47 | Lunch/Dinner | 11am-5am |
1 | Cheeseburger | 290.0 | 15.0 | 1.81 | Lunch/Dinner | 11am-5am |
2 | McChicken | 360.0 | 14.0 | 2.01 | Lunch/Dinner | 11am-5am |
3 | Medium French Fries | 340.0 | 4.0 | 2.77 | Lunch/Dinner | 11am-5am |
4 | Hot Fudge Sundae | 330.0 | 8.0 | 2.68 | Dessert | Ice Cream Machine Broken |
5 | Egg McMuffin | NaN | NaN | 4.43 | Breakfast | 5am-11am |
6 | Sausage Burrito | 300.0 | 12.0 | 2.10 | Breakfast | 5am-11am |