Okay, we're going to start by merging the orders and the products DataFrames. To do that, I'm going to start with a new variable called df_merged. So, I've got a new DataFrame called df_merged, and we're going to keep adding different DataFrames to it.
Firstly, I'm going to start with df_orders, which is considered the left table as I'm referencing it first. Then, I'm going to use the merge function. My first parameter is going to be the other DataFrame that I want to merge with df_orders, which is df_products. For my next parameter, I specify how I want to join it, and this is going to be a left join. Then, my third parameter is going to be on, where I specify the common column. In this case, it is product_id. We have product_id in both DataFrames, so I'll put product_id, close that string, and then I'm going to return df_merged so we can check it. I'll just show three rows of data.
Okay, so scroll down, and you can see we have the order_date, order_customer_id, product_id, discount, quantity, and date. Then, we have the product_name, brand, category, subcategory_1, subcategory_2, unit_price, and unit_cost. You can see we've merged those two DataFrames into one new DataFrame named df_merged.
Now, for this next step, we're going to add the customers DataFrame to the merged DataFrame, but only add the customer and salesperson_id columns. We don't want to bring in the location columns like postcode, city, country, and region. In the previous step, we brought through all the columns from the right-hand table. In this next step, we want to only bring in specific columns.
I'll start by giving an example of what's not going to work. If we take the same approach as we did in the last step, referencing df_merged, and try to merge df_customers, specifying how to join it, we might get an error. The issue here is that the common column between the two is customer_key in one table and customer_id in the other. They've got different names, so we can't just reference customer_id because it doesn't recognize that in the customers table. We need to make a change here.
So, I'll get rid of that and amend it to use left_on and right_on. Here, we specify the name of the common column in the left-hand table, which is the merged table (effectively this), so that's customer_id. Then, we add another parameter, right_on, which is the column name from the right-hand table (the customers table), and that's customer_key.
Now, when we return that, we'll get a result without errors. We'll see we've got the merged DataFrame up to unit_cost, and then we bring through the customer_key and all other columns from the customers table. But that's not quite what we wanted. We didn't want to bring in the location data, so we can amend this by specifying only the columns we want to bring through from the df_customers DataFrame.
We'll start with square brackets and put a list of the columns we need. We need to bring through customer_key for the join, although we can subsequently drop it from the DataFrame. So, we'll start with customer_key, then bring through customer and salesperson_id.
So now, we have merged the DataFrames and brought in only the specified columns. We're doing really well, and there's one DataFrame left to bring through, which is the employees DataFrame. The next step asks us to add the employees DataFrame to the merge, only adding the first_name and second_name columns. We also need to remove the ID and key columns.
We'll use the previous technique to specify the columns. We'll start by copying the previous step, then change some DataFrame names, variable names, and column names. We'll keep df_merged and merge it with the employees table. We're joining the salesperson_id column with the employee_id column, so the left_on parameter will be salesperson_id, and the right_on parameter will be employee_id.
We'll specify which columns we want to bring through: employee_id, first_name, and second_name. Now, we'll have salesperson_id, employee_id, first_name, and second_name.
To remove the ID columns, we'll list the columns to drop: customer_id, product_id, customer_key, salesperson_id, and employee_id. We'll make these changes to the df_merged DataFrame.
So, we've successfully merged DataFrames using a left join, overcome issues with common columns having different names, brought in specific columns, and dropped unwanted columns. From all these DataFrames, we now have a df_merged version that contains only the relevant data, without the ID and key columns, and without the location data or job title and grade of the employees or salespeople.