2 minute read

We have two dataframes and a common column that we want to compare and find out the matching, missing values and sometimes the difference between the values using a key

We would first concatenate the two dataframes into one and see how the two dataframes looks side by side and then find out the differences between them.

However if you are interested to find the difference between two dataframes then read this post

We will follow the following steps to find the difference between a column in two dataframes:

  1. Create two dataframes - df1 and df2
  2. Concatenate two dataframes side by side
  3. Compare the column in two dataframes on a common key
  4. Additionally, find the matching rows between two dataframe
  5. find the non-matching rows between the dataframes

Let’s get started, we will first create two test dataframes(df1 & df2) to work upon

Create Two Dataframes

First Dataframe:

The first dataframe has 2 columns: Items and Sale

import pandas as pd
import numpy as np

df1 = pd.DataFrame([['A', 1], ['B', 2]],
                    columns=['Items', 'Sale'])
df1
Items Sale
A 200
B 410

Second Dataframe:

The second dataframe has three columns: Items, Sale and Category

df2 = pd.DataFrame([['A', 320, 'food'], ['B', 320, 'home'], ['C', 530, 'furniture']],
                    columns=['Items', 'Sale', 'Category'])
df2
Items Sale Category
A 320 Food
B 550 Home
C 530 Furniture

Concatenate the two dataframes

We have concatenated the two dataframes(df1 and df2) and can see them side by side and the final concatenated dataframe is stored in variable df

df=pd.concat([df1, df2],axis=1, keys = ['df1', 'df2'])
df
df1 df2
Items Sale Items Sales Category
0 A 200 A 320 Food
1 B 410 B 550 Home
2 NaN NaN C 530 Furniture

Compare the columns in two dataframe

We will find the difference between the sales value between two dataframe for each of the Items

We have added a new column called as sales-diff to find the differences between the sales value in two dataframes where the Item values are similar otherwise difference is set to 0.

numpy.where() is used to return choice depending on condition

df['sales-diff']=np.where(df['df1']['Items']==df['df2']['Items'],
                    (df['df1']['Sale']-df['df2']['Sale']),
                    0)

We’ve got a new column that shows exactly the difference between the Sales column between df2 and df1

df1 df2 sales-diff
Items Sale Items Sales Category
0 A 200 A 320 Food 120
1 B 410 B 550 Home 140
2 NaN NaN C 530 Furniture 0

Non-matching rows between two dataframes

Let’s find the rows not matching between two dataframes(df1 and df2) based on column Items i.e. Elements of Series df1[‘Items’] which are not in df2[‘Items’]

df[~df['df1']['Items'].isin(df['df2']['Items'])]
df1 df2 sales-diff
Items Sale Items Sales Category
2 NaN NaN C 530 Furniture 0

Matching rows between two dataframes

We will find the rows matching between the two dataframes(df1 and df2) based on column Items i.e. Elements of Series df1[‘Items’] which are in df2[‘Items’]

df[df['df1']['letter']==df['df2']['letter']]
df1 df2 sales-diff
Items Sale Items Sales Category
0 A 200 A 320 Food 120
1 B 410 B 550 Home 140

Alternatively, we can use pandas.merge() to merge the two dataframes(df1 and df2) on column Items and apply inner join, use intersection of keys from both dataframes, similar to a SQL inner join and preserve the order of the left keys

pd.merge(df1, df2, on='Items', how='inner')
Items Sale_x Sale_y Category
0 A 200 320 Food
1 B 410 550 Home

Tags: ,

Categories: ,

Updated: