data science, pandas, python,

How to remove duplicate data from python dataframe

Posted on Oct 25, 2019 · 3 mins read
Share this

Not all data are perfect and we really need to get duplicate data removed from our dataset most of the time. it looks easy to clean up the duplicate data but in reality it isn’t. Sometimes you want to just remove the duplicates from one or more columns and the other time you want to delete duplicates based on some random condition. So we will see in this post how to easily and efficiently you can remove the duplicate data using drop_duplicates() function in pandas

Create Dataframe with Duplicate data

import pandas as pd
df = pd.DataFrame({'Name' : ['John','Harry','Gary','Richard','Anna','Richard','Gary'], 'Age' : [25,32,37,43,44,56,37],  'Zone' : ['East','West','North','South','East','West','North']})

Drop the Duplicate rows

The row at index 2 and 6 in above dataframe are duplicates and all the three columns Name, Age and Zone matches for these two rows. Now we will remove all the duplicate rows from the dataframe using drop_duplicates() function

df.drop_duplicates()

Drop Duplicates from a specific Column and Keep last row

We will group the rows for each zone and just keep the last in each group i.e. For Zone East we have two rows in original dataframe i.e. index 0 and 4 and we want to keep only index 4 in this zone

df.drop_duplicates('Zone',keep='last')

Drop Duplicates from a specific Column and Keep first row

We will group the rows for each zone and just keep the first in each group i.e. For Zone East we have two rows in original dataframe i.e. index 0 and 4 and we want to keep only index 0 in this zone

df.drop_duplicates('Zone',keep='first')

Drop all Duplicates from a specific Column

We will drop the zone wise duplicate rows in the original dataframe, Just change the value of Keep to False

df.drop_duplicates('Zone',keep=False)

Drop Duplicates in a group but keep the row with maximum value

We will keep the row with maximum aged person in each zone. So we will sort the rows by Age first in ascending order and then drop the duplicates in Zone column and set the Keep parameter to Last.

df.sort_values('Age',ascending=True).drop_duplicates('Zone',keep='last')

Drop Duplicates in a group but keep the row with minimum value

We will keep the row with minimum aged person in each zone. So we will sort the rows by Age first in descending order and then drop the duplicates in Zone column and set the Keep parameter to Last.

df.sort_values('Age',ascending=False).drop_duplicates('Zone',keep='last')

Drop Duplicates based on condition

We will remove duplicates based on the Zone column and where age is greater than 30

df[df.Age.gt(30) & ~(df.duplicated(['Zone'])) ]

Drop Duplicates across multiple Columns using Subset parameter

You can drop duplicates from multiple columns as well. just add them as list in subset parameter. Our original dataframe doesn’t have any such value so I will create a dataframe and remove the duplicates from more than one column

Here is a dataframe with row at index 0 and 7 as duplicates with same

import pandas as pd
df = pd.DataFrame({'Name' : ['John','Harry','Gary','Richard','Anna','Richard','Gary','John'], 'Age' : [25,32,37,43,44,56,37,25],  'Zone' : ['East','West','North','South','East','West','North','West']})

Now drop one of the two rows by setting the subset parameter as a list of column

df.drop_duplicates(subset=['Name','Age'])

Drop Duplicates from Series

We can also drop duplicates from a Pandas Series . here is a series with multiple duplicate rows

a = pd.Series([1,2,3,3,2,2,1,4,5,6,6,7,8], index=[0,1,2,3,4,5,6,7,8,9,10,11,12])
a

Drop Duplicates and reset Index

We will remove the duplicates from series index and reset the index using reset_index() function else it will have the original index from the Series after dropping the Duplicates

a.drop_duplicates().reset_index(drop=True)