pandas count duplicate rows
DataFrames are a powerful tool for working with data in Python, and Pandas provides a number of ways to count duplicate rows in a DataFrame.
In this article, we’ll explore a few of the most common methods and highlight their advantages and disadvantages.
We will be using the following functions:
pivot_table()
groupby()
value_counts()
lambda
duplicated()
len() with drop_duplicates()
We’ll work through each method on a sample DataFrame with 5 columns and 15 rows, including code examples you can adapt to your own data.
Create DataFrame
Let’s start by creating a test DataFrame with duplicate rows:
# importing required libraries
import pandas as pd
# creating dataframe
df = pd.DataFrame(
    {
        'Name': ['John', 'Mary', 'John',
                 'John', 'John', 'Mary',
                 'Mary', 'John', 'John',
                 'John', 'Mary', 'John',
                 'John', 'Mary', 'John'],
        'Age': [30, 23, 30, 30, 30, 23, 23,
                30, 30, 30, 23, 30, 30, 23,
                30],
        'Gender': ['M', 'F', 'M', 'M', 'M',
                   'F', 'F', 'M', 'M', 'M',
                   'F', 'M', 'M', 'F', 'M'],
        'Job': ['Doctor', 'Teacher', 'Student',
                'Doctor', 'Student', 'Teacher',
                'Teacher', 'Doctor', 'Student',
                'Doctor', 'Teacher', 'Student',
                'Doctor', 'Teacher', 'Student'],
        'Salary': [100000, 50000, 30000,
                   100000, 30000, 50000,
                   50000, 100000, 30000,
                   100000, 50000, 30000,
                   100000, 50000, 30000]
    })
# displaying the dataframe
df
Out:
| | Name | Age | Gender | Job | Salary |
|---|---|---|---|---|---|
| 0 | John | 30 | M | Doctor | 100000 |
| 1 | Mary | 23 | F | Teacher | 50000 |
| 2 | John | 30 | M | Student | 30000 |
| 3 | John | 30 | M | Doctor | 100000 |
| 4 | John | 30 | M | Student | 30000 |
| 5 | Mary | 23 | F | Teacher | 50000 |
| 6 | Mary | 23 | F | Teacher | 50000 |
| 7 | John | 30 | M | Doctor | 100000 |
| 8 | John | 30 | M | Student | 30000 |
| 9 | John | 30 | M | Doctor | 100000 |
| 10 | Mary | 23 | F | Teacher | 50000 |
| 11 | John | 30 | M | Student | 30000 |
| 12 | John | 30 | M | Doctor | 100000 |
| 13 | Mary | 23 | F | Teacher | 50000 |
| 14 | John | 30 | M | Student | 30000 |
Now that we have our DataFrame created, let’s look at some of the ways we can count duplicate rows.
Count Duplicates
pivot_table
First up is the pivot_table function.
Passed the columns to group by (as index or columns) together with aggfunc='size', it groups the DataFrame by those columns and counts the rows in each group.
The result is a Series (or DataFrame) holding the duplicate count for each group.
pivot_table - index value
count_dup = df.pivot_table(index=['Name'], aggfunc='size')
Name | |
---|---|
John | 10 |
Mary | 5 |
pivot_table - multiple index
count_dup = df.pivot_table(index=['Name', 'Job'], aggfunc='size')
Name | Job | |
---|---|---|
John | Doctor | 5 |
John | Student | 5 |
Mary | Teacher | 5 |
pivot_table - multiple columns
count_dup = df.pivot_table(
    columns=['Name', 'Age', 'Gender', 'Job', 'Salary'],
    aggfunc='size'
)
| Name | Age | Gender | Job | Salary | |
|---|---|---|---|---|---|
| John | 30 | M | Doctor | 100000 | 5 |
| John | 30 | M | Student | 30000 | 5 |
| Mary | 23 | F | Teacher | 50000 | 5 |
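If you’d rather have the counts as an ordinary column instead of reading them off a (multi-)index, the pivot_table result can be flattened with reset_index. A minimal sketch on reduced toy data (not the full sample DataFrame):

```python
import pandas as pd

# Toy data standing in for the larger sample DataFrame
df = pd.DataFrame({'Name': ['John', 'Mary', 'John', 'Mary'],
                   'Job':  ['Doctor', 'Teacher', 'Doctor', 'Teacher']})

# aggfunc='size' yields a Series indexed by the grouping columns;
# reset_index(name='count') flattens it into an ordinary DataFrame
count_dup = (
    df.pivot_table(index=['Name', 'Job'], aggfunc='size')
      .reset_index(name='count')
)
print(count_dup)
```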
groupby
Next is the groupby function.
Grouping by one or more columns and calling size() counts the rows in each group.
The result is a Series with the duplicate count for each unique combination of the grouping columns.
groupby - column
count_dup = df.groupby('Name').size()
Name | |
---|---|
John | 10 |
Mary | 5 |
groupby - multiple columns
count_dup = df.groupby(['Name', 'Age', 'Gender', 'Job', 'Salary']).size()
| Name | Age | Gender | Job | Salary | |
|---|---|---|---|---|---|
| John | 30 | M | Doctor | 100000 | 5 |
| John | 30 | M | Student | 30000 | 5 |
| Mary | 23 | F | Teacher | 50000 | 5 |
groupby - reset_index
count_dup = (
    df.groupby(df.columns.tolist())
    .size()
    .reset_index()
    .rename(columns={0: 'count'})
)
| | Name | Age | Gender | Job | Salary | count |
|---|---|---|---|---|---|---|
| 0 | John | 30 | M | Doctor | 100000 | 5 |
| 1 | John | 30 | M | Student | 30000 | 5 |
| 2 | Mary | 23 | F | Teacher | 50000 | 5 |
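One useful follow-up: once the counts sit in a regular column, you can filter on it to keep only the combinations that actually repeat. A sketch on toy data (not the sample DataFrame above, where every row repeats):

```python
import pandas as pd

# Toy data: only the John/Doctor row repeats
df = pd.DataFrame({'Name': ['John', 'Mary', 'John', 'Anna'],
                   'Job':  ['Doctor', 'Teacher', 'Doctor', 'Nurse']})

counts = (
    df.groupby(df.columns.tolist())
    .size()
    .reset_index()
    .rename(columns={0: 'count'})
)

# Keep only the combinations that occur more than once,
# i.e. the actual duplicates
dups_only = counts[counts['count'] > 1]
print(dups_only)
```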
groupby - reset_index by column
count_dup = (
    df.groupby('Job')
    .size()
    .reset_index()
    .rename(columns={0: 'count'})
)
| | Job | count |
|---|---|---|
| 0 | Doctor | 5 |
| 1 | Student | 5 |
| 2 | Teacher | 5 |
value_counts
Next is the value_counts method.
Called on a single column it counts how many times each unique value appears; called on several columns (or the whole DataFrame) it counts each unique combination of values.
The result is a Series of counts, sorted in descending order by default.
value_counts - column as index
count_dup = df['Name'].value_counts().reset_index(name = 'count')
| | index | count |
|---|---|---|
| 0 | John | 10 |
| 1 | Mary | 5 |
value_counts - column as value
count_dup = df['Job'].value_counts()
counts = df[['Job']].apply(pd.Series.value_counts)
Both produce the same counts; the second form returns a DataFrame rather than a Series (pd.Series.value_counts is used here because the top-level pd.value_counts is deprecated in recent pandas).
| | |
|---|---|
| Doctor | 5 |
| Teacher | 5 |
| Student | 5 |
value_counts - multiple columns
count_dup = df[['Name', 'Age']].value_counts().reset_index(name = 'count')
| | Name | Age | count |
|---|---|---|---|
| 0 | John | 30 | 10 |
| 1 | Mary | 23 | 5 |
value_counts - all columns
count_dup = df.value_counts().reset_index(name = 'count')
| | Name | Age | Gender | Job | Salary | count |
|---|---|---|---|---|---|---|
| 0 | John | 30 | M | Doctor | 100000 | 5 |
| 1 | John | 30 | M | Student | 30000 | 5 |
| 2 | Mary | 23 | F | Teacher | 50000 | 5 |
lambda
We can also combine groupby with a lambda function.
Applying lambda x: len(x.dropna()) to each group counts the rows in that group; dropna() skips rows with missing values, which makes no difference for this sample data.
The result is a Series with the row count for each group.
lambda - column
count_dup = df.groupby('Name').apply(lambda x: len(x.dropna()))
count_dup = count_dup.to_frame(name = 'count').reset_index()
| | Name | count |
|---|---|---|
| 0 | John | 10 |
| 1 | Mary | 5 |
lambda - multiple columns
count_dup = df.groupby(['Name', 'Job']).apply(lambda x: len(x.dropna()))
count_dup = count_dup.to_frame(name = 'count').reset_index()
| | Name | Job | count |
|---|---|---|---|
| 0 | John | Doctor | 5 |
| 1 | John | Student | 5 |
| 2 | Mary | Teacher | 5 |
lambda - total count of duplicates
This counts duplicates across the whole DataFrame, column by column. For each column it returns the duplicated values and how many times each appears.
count_dup = df.apply(lambda x: ' '.join(
    [f'[value = {i}, count = {v}]'
     for i, v in x.value_counts().items() if v > 1]
))
| | |
|---|---|
| Name | [value = John, count = 10] [value = Mary, coun… |
| Age | [value = 30, count = 10] [value = 23, count = 5] |
| Gender | [value = M, count = 10] [value = F, count = 5] |
| Job | [value = Doctor, count = 5] [value = Teacher, … |
| Salary | [value = 100000, count = 5] [value = 50000, co… |
lambda - total count of duplicates with Counter
This gives a similar result to the code above, but uses Counter from the standard library. It shows how many times each duplicated value appears.
from collections import Counter

def count(x):
    return {v: c for v, c in x.items() if c > 1}

count_dup = df.apply(lambda x: count(Counter(x)))
| | |
|---|---|
| Name | {'John': 10, 'Mary': 5} |
| Age | {30: 10, 23: 5} |
| Gender | {'M': 10, 'F': 5} |
| Job | {'Doctor': 5, 'Teacher': 5, 'Student': 5} |
| Salary | {100000: 5, 50000: 5, 30000: 5} |
duplicated
Next is the duplicated method.
It returns a Boolean Series flagging the duplicate rows in the DataFrame.
By default, within each set of duplicated rows the first occurrence is marked False and all the others True.
duplicated - sum
count_dup = df.duplicated().sum()
This outputs the total number of duplicate rows in the dataframe.
Out: 12
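Because duplicated() doesn’t flag first occurrences by default, the 12 above excludes the 3 “original” rows. If you instead want to count every row that belongs to a duplicate group, duplicated accepts a keep parameter. A small sketch on toy data:

```python
import pandas as pd

# Toy data: the (John, 30) row appears three times
df = pd.DataFrame({'Name': ['John', 'Mary', 'John', 'John'],
                   'Age':  [30, 23, 30, 30]})

# Default: the first occurrence is not flagged
print(df.duplicated().sum())            # 2

# keep=False flags every member of a duplicate group
print(df.duplicated(keep=False).sum())  # 3
```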
duplicated with value_counts
count_dup = df.duplicated().value_counts()
Here, True corresponds to the duplicate rows and False to the unique (non-duplicate) rows.
| | |
|---|---|
| True | 12 |
| False | 3 |
len with drop_duplicates
Finally, we can combine the len function with drop_duplicates.
Subtracting the length of the de-duplicated DataFrame (or column) from the original length gives the total number of duplicate rows.
count of total non-duplicate rows
len(df.drop_duplicates())
Out: 3
count of total duplicate values in a column
count_dup = len(df['Job']) - len(df['Job'].drop_duplicates())
Out: 12
count of total duplicate rows in the dataframe
count_dup = len(df) - len(df.drop_duplicates())
Out: 12
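drop_duplicates also accepts a subset parameter, so the same len-based arithmetic can count duplicates over just the columns you care about. A small sketch on toy data:

```python
import pandas as pd

# Toy data: 'John' appears three times in Name
df = pd.DataFrame({'Name': ['John', 'Mary', 'John', 'John'],
                   'Job':  ['Doctor', 'Teacher', 'Doctor', 'Student']})

# Count duplicates considering only the Name column
dup_by_name = len(df) - len(df.drop_duplicates(subset=['Name']))
print(dup_by_name)  # 2
```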
So, there are a number of ways to count duplicate rows in a Pandas DataFrame.
Each of these methods has its own advantages and disadvantages, so it’s important to understand the different use cases for each one.
With this knowledge, you should be able to count duplicate rows in your own Pandas DataFrames with ease.