pandas count duplicate rows

7 minute read

DataFrames are a powerful tool for working with data in Python, and Pandas provides a number of ways to count duplicate rows in a DataFrame.

In this article, we’ll explore a few of the most common methods and highlight their advantages and disadvantages.

We will be using the following functions

pivot_table()
groupby()
value_counts()
lambda
duplicated()
len() with drop_duplicates()

We’ll also include some code examples to help you get started with counting duplicate rows in your own Pandas DataFrame.

In this blog post, we will discuss different ways to count duplicate rows in a pandas dataframe with a sample dataframe with 5 columns and 15 rows.

Create DataframePermalink

Let’s start by creating the test dataframe with duplicate rows

# importing required libraries
import pandas as pd

# creating dataframe
df = pd.DataFrame(
    {
        'Name': ['John', 'Mary', 'John',
                 'John', 'John', 'Mary',
                 'Mary', 'John', 'John',
                 'John', 'Mary', 'John',
                 'John', 'Mary', 'John'],
        'Age': [30, 23, 30, 30, 30, 23, 23,
                30, 30, 30, 23, 30, 30, 23,
                30],
        'Gender': ['M', 'F', 'M', 'M', 'M',
                   'F', 'F', 'M', 'M', 'M',
                   'F', 'M', 'M', 'F', 'M'],
        'Job': ['Doctor', 'Teacher', 'Student',
                'Doctor', 'Student', 'Teacher',
                'Teacher', 'Doctor', 'Student',
                'Doctor', 'Teacher', 'Student',
                'Doctor', 'Teacher', 'Student'],
        'Salary': [100000, 50000, 30000,
                   100000, 30000, 50000,
                   50000, 100000, 30000,
                   100000, 50000, 30000,
                   100000, 50000, 30000]
    })

# displaying the dataframe
df

Out:

	Name	Age	Gender	Job	Salary
0	John	30	M	Doctor	100000
1	Mary	23	F	Teacher	50000
2	John	30	M	Student	30000
3	John	30	M	Doctor	100000
4	John	30	M	Student	30000
5	Mary	23	F	Teacher	50000
6	Mary	23	F	Teacher	50000
7	John	30	M	Doctor	100000
8	John	30	M	Student	30000
9	John	30	M	Doctor	100000
10	Mary	23	F	Teacher	50000
11	John	30	M	Student	30000
12	John	30	M	Doctor	100000
13	Mary	23	F	Teacher	50000
14	John	30	M	Student	30000

Now that we have our DataFrame created, let’s look at some of the ways we can count duplicate rows.

Count DuplicatesPermalink

pivot_tablePermalink

We’ll discuss is the pivot_table function in this section.

This function takes in the DataFrame and the columns you want to group by, and then counts the number of duplicates in each group.

To use the pivot_table function, we’ll pass in the DataFrame, the columns we want to group by, and the values we want to count.

The pivot_table function returns a DataFrame with the counts of duplicates for each group.

pivot_table - index valuePermalink

count_dup = df.pivot_table(index = ['Name'],
                           aggfunc = 'size')

Name
John	10
Mary	5

pivot_table - multiple indexPermalink

count_dup = df.pivot_table(index = ['Name', 'Job'],
                           aggfunc = 'size')

Name	Job
John	Doctor	5
John	Student	5
Mary	Teacher	5

pivot_table - multiple columnPermalink

count_dup = df.pivot_table(columns = ['Name', 'Age',
                                      'Gender', 'Job', 'Salary'],
                           aggfunc = 'size')

Name	Age	Gender	Job	Salary
John	30	M	Doctor	100000	5
			Student	30000	5
Mary	23	F	Teacher	50000	5

groupbyPermalink

We’ll discuss is the groupby function in this section.

This function takes in the DataFrame and the columns you want to group by and returns a DataFrame with the counts of duplicates in each group.

To use the groupby function, we’ll pass in the DataFrame, the columns we want to group by, and the values we want to count.

The groupby function returns a DataFrame with the counts of duplicates for each group.

groupby - columnPermalink

count_dup = df.groupby('Name').size()

Name
John	10
Mary	5

groupby - multiple columnPermalink

count_dup = df.groupby(['Name', 'Age', 'Gender', 'Job', 'Salary']).size()

Name	Age	Gender	Job	Salary
John	30	M	Doctor	100000	5
			Student	30000	5
Mary	23	F	Teacher	50000	5

groupby - reset_indexPermalink

count_dup = (
              df.groupby(df.columns.tolist())
              .size()
              .reset_index()
              .rename(columns = {0:'count'})
            )

	Name	Age	Gender	Job	Salary	count
0	John	30	M	Doctor	100000	5
1	John	30	M	Student	30000	5
2	Mary	23	F	Teacher	50000	5

groupby - reset_index by columnPermalink

count_dup = (
             df.groupby('Job')
             .size()
             .reset_index()
             .rename(columns = {0:'count'})
            )

	Job	count
0	Doctor	5
1	Student	5
2	Teacher	5

value_countsPermalink

We’ll discuss is the value_counts function in this section.

This function takes in the DataFrame and the columns you want to count and returns a Series with the counts of unique values in each column.

To use the value_counts function, we’ll pass in the DataFrame and the columns we want to count.

value_counts - column as indexPermalink

count_dup = df['Name'].value_counts().reset_index(name = 'count')

	index	count
0	John	10
1	Mary	5

value_counts - column as valuePermalink

count_dup = df['Job'].value_counts()

counts = df[['Job']].apply(pd.value_counts)

Both give the same output.

Doctor	5
Teacher	5
Student	5

value_counts - multiple columnPermalink

count_dup = df[['Name', 'Age']].value_counts().reset_index(name = 'count')

	Name	Age	count
0	John	30	10
1	Mary	23	5

value_counts - all columnsPermalink

count_dup = df.value_counts().reset_index(name = 'count')

	Name	Age	Gender	Job	Salary	count
0	John	30	M	Doctor	100000	5
1	John	30	M	Student	30000	5
2	Mary	23	F	Teacher	50000	5

lambdaPermalink

We’ll discuss is the lambda function in this section.

This function takes in the DataFrame, the columns you want to group by, and then returns the number of unique rows in each group.

To use the lambda function, we’ll pass in the DataFrame, the columns we want to group by, and then use the len() function to count the number of unique rows.

The lambda function returns a Series with the number of unique rows in each group.

lambda - columnPermalink

count_dup = df.groupby('Name').apply(lambda x: len(x.dropna()))

count_dup = count_dup.to_frame(name = 'count').reset_index()

	Name	count
0	John	10
1	Mary	5

lambda - multiple columnPermalink

count_dup = df.groupby(['Name', 'Job']).apply(lambda x: len(x.dropna()))

count_dup = count_dup.to_frame(name = 'count').reset_index()

	Name	Job	count
0	John	Doctor	5
1	John	Student	5
2	Mary	Teacher	5

lambda - total count of duplicatesPermalink

This counts duplicates across the dataframe. It returns the value and count of how many times the duplicates appears.

count_dup = df.apply(lambda x: ' '
                     .join([f'[value = {i}, count = {v}]'
                            for i, v in x.value_counts().iteritems() if v > 1])
                    )

Name	[value = John, count = 10] [value = Mary, coun…
Age	[value = 30, count = 10] [value = 23, count = 5]
Gender	[value = M, count = 10] [value = F, count = 5]
Job	[value = Doctor, count = 5] [value = Teacher, …
Salary	[value = 100000, count = 5] [value = 50000, co…

lambda - total count of duplicatesPermalink

This gives similar result as to the above code but here we import a library to show the output. It shows how many times the duplicate values appear.

from collections import Counter

def count(x):
    return {v:c for v, c in x.items() if c > 1}

count_dup = df.apply(lambda x : count(Counter(x)))

Name	{‘John’: 10, ‘Mary’: 5}
Age	{30: 10, 23: 5}
Gender	{‘M’: 10, ‘F’: 5}
Job	{‘Doctor’: 5, ‘Teacher’: 5, ‘Student’: 5}
Salary	{100000: 5, 50000: 5, 30000: 5}

duplicatedPermalink

We’ll discuss is the duplicated function in this section.

This function takes in the DataFrame and returns a Boolean array of the duplicate rows.

To use the duplicated function, we’ll pass in the DataFrame and check for duplicates.

By default, for each set of duplicated values, the first occurrence is set on False and all others on True.

duplicated - sumPermalink

count_dup = df.duplicated().sum()

count_dup.head()

This outputs the total number of duplicate rows in the dataframe.

Out: 12

duplicated with value_countsPermalink

count_dup = df.duplicated().value_counts()

Here, True are the duplicate rows, and False are the unique rows i.e. non-duplicate.

True	12
False	3

len with drop_duplicatesPermalink

We’ll discuss is the len function with drop_duplicates function to check the repeating rows in this section.

We will use the len function to get the count of duplicates on a specific row or the entire DataFrame.

These functions return the length or count of the total number of duplicate single rows in a dataframe.

count of total non-duplicate single rowsPermalink

len(df.drop_duplicates())

Out: 3

count of total duplicate rows in a columnPermalink

count_dup = len(df['Job']) - len(df['Job'].drop_duplicates())

Out: 12

count of total duplicate rows in the dataframePermalink

count_dup = len(df) - len(df.drop_duplicates())

Out: 12

So, there are a number of ways to count duplicate rows in a Pandas DataFrame.

Each of these methods has its own advantages and disadvantages, so it’s important to understand the different use cases for each one.

With this knowledge, you should be able to count duplicate rows in your own Pandas DataFrames with ease.

Share on

Twitter Facebook LinkedIn

pandas count duplicate rows

Create DataframePermalink

Count DuplicatesPermalink

pivot_tablePermalink

pivot_table - index valuePermalink

pivot_table - multiple indexPermalink

pivot_table - multiple columnPermalink

groupbyPermalink

groupby - columnPermalink

groupby - multiple columnPermalink

groupby - reset_indexPermalink

groupby - reset_index by columnPermalink

value_countsPermalink

value_counts - column as indexPermalink

value_counts - column as valuePermalink

value_counts - multiple columnPermalink

value_counts - all columnsPermalink

lambdaPermalink

lambda - columnPermalink

lambda - multiple columnPermalink

lambda - total count of duplicatesPermalink

lambda - total count of duplicatesPermalink

duplicatedPermalink

duplicated - sumPermalink

duplicated with value_countsPermalink

len with drop_duplicatesPermalink

count of total non-duplicate single rowsPermalink

count of total duplicate rows in a columnPermalink

count of total duplicate rows in the dataframePermalink

Share on

You may also enjoy

Pandas value error while merging two dataframes with different data types

How to get True Positive, False Positive, True Negative and False Negative from confusion matrix in scikit learn

Pandas how to use list of values to select rows from a dataframe

How to make two plots side by side and create different size subplots in python using matplotlib