7 minute read

DataFrames are a powerful tool for working with data in Python, and Pandas provides a number of ways to count duplicate rows in a DataFrame.

In this article, we’ll explore a few of the most common methods and highlight their advantages and disadvantages.

We will be using the following functions

  • pivot_table()
  • groupby()
  • value_counts()
  • lambda
  • duplicated()
  • len() with drop_duplicates()

We’ll also include some code examples to help you get started with counting duplicate rows in your own Pandas DataFrame.

In this blog post, we will discuss different ways to count duplicate rows in a pandas dataframe with a sample dataframe with 5 columns and 15 rows.

Create Dataframe

Let’s start by creating the test dataframe with duplicate rows

# importing required libraries
import pandas as pd

# creating dataframe
df = pd.DataFrame(
    {
        'Name': ['John', 'Mary', 'John',
                 'John', 'John', 'Mary',
                 'Mary', 'John', 'John',
                 'John', 'Mary', 'John',
                 'John', 'Mary', 'John'],
        'Age': [30, 23, 30, 30, 30, 23, 23,
                30, 30, 30, 23, 30, 30, 23,
                30],
        'Gender': ['M', 'F', 'M', 'M', 'M',
                   'F', 'F', 'M', 'M', 'M',
                   'F', 'M', 'M', 'F', 'M'],
        'Job': ['Doctor', 'Teacher', 'Student',
                'Doctor', 'Student', 'Teacher',
                'Teacher', 'Doctor', 'Student',
                'Doctor', 'Teacher', 'Student',
                'Doctor', 'Teacher', 'Student'],
        'Salary': [100000, 50000, 30000,
                   100000, 30000, 50000,
                   50000, 100000, 30000,
                   100000, 50000, 30000,
                   100000, 50000, 30000]
    })

# displaying the dataframe
df

Out:

  Name Age Gender Job Salary
0 John 30 M Doctor 100000
1 Mary 23 F Teacher 50000
2 John 30 M Student 30000
3 John 30 M Doctor 100000
4 John 30 M Student 30000
5 Mary 23 F Teacher 50000
6 Mary 23 F Teacher 50000
7 John 30 M Doctor 100000
8 John 30 M Student 30000
9 John 30 M Doctor 100000
10 Mary 23 F Teacher 50000
11 John 30 M Student 30000
12 John 30 M Doctor 100000
13 Mary 23 F Teacher 50000
14 John 30 M Student 30000

Now that we have our DataFrame created, let’s look at some of the ways we can count duplicate rows.

Count Duplicates

pivot_table

We’ll discuss is the pivot_table function in this section.

This function takes in the DataFrame and the columns you want to group by, and then counts the number of duplicates in each group.

To use the pivot_table function, we’ll pass in the DataFrame, the columns we want to group by, and the values we want to count.

The pivot_table function returns a DataFrame with the counts of duplicates for each group.

pivot_table - index value

count_dup = df.pivot_table(index = ['Name'],
                           aggfunc = 'size')
Name  
John 10
Mary 5

pivot_table - multiple index

count_dup = df.pivot_table(index = ['Name', 'Job'],
                           aggfunc = 'size')
Name Job  
John Doctor 5
John Student 5
Mary Teacher 5

pivot_table - multiple column

count_dup = df.pivot_table(columns = ['Name', 'Age',
                                      'Gender', 'Job', 'Salary'],
                           aggfunc = 'size')
Name Age Gender Job Salary  
John 30 M Doctor 100000 5
      Student 30000 5
Mary 23 F Teacher 50000 5

groupby

We’ll discuss is the groupby function in this section.

This function takes in the DataFrame and the columns you want to group by and returns a DataFrame with the counts of duplicates in each group.

To use the groupby function, we’ll pass in the DataFrame, the columns we want to group by, and the values we want to count.

The groupby function returns a DataFrame with the counts of duplicates for each group.

groupby - column

count_dup = df.groupby('Name').size()
Name  
John 10
Mary 5

groupby - multiple column

count_dup = df.groupby(['Name', 'Age', 'Gender', 'Job', 'Salary']).size()
Name Age Gender Job Salary  
John 30 M Doctor 100000 5
      Student 30000 5
Mary 23 F Teacher 50000 5

groupby - reset_index

count_dup = (
              df.groupby(df.columns.tolist())
              .size()
              .reset_index()
              .rename(columns = {0:'count'})
            )
  Name Age Gender Job Salary count
0 John 30 M Doctor 100000 5
1 John 30 M Student 30000 5
2 Mary 23 F Teacher 50000 5

groupby - reset_index by column

count_dup = (
             df.groupby('Job')
             .size()
             .reset_index()
             .rename(columns = {0:'count'})
            )
  Job count
0 Doctor 5
1 Student 5
2 Teacher 5

value_counts

We’ll discuss is the value_counts function in this section.

This function takes in the DataFrame and the columns you want to count and returns a Series with the counts of unique values in each column.

To use the value_counts function, we’ll pass in the DataFrame and the columns we want to count.

value_counts - column as index

count_dup = df['Name'].value_counts().reset_index(name = 'count')
  index count
0 John 10
1 Mary 5

value_counts - column as value

count_dup = df['Job'].value_counts()
counts = df[['Job']].apply(pd.value_counts)

Both give the same output.

Doctor 5
Teacher 5
Student 5

value_counts - multiple column

count_dup = df[['Name', 'Age']].value_counts().reset_index(name = 'count')
  Name Age count
0 John 30 10
1 Mary 23 5

value_counts - all columns

count_dup = df.value_counts().reset_index(name = 'count')
  Name Age Gender Job Salary count
0 John 30 M Doctor 100000 5
1 John 30 M Student 30000 5
2 Mary 23 F Teacher 50000 5

lambda

We’ll discuss is the lambda function in this section.

This function takes in the DataFrame, the columns you want to group by, and then returns the number of unique rows in each group.

To use the lambda function, we’ll pass in the DataFrame, the columns we want to group by, and then use the len() function to count the number of unique rows.

The lambda function returns a Series with the number of unique rows in each group.

lambda - column

count_dup = df.groupby('Name').apply(lambda x: len(x.dropna()))

count_dup = count_dup.to_frame(name = 'count').reset_index()
  Name count
0 John 10
1 Mary 5

lambda - multiple column

count_dup = df.groupby(['Name', 'Job']).apply(lambda x: len(x.dropna()))

count_dup = count_dup.to_frame(name = 'count').reset_index()
  Name Job count
0 John Doctor 5
1 John Student 5
2 Mary Teacher 5

lambda - total count of duplicates

This counts duplicates across the dataframe. It returns the value and count of how many times the duplicates appears.

count_dup = df.apply(lambda x: ' '
                     .join([f'[value = {i}, count = {v}]'
                            for i, v in x.value_counts().iteritems() if v > 1])
                    )
Name [value = John, count = 10] [value = Mary, coun…
Age [value = 30, count = 10] [value = 23, count = 5]
Gender [value = M, count = 10] [value = F, count = 5]
Job [value = Doctor, count = 5] [value = Teacher, …
Salary [value = 100000, count = 5] [value = 50000, co…

lambda - total count of duplicates

This gives similar result as to the above code but here we import a library to show the output. It shows how many times the duplicate values appear.

from collections import Counter

def count(x):
    return {v:c for v, c in x.items() if c > 1}

count_dup = df.apply(lambda x : count(Counter(x)))
Name {‘John’: 10, ‘Mary’: 5}
Age {30: 10, 23: 5}
Gender {‘M’: 10, ‘F’: 5}
Job {‘Doctor’: 5, ‘Teacher’: 5, ‘Student’: 5}
Salary {100000: 5, 50000: 5, 30000: 5}

duplicated

We’ll discuss is the duplicated function in this section.

This function takes in the DataFrame and returns a Boolean array of the duplicate rows.

To use the duplicated function, we’ll pass in the DataFrame and check for duplicates.

By default, for each set of duplicated values, the first occurrence is set on False and all others on True.

duplicated - sum

count_dup = df.duplicated().sum()

count_dup.head()

This outputs the total number of duplicate rows in the dataframe.

Out: 12

duplicated with value_counts

count_dup = df.duplicated().value_counts()

Here, True are the duplicate rows, and False are the unique rows i.e. non-duplicate.

True 12
False 3

len with drop_duplicates

We’ll discuss is the len function with drop_duplicates function to check the repeating rows in this section.

We will use the len function to get the count of duplicates on a specific row or the entire DataFrame.

These functions return the length or count of the total number of duplicate single rows in a dataframe.

count of total non-duplicate single rows

len(df.drop_duplicates())

Out: 3

count of total duplicate rows in a column

count_dup = len(df['Job']) - len(df['Job'].drop_duplicates())

Out: 12

count of total duplicate rows in the dataframe

count_dup = len(df) - len(df.drop_duplicates())

Out: 12

So, there are a number of ways to count duplicate rows in a Pandas DataFrame.

Each of these methods has its own advantages and disadvantages, so it’s important to understand the different use cases for each one.

With this knowledge, you should be able to count duplicate rows in your own Pandas DataFrames with ease.

Tags: ,

Categories: ,

Updated: