3 minute read

In this post we will see how to select specific rows in each group of pandas groupby object. There are certain cases where we are interested to select only the first, last, max, min or nth row in each group.

For example, In a dataset we would be interested to know what’s my maximum sales in each Area under a Region? or what are the top n Areas where Sale is highest? or what are my bottom n low performing Areas in terms of volume?

so let’s get started, First we will create a dataframe to work upon:

import pandas as pd

df=pd.DataFrame({'Region':['nordic','nordic','nordic','nordic','nordic','mid-east','mid-east',
                        'mid-east','mid-east','mid-east','south', 'south','south'],
                 'Area':[1,1,1,2,2,1,2,1,3,3,1,2,3],'Sales':[10,15,21,36,20,42,36,52,18,25,13,35,23]})
df

There are three columns in this dataframe Region, Area and Sales.

Output:

Region Area Sales
nordic 1 10
nordic 1 15
nordic 1 21
nordic 2 36
nordic 2 20
mid-east 1 42
mid-east 2 36
mid-east 1 52
mid-east 3 18
mid-east 3 25
south 1 13
south 2 35
south 3 23

Select rows with max or min value of sale sum in each Area under a Region

In this section we are interested to find the max, min rows from each group and we will first groupby Region, Area and sum up the sales

df_agg=df.groupby(['Region','Area']).agg({'Sales':sum})
df_agg

So you can see the below dataframe grouped by Region and Area and the total sum for each Area under a Region

Region Area Sales
mid-east 1 94
  2 36
  3 43
nordic 1 46
  2 56
south 1 13
  2 35
  3 23

There are couple of ways to find out the maximum and minimum of total sales in each Area, we will first groupby Region and sort the rows in each group in ascending order and get the last row(tail(1)) which will be maximum and the first row(head(1)) will be minimum

g=df_agg['Sales'].groupby('Region', group_keys=False)
g.apply(lambda x: x.sort_values(ascending=True).tail(1))

you can see the maximum sales is returned in each row by using tail(1)

Region Area  
mid-east 1 94
nordic 2 56
south 1 35
g=df_agg['Sales'].groupby('Region', group_keys=False)
g.apply(lambda x: x.sort_values(ascending=True).head(1))

And the minimum sales from each group is returned using head(1)

Region Area  
mid-east 2 36
nordic 1 46
south 1 13

Alternate way to find first, last and min,max rows in each group

Pandas has first, last, max and min functions that returns the first, last, max and min rows from each group

For computing the first row in each group just groupby Region and call first() function as shown below

df_agg=df.groupby(['Region','Area']).agg({'Sales':sum})
g=df_agg.groupby('Region', group_keys=False)['Sales'].first()

Output:

Region Area  
mid-east 1 94
nordic 2 46
south 1 13

Note: this is the first row in each group and not the max sales value

Similarly for the last row, you can use the last() function instead of first()

Now to return the max and min rows in each group you can just replace the first function above with the max or min as shown here

df_agg=df.groupby(['Region','Area']).agg({'Sales':sum})
g=df_agg.groupby('Region', group_keys=False)['Sales'].max()

Output:

Region Area  
mid-east 1 94
nordic 2 56
south 1 35

How to find nth row in each group

Pandas has an in-built function nth that take the nth row from each group if n is an integer, otherwise a subset of rows

Alright, let’s find get the first and last row from our group by object above

# both first and last row
df_agg=df_agg.reset_index()
df_agg.groupby('Region', group_keys=False)['Sales'].nth([0,-1])

You can see only the first and last row is returned from each group

Output:

Region Area  
mid-east 1 94
mid-east 3 43
nordic 1 46
nordic 2 56
south 1 13
south 3 23

You can also get the subset of rows which will be useful when you are looking for top 5 or 3 rows in each group, you can sort the values in each group and then slice it.

Here is an example to get a subset of rows, in this case we want the bottom 2 Sales from each group

g=df_agg.groupby('Region', group_keys=False)
g.apply(lambda x: x.sort_values('Sales',ascending=True)).groupby('Region', group_keys=False).nth[:2]

Output:

Region Area Sales
mid-east 2 36
mid-east 3 43
nordic 1 46
nordic 2 56
south 1 13
south 3 23

Tags: ,

Categories: ,

Updated: