How to count unique records by two columns in pandas?

Hello everyone. Today we are going to learn how to count unique records by two columns in pandas. This post covers the available methods and explains how they differ in speed and in the way they treat NaNs.

Method 1

By using ngroups

df.groupby(['col_a', 'col_b']).ngroups
Out: 6

Or using set

len(set(zip(df['col_a'],df['col_b'])))
Out: 6

Method 2

You can select col_a and col_b, drop the duplicates, then check the shape/len of the result data frame:

df[['col_a', 'col_b']].drop_duplicates().shape
# (6, 2), so 6 unique rows

len(df[['col_a', 'col_b']].drop_duplicates())
# 6
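
As a self-contained sketch of both methods, the snippet below builds a small, hypothetical data frame without NaNs and counts the distinct (col_a, col_b) pairs three ways. The sample values are assumptions chosen for illustration; they are not the data behind the outputs shown above.

```python
import pandas as pd

# Hypothetical sample data: 5 rows, 4 distinct (col_a, col_b) pairs.
df = pd.DataFrame({
    "col_a": [1, 1, 2, 2, 3],
    "col_b": ["x", "x", "x", "y", "y"],
})

# Method 1: number of groups formed by the two columns.
n_groups = df.groupby(["col_a", "col_b"]).ngroups

# Method 1 (variant): build (col_a, col_b) tuples and deduplicate with a set.
n_set = len(set(zip(df["col_a"], df["col_b"])))

# Method 2: drop duplicate rows over the two columns, then count what is left.
n_dedup = len(df[["col_a", "col_b"]].drop_duplicates())

print(n_groups, n_set, n_dedup)  # 4 4 4
```

With no NaNs in either column, all three approaches agree.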

Because groupby ignores NaNs by default and may run a sort that is not needed, choose the method according to whether your columns contain NaNs.

Consider the following data frame:

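import numpy as np
import pandas as pd

# pd.np was removed from pandas; use numpy directly for NaN.
df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan]
})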

print(df)

#   col_a  col_b
#0    1.0    2.0
#1    2.0    2.0
#2    2.0    3.0
#3    NaN    NaN
#4    1.0    2.0
#5    4.0    NaN

Timing:

df = pd.concat([df] * 1000)

%timeit df.groupby(['col_a', 'col_b']).ngroups
# 1000 loops, best of 3: 625 µs per loop

%timeit len(df[['col_a', 'col_b']].drop_duplicates())
# 1000 loops, best of 3: 1.02 ms per loop

%timeit df[['col_a', 'col_b']].drop_duplicates().shape
# 1000 loops, best of 3: 1.01 ms per loop

%timeit len(set(zip(df['col_a'],df['col_b'])))
# 10 loops, best of 3: 56 ms per loop

%timeit len(df.groupby(['col_a', 'col_b']))
# 1 loop, best of 3: 260 ms per loop
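
The %timeit lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard timeit module, as below. The dictionary labels and the variable name big are illustrative assumptions, and the absolute numbers will depend on hardware and on the pandas version, so they will not match the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col_a": [1, 2, 2, np.nan, 1, 4],
    "col_b": [2, 2, 3, np.nan, 2, np.nan],
})
big = pd.concat([df] * 1000)  # 6000 rows

# Candidate expressions, keyed by a short illustrative label.
candidates = {
    "ngroups": lambda: big.groupby(["col_a", "col_b"]).ngroups,
    "drop_duplicates": lambda: len(big[["col_a", "col_b"]].drop_duplicates()),
    "set(zip)": lambda: len(set(zip(big["col_a"], big["col_b"]))),
}

# Average seconds per call over 20 calls of each candidate.
runs = 20
timings = {
    name: timeit.timeit(fn, number=runs) / runs
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```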

Result:

df.groupby(['col_a', 'col_b']).ngroups
# 3

len(df[['col_a', 'col_b']].drop_duplicates())
# 5

df[['col_a', 'col_b']].drop_duplicates().shape
# (5, 2), so 5 unique rows

len(set(zip(df['col_a'],df['col_b'])))
# 2003

len(df.groupby(['col_a', 'col_b']))
# 2003

So the differences are:

Option 1:

df.groupby(['col_a', 'col_b']).ngroups

This is fast, but by default it excludes rows in which either column is NaN.

Option 2 & 3:

len(df[['col_a', 'col_b']].drop_duplicates())
df[['col_a', 'col_b']].drop_duplicates().shape

These are reasonably fast and treat NaN as an ordinary value, so identical rows that contain NaNs are counted once.

Option 4 & 5:

len(set(zip(df['col_a'],df['col_b'])))
len(df.groupby(['col_a', 'col_b']))

These are slow and, in the results above, follow the logic that numpy.nan == numpy.nan is False, so every row that contains a NaN is counted as a different record.
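
The NaN behaviour described above can be checked on a smaller frame. The sketch below repeats the sample data three times instead of 1000 (an assumption made to keep the numbers easy to verify by hand). It also shows groupby's dropna=False argument, available in pandas 1.1 and later, which is not part of the comparison above but is one way to make ngroups keep NaN keys.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col_a": [1, 2, 2, np.nan, 1, 4],
    "col_b": [2, 2, 3, np.nan, 2, np.nan],
})
df3 = pd.concat([df] * 3)  # 18 rows: each original row appears 3 times
cols = ["col_a", "col_b"]

# Default groupby drops every group whose key contains NaN.
n_default = df3.groupby(cols).ngroups  # 3

# dropna=False keeps NaN keys as groups of their own.
n_keep_nan = df3.groupby(cols, dropna=False).ngroups  # 5

# drop_duplicates treats NaN as equal to NaN.
n_dedup = len(df3[cols].drop_duplicates())  # 5

# Each NaN pulled out of a Series is a separate float object and nan != nan,
# so every NaN row forms its own tuple: 3 NaN-free pairs + 2 NaN rows * 3 copies.
n_set = len(set(zip(df3["col_a"], df3["col_b"])))  # 9

print(n_default, n_keep_nan, n_dedup, n_set)
```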

Summary

That is all for this topic. Hopefully one of these methods helped you. Leave your thoughts and questions in the comments below, and let us know which method worked for you. Thank you.