How to “select distinct” across multiple data frame columns in pandas?

This article shows how to “select distinct” across multiple data frame columns in pandas, that is, how to get the unique combinations of values in two or more columns. The methods below cover the common approaches.

Method 1

You can use the drop_duplicates method to get the unique rows in a DataFrame:

In [28]: import pandas as pd

In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5

You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness.
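As a minimal sketch of the subset argument, the example below reuses the small DataFrame from above and treats only column a as the uniqueness key. The variable names by_a and by_a_last are illustrative, and keep='last' is shown only to make clear which row survives.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [3, 4, 3, 5]})

# Only column 'a' decides uniqueness; by default the first row
# seen for each value of 'a' is kept.
by_a = df.drop_duplicates(subset=['a'])

# keep='last' keeps the last row for each value of 'a' instead.
by_a_last = df.drop_duplicates(subset=['a'], keep='last')

print(by_a)
print(by_a_last)
```

Note that all columns are still returned; subset only controls which columns are compared.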

Method 2

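I tried several solutions. The first was:

a_df=np.unique(df[['col1','col2']], axis=0)

This works well for non-object data (for example, purely numeric columns), but note that it returns a NumPy array rather than a DataFrame. Another way to do this, which avoids the error raised for object-type columns, is to apply drop_duplicates():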

a_df=df.drop_duplicates(['col1','col2'])[['col1','col2']]
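The two approaches can be compared side by side. This is a rough sketch under assumed data: the column names col1, col2 and col3 and their values are made up for illustration, with col1 holding strings so that it is an object-type column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['x', 'y', 'x', 'y'],        # string (object-like) column
    'col2': [1, 2, 1, 3],
    'col3': [10.0, 20.0, 30.0, 40.0],    # not part of the distinct key
})

# drop_duplicates handles the string column; selecting the two
# columns first gives the same rows as the one-liner above.
a_df = df[['col1', 'col2']].drop_duplicates().reset_index(drop=True)

# np.unique with axis=0 is fine for purely numeric columns, but it
# returns a sorted ndarray, so wrap it to get a DataFrame back.
num_cols = ['col2', 'col3']
unique_rows = np.unique(df[num_cols], axis=0)
num_df = pd.DataFrame(unique_rows, columns=num_cols)

print(a_df)
print(num_df)
```

reset_index(drop=True) is optional; without it the result keeps the index labels of the first occurrence of each row.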

You can also use SQL to do this through the pandasql package, but it was very slow in my case:

from pandasql import sqldf
q="""SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)

Conclusion

That covers this question. Hopefully one of the methods above helped. Leave your thoughts and questions in the comments below, and let us know which method worked for you.