How to calculate correlation between all columns and remove highly correlated ones using pandas?

Hello everyone. Today we are going to learn how to calculate the correlation between all columns of a pandas DataFrame and remove the highly correlated ones in Python. Two methods are explained below.

Let's get started.

  In short, for a given data frame df, the following finds the positions where the absolute correlation exceeds 0.8:

    corr_matrix = df.corr().abs()
    high_corr_var = np.where(corr_matrix > 0.8)

Method 1

Here is the approach I have used:

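def correlation(dataset, threshold):
    """Delete, in place, each column whose absolute correlation with an
    earlier column that was kept is at or above the threshold."""
    col_corr = set()  # names of the deleted columns
    # absolute values, so that strong negative correlations count as well
    corr_matrix = dataset.corr().abs()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            # compare column i only with earlier columns that were not deleted
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i]  # name of the column to delete
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname]  # delete the column from the dataset

    # return the reduced dataset so the caller can print or reuse it
    return dataset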
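
To see the pairwise rule of Method 1 in action, here is a minimal sketch on a small made-up DataFrame. The names drop_correlated and toy are hypothetical and used only for illustration. The sketch assumes that absolute correlations should be compared and that the input should stay unchanged, so it works on a copy instead of deleting columns in place.

```python
import numpy as np
import pandas as pd


def drop_correlated(frame, threshold):
    """Return (reduced_copy, dropped_names); `frame` itself is not modified.

    Hypothetical helper: a column is dropped when its absolute correlation
    with an earlier column that was kept is at or above `threshold`.
    """
    corr = frame.corr().abs()
    dropped = set()
    for i in range(len(corr.columns)):
        for j in range(i):
            if corr.iloc[i, j] >= threshold and corr.columns[j] not in dropped:
                dropped.add(corr.columns[i])
                break  # column i is already marked, no need to check further
    return frame.drop(columns=sorted(dropped)), dropped


# Made-up data: "b" is almost a multiple of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

reduced, dropped = drop_correlated(toy, 0.8)
print(sorted(dropped))
print(list(reduced.columns))
```

Because the first column of each correlated pair is the one that is kept, the result depends on the column order of the input.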

Method 2

You can use the following for a given data frame df:

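import numpy as np

corr_matrix = df.corr().abs()
# positions (row, column) where the absolute correlation exceeds 0.8
high_corr_var = np.where(corr_matrix > 0.8)
# keep each pair once (x < y also skips the diagonal) and map positions to column names
high_corr_var = [(corr_matrix.columns[x], corr_matrix.columns[y])
                 for x, y in zip(*high_corr_var) if x < y]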
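
Method 2 only lists the highly correlated pairs; it does not remove anything. One possible way to go from there to actually dropping columns is to mask the correlation matrix to its upper triangle and drop every column that has a value above the threshold. This is a different technique from the list comprehension above, shown as a sketch only. The function name drop_from_upper_triangle and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_from_upper_triangle(df, threshold=0.8):
    """Hypothetical helper: return (reduced_df, names_of_dropped_columns)."""
    corr_matrix = df.corr().abs()
    # True strictly above the diagonal, so each pair is looked at only once
    # and the self-correlations on the diagonal are ignored.
    mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    upper = corr_matrix.where(mask)  # everything else becomes NaN
    # A column is dropped if it is highly correlated with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Made-up sample: "y" is close to 2 * "x", "z" is unrelated to both.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 6, 8, 10, 12.5],
    "z": [5, 1, 4, 2, 6, 3],
})

reduced, to_drop = drop_from_upper_triangle(df, threshold=0.8)
print(to_drop)
print(list(reduced.columns))
```

As with Method 1, the earlier column of each pair is kept, so reordering the columns can change which ones are removed.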

Conclusion

That is all for this topic. I hope these methods helped you. Leave your thoughts and questions in the comments, and let us know which method worked for you. Thank you.