How to plot a value_counts in pandas that has a huge number of different counts not distributed evenly

Hello guys, how are you all? Hope you are all fine. Today we are going to learn how to plot a value_counts in pandas that has a huge number of different counts not distributed evenly in Python. Here I explain all the possible methods.

Without wasting your time, let’s start this article.

How to plot a value_counts in pandas that has a huge number of different counts not distributed evenly?

  1. How to plot a value_counts in pandas that has a huge number of different counts not distributed evenly?

    You could keep the normalized value counts above a certain threshold. Then sum together the values below the threshold and clump them together in one category which could be called, say, “other”.

Method 1

You could keep the normalized value counts above a certain threshold. Then sum together the values below the threshold and clump them together in one category which could be called, say, “other”.

By choosing the threshold high enough, you will be able to display the most important contributors to the overall probability distribution, while still showing the size of the tail in the bar labeled “other”:

import matplotlib.pyplot as plt
import pandas as pd

s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
# normalized value counts (probabilities rather than raw counts)
prob = s2.value_counts(normalize=True)
threshold = 0.02
mask = prob > threshold
# total probability of everything below the threshold
tail_prob = prob.loc[~mask].sum()
prob = prob.loc[mask]
# add the tail back as a single "other" category
prob['other'] = tail_prob
prob.plot(kind='bar')
plt.xticks(rotation=25)
plt.show()
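
If you need this for more than one series, the same steps can be wrapped in a small helper. This is only a sketch: the function name plot_counts_with_other and its default threshold are made up for this article, and it assumes the s2 Series from the snippet above.

import matplotlib.pyplot as plt

def plot_counts_with_other(series, threshold=0.02):
    # normalized value counts, with everything below the threshold collapsed into "other"
    prob = series.value_counts(normalize=True)
    mask = prob > threshold
    tail_prob = prob.loc[~mask].sum()
    prob = prob.loc[mask]
    prob['other'] = tail_prob
    return prob.plot(kind='bar')

plot_counts_with_other(s2, threshold=0.02)
plt.xticks(rotation=25)
plt.show()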

There is a limit to the number of category labels you can sensibly display on a bar graph. For a normal-sized graph, 3000 is way too many, and it is probably not reasonable to expect an audience to glean any meaning from reading 3000 labels.

The graph should summarize the data, and the main point seems to be that 4 or 5% of the categories constitute the vast majority of the cases. To drive that point home, perhaps use pd.qcut to group the cases into simple categories such as bottom 25%, mid 70%, and top 5%:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# synthetic example: 18000 categories, with about 4% of them carrying most of the weight
N = 18000
categories = np.arange(N)
np.random.shuffle(categories)
M = int(N*0.04)
prob = pd.Series(np.concatenate([np.random.randint(9000, 11000, size=M),
                                 np.random.randint(0, 100, size=N-M)]), index=categories)
prob /= prob.sum()
# bucket the categories by their share of the total probability
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
                           labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()

Method 2

Just use a log scale on the axis (this example has no pandas, but the idea should be similar):

import numpy as np
import matplotlib.pyplot as plt

s2 = np.array([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
plt.plot(s2)
plt.yscale('symlog')  # log-scale the axis; 'symlog' also copes with the zero entries, unlike np.log
plt.show()
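
If you do have pandas, the same idea of a log-scaled axis can be applied directly to the value counts: logy=True is a standard option of the pandas plot method. This sketch assumes the s2 pandas Series from Method 1:

import matplotlib.pyplot as plt

# bar chart of the value counts with a logarithmic count axis
s2.value_counts().plot(kind='bar', logy=True)
plt.xticks(rotation=25)
plt.show()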

Conclusion

That’s all about this issue. I hope these methods helped you. Comment below with your thoughts and any queries, and let me know which method worked for you. Thank you.
