
Labeling a Bar Graph Created from a Grouped Pandas DataFrame where there's a NaN Category

I create a nice and tidy grouped data frame and then I use that data in a simple seaborn barplot. However, when I try to add labels to the bars, I get the following error:

ValueError: cannot convert float NaN to integer

I know this happens because one of the grouped categories has only one value (instead of two). How do I get the missing bar labeled "0"?

I've gone down the rabbit hole on this for a full day and haven't found anything. Here are the things that I've tried (in many different ways):

  • Inserting a row into the grouped dataframe.
  • Using fillna().
  • Creating a function to apply within the labeling clause.

I work with a lot of data that frequently encounters this sort of problem, so I would really appreciate some help in solving this. It seems so simple. What am I missing? Thanks!

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# my initial data set 
d = {'year' : [2014,2014,2014,2015,2015,],
     'status' : ["n","y","n","n","n"],
     'num' : [1,1,1,1,1]}
df = pd.DataFrame(d)

# groupby to create another dataframe
df2 = (df["status"]
    .groupby(df["year"])
    .value_counts(normalize=True)
    .rename("Percent")
    .apply(lambda x: x*100)
    .reset_index())

# create my bar plot
f = plt.figure(figsize = (11,8.5))

ax1 = plt.subplot(2,2,1)
sns.barplot(x="year",
           y="Percent",
           hue="status",
           hue_order = ["n","y"],
           data=df2,
           ci = None)  # seaborn >= 0.12 deprecates ci; use errorbar=None there

# label the bars
for p in ax1.patches:
    ax1.text(p.get_x() + p.get_width()/2., p.get_height(), '%d%%' % round(p.get_height()), 
        fontsize=10, color='red', ha='center', va='bottom')

plt.show()

You could handle the empty-bar case by setting the height to zero if p.get_height() returns NaN:

import numpy as np

for p in ax1.patches:
    height = p.get_height()
    # a missing year/status combination yields a bar whose height is NaN
    if np.isnan(height):
        height = 0
    ax1.text(p.get_x() + p.get_width()/2., height, '%d%%' % round(height),
        fontsize=10, color='red', ha='center', va='bottom')

gives me

[image: example plot showing the 0% label]
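If you label bars in several plots, the NaN check can live in a small helper. This is only a sketch: percent_label is a made-up name, not part of seaborn or matplotlib, and it uses the standard library's math.isnan instead of NumPy. It works around the fact that round() on a NaN float is what raises "cannot convert float NaN to integer".

```python
import math

def percent_label(height):
    """Format a bar height as a whole-percent label, treating NaN as 0.

    Hypothetical helper for illustration; not a seaborn/matplotlib API.
    """
    if math.isnan(height):
        height = 0
    return '%d%%' % round(height)

print(percent_label(66.666667))     # 67%
print(percent_label(float('nan')))  # 0%
```

In the labeling loop you would then pass percent_label(p.get_height()) as the text, while still replacing a NaN height with 0 for the y position.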

Alternatively, you could expand your frame to ensure there's a zero there:

non_data_cols = df2.columns.drop("Percent")
full_index = pd.MultiIndex.from_product([df[col].unique() for col in non_data_cols], names=non_data_cols)
df2 = df2.set_index(non_data_cols.tolist()).reindex(full_index).fillna(0).reset_index()

which expands to give me

In [74]: df2
Out[74]: 
   year status     Percent
0  2014      n   66.666667
1  2014      y   33.333333
2  2015      n  100.000000
3  2015      y    0.000000
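Since this comes up a lot, the same reindex idea can be wrapped in a function. A minimal sketch, assuming every column other than the value column is a key column; fill_missing_combinations and the small grouped frame are made up for illustration.

```python
import pandas as pd

def fill_missing_combinations(frame, value_col, fill_value=0):
    """Return a frame with one row per combination of the key columns.

    Hypothetical helper: all columns except value_col are treated as keys.
    """
    key_cols = [c for c in frame.columns if c != value_col]
    # every combination of the values observed in each key column
    full_index = pd.MultiIndex.from_product(
        [frame[c].unique() for c in key_cols], names=key_cols
    )
    return (frame.set_index(key_cols)
                 .reindex(full_index, fill_value=fill_value)
                 .reset_index())

# illustrative data with the (2015, "y") combination missing
grouped = pd.DataFrame({"year": [2014, 2014, 2015],
                        "status": ["n", "y", "n"],
                        "Percent": [66.7, 33.3, 100.0]})
filled = fill_missing_combinations(grouped, "Percent")
print(filled)
```

Note that this only fills combinations of values that appear somewhere in the frame; a status that never occurs at all would still be absent.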

When your data has missing categories, a common trick is to unstack the data and then stack it again. Unstacking puts each status in its own column, so a missing year/status combination shows up as NaN, and stacking with dropna=False keeps that row. The general idea can be viewed in this answer. Once the data is in that shape, you can fillna with your fill value (in this case 0) and leave your plotting code as is.

All you have to do is replace your current creation of df2 with the code below.

df2 = (df.groupby('year').status.value_counts(normalize=True).mul(100)
          .unstack().stack(dropna=False).fillna(0)
          .rename('Percent').reset_index())

Which gives us:

   year status     Percent
0  2014      n   66.666667
1  2014      y   33.333333
2  2015      n  100.000000
3  2015      y    0.000000

Now, with no changes to your plotting code, I get this output:

[image: resulting bar plot]
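A different route to the same frame, not used in the answers above, is pd.crosstab, which builds the full year-by-status table and puts 0 where a combination never occurs. A rough sketch with the question's data; I'm assuming normalize="index" (row-wise proportions) is what you want here, since it matches the per-year percentages.

```python
import pandas as pd

df = pd.DataFrame({'year': [2014, 2014, 2014, 2015, 2015],
                   'status': ["n", "y", "n", "n", "n"],
                   'num': [1, 1, 1, 1, 1]})

# row-normalized counts: each year's statuses sum to 100
df2 = (pd.crosstab(df["year"], df["status"], normalize="index")
         .mul(100)
         .stack()             # back to long form: one row per year/status
         .rename("Percent")
         .reset_index())
print(df2)
```

The result has the same year, status and Percent columns, so the plotting and labeling code should not need to change.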

