
Labeling a Bar Graph Created from a Grouped Pandas DataFrame where there's a NaN Category

I create a nice and tidy grouped data frame and then use that data in a simple seaborn barplot. However, when I try to add labels to the bars, I get the following error:

ValueError: cannot convert float NaN to integer

I know this is because one of the grouped categories has only one value (instead of two). How do I get the missing bar labeled "0"?

I've gone down the rabbit hole on this for a full day and haven't found anything. Here are the things that I've tried (in many different ways):

  • Inserting a row into the grouped dataframe.
  • Using fillna().
  • Creating a function to apply within the labeling clause.

I work with a lot of data that frequently runs into this sort of problem, so I would really appreciate some help in solving this. It seems so simple. What am I missing? Thanks!

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# my initial data set 
d = {'year' : [2014,2014,2014,2015,2015,],
     'status' : ["n","y","n","n","n"],
     'num' : [1,1,1,1,1]}
df = pd.DataFrame(d)

# groupby to create another dataframe
df2 = (df["status"]
    .groupby(df["year"])
    .value_counts(normalize=True)
    .rename("Percent")
    .apply(lambda x: x*100)
    .reset_index())

# create my bar plot
f = plt.figure(figsize = (11,8.5))

ax1 = plt.subplot(2,2,1)
sns.barplot(x="year",
           y="Percent",
           hue="status",
           hue_order = ["n","y"],
           data=df2,
           ci = None)

# label the bars
for p in ax1.patches:
    ax1.text(p.get_x() + p.get_width()/2., p.get_height(), '%d%%' % round(p.get_height()), 
        fontsize=10, color='red', ha='center', va='bottom')

plt.show()

You could handle the empty-bar case by setting the height to zero if p.get_height() returns NaN:

import numpy as np

for p in ax1.patches:
    height = p.get_height()
    # a bar with no data for its year/status pair reports a NaN height
    if np.isnan(height):
        height = 0
    ax1.text(p.get_x() + p.get_width()/2., height, '%d%%' % round(height),
        fontsize=10, color='red', ha='center', va='bottom')

which gives me a plot where the empty bar is labeled 0%.
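Since the question mentions running into this often, the NaN check can be pulled out into a small formatter. This is only a sketch: the name bar_label is made up for illustration and is not part of matplotlib or seaborn.

```python
import math

def bar_label(height):
    """Format a bar height as a whole-number percent, treating NaN as 0.

    Hypothetical helper for illustration only.
    """
    if math.isnan(height):
        height = 0
    return "%d%%" % round(height)

print(bar_label(66.666667))
print(bar_label(float("nan")))
```

Inside the labeling loop, the text argument would then become bar_label(p.get_height()), with the y position still replaced by 0 when the height is NaN.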

Alternatively, you could expand your frame to ensure there's a zero there:

non_data_cols = df2.columns.drop("Percent")
full_index = pd.MultiIndex.from_product([df[col].unique() for col in non_data_cols], names=non_data_cols)
df2 = df2.set_index(non_data_cols.tolist()).reindex(full_index).fillna(0).reset_index()

which expands to give me

In [74]: df2
Out[74]: 
   year status     Percent
0  2014      n   66.666667
1  2014      y   33.333333
2  2015      n  100.000000
3  2015      y    0.000000
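If this comes up repeatedly, the reindexing step could be wrapped in a function. The sketch below is one possible shape, not the answer's exact code: the name fill_missing_combinations is hypothetical, it takes the unique values from the grouped frame itself rather than from the original df, and it passes fill_value to reindex instead of calling fillna afterwards.

```python
import pandas as pd

def fill_missing_combinations(frame, value_col, fill_value=0):
    """Return frame with one row per combination of the non-value columns.

    Hypothetical helper: combinations absent from frame get fill_value.
    """
    key_cols = [c for c in frame.columns if c != value_col]
    full_index = pd.MultiIndex.from_product(
        [frame[c].unique() for c in key_cols], names=key_cols
    )
    return (frame.set_index(key_cols)
                 .reindex(full_index, fill_value=fill_value)
                 .reset_index())

# grouped frame as in the question: there is no row for 2015 / "y"
df2 = pd.DataFrame({
    "year": [2014, 2014, 2015],
    "status": ["n", "y", "n"],
    "Percent": [200 / 3, 100 / 3, 100.0],
})

filled = fill_missing_combinations(df2, "Percent")
print(filled)
```

Note that a category missing from every group never appears in the product, so this only fills gaps for values that occur at least once.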

When dealing with data that has missing categories, a common trick is to unstack and then re-stack the data. The general idea can be viewed in this answer. Once the data is reshaped, you can fillna with your fill value (in this case 0) and leave your plotting code as is.

All you have to do is replace your current creation of df2 with the code below.


df2 = (df.groupby('year').status.value_counts(normalize=True).mul(100)
          .unstack().stack(dropna=False).fillna(0)
          .rename('Percent').reset_index())

Which gives us:

   year status     Percent
0  2014      n   66.666667
1  2014      y   33.333333
2  2015      n  100.000000
3  2015      y    0.000000
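In case the dropna argument to stack is deprecated or unavailable in your pandas version, a variant of the same idea is to fill the gap while unstacking, using the fill_value parameter of Series.unstack. This is a sketch under that assumption; the names wide and df2_alt are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2014, 2014, 2014, 2015, 2015],
    "status": ["n", "y", "n", "n", "n"],
})

# one row per year, one column per status; missing pairs become 0
wide = (df.groupby("year")["status"]
          .value_counts(normalize=True)
          .mul(100)
          .unstack(fill_value=0))

# back to long form: no NaN is left, so stack has nothing to drop
df2_alt = wide.stack().rename("Percent").reset_index()
print(df2_alt)
```

Because the zeros are already in place before stacking, no fillna call is needed afterwards.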

Now, with no changes to your plotting code, the labels are drawn without the error, and the missing 2015 "y" bar is labeled 0%.
