
Pandas returning empty groups in groupby

I have a Pandas DataFrame with 3 columns: target, pred, and conf_bin. If I run groupby(by='conf_bin').apply(...), my apply function gets called with empty DataFrames for values that do not appear in the conf_bin column. How is this possible?


Details

The DataFrame looks something like this:

        target  pred conf_bin
0            5     6     0.50
1            4     4     0.60
2            4     4     0.50
3            4     3     0.50
4            4     5     0.50
5            5     5     0.55
6            5     5     0.55
7            5     5     0.55

Obviously conf_bin is a numeric bin with values in the range np.arange(0, 1, 0.05). However, not all values are present in the data:

In [224]: grp = tp.groupby(by='conf_bin')

In [225]: grp.groups.keys()
Out[225]: dict_keys([0.5, 0.60000000000000009, 0.35000000000000003, 0.75, 0.85000000000000009, 0.65000000000000002, 0.55000000000000004, 0.80000000000000004, 0.20000000000000001, 0.45000000000000001, 0.40000000000000002, 0.30000000000000004, 0.70000000000000007, 0.25])

So, for example, the values 0 and 0.05 do not appear. However, when I run an apply on the group, my function does get called for these values:

In [226]: grp.apply(lambda x: x.shape)
Out[226]:
conf_bin
0.00        (0, 3)
0.05        (0, 3)
0.10        (0, 3)
0.15        (0, 3)
0.20       (22, 3)
0.25       (75, 3)
0.30       (95, 3)
0.35      (870, 3)
0.40     (8505, 3)
0.45    (40068, 3)
0.50    (51238, 3)
0.55    (54305, 3)
0.60    (47191, 3)
0.65    (38977, 3)
0.70    (34444, 3)
0.75    (20435, 3)
0.80     (3352, 3)
0.85        (4, 3)
0.90        (0, 3)
dtype: object

Questions:

  1. How can Pandas even know that the values 0.0 and 0.05 "make sense", since they don't appear in my DataFrame?
  2. Why is it calling my apply function with empty DataFrame objects for values that do not appear in grp.groups?
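One way to observe this kind of behaviour is a categorical conf_bin column whose declared categories cover more bins than the data contains. The sketch below is only an illustration under that assumption (the question does not state the column's dtype), and the small DataFrame and the name bins are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical bin labels covering the full range, as in np.arange(0, 1, 0.05).
bins = np.round(np.arange(0, 1, 0.05), 2)

df = pd.DataFrame({
    "target": [5, 4, 4, 4],
    "pred": [6, 4, 4, 3],
    # Assumption: conf_bin is categorical and declares all 20 bins as
    # categories, although only 0.5, 0.55 and 0.6 occur in the rows.
    "conf_bin": pd.Categorical(bins[[10, 12, 10, 11]], categories=bins),
})

print(df["conf_bin"].dtype)                  # category
print(len(df["conf_bin"].cat.categories))    # number of declared categories

# With observed=False, every declared category becomes a group,
# including the ones that have no rows.
sizes = df.groupby("conf_bin", observed=False).size()
print(sizes[sizes == 0].index.tolist())      # bins that produce empty groups
```

Checking df["conf_bin"].dtype and df["conf_bin"].cat.categories on the real data would show whether this assumption applies.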

I too was having this problem, which popped up when trying to create subplots for every category in my dataframe.

I came up with the following workaround (based on this SO post) by pulling out the non-empty groups into a list.

groups = df.groupby('conf_bin')
group_list = [(index, group) for index, group in groups if len(group) > 0]

It does break the implicit contract that "you wrangle your data in pandas", and probably mismanages memory, but it works.

Now you can iterate through your groupby list with the same interface as with a groupby object, e.g.

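import matplotlib.pyplot as plt

# squeeze=False keeps axes a 2-D array even when there is only one group
fig, axes = plt.subplots(nrows=len(group_list), ncols=1, squeeze=False)
for (index, group), ax in zip(group_list, axes.flatten()):
    group['target'].plot(ax=ax, title=str(index))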
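If the empty groups come from a categorical column, an alternative to filtering by hand might be to let pandas skip unobserved categories. The following is a minimal sketch under that assumption, with a made-up DataFrame; observed=True and cat.remove_unused_categories() are existing pandas features, but whether they fit depends on how conf_bin was created.

```python
import pandas as pd

# Hypothetical data: four declared categories, only two of them present.
df = pd.DataFrame({
    "target": [5, 4, 4],
    "conf_bin": pd.Categorical([0.5, 0.6, 0.5],
                               categories=[0.45, 0.5, 0.55, 0.6]),
})

# observed=True keeps only the categories that actually occur in the data.
non_empty = [(key, grp) for key, grp in df.groupby("conf_bin", observed=True)]
print([key for key, _ in non_empty])

# Alternatively, drop the unused categories from the column itself.
trimmed = df["conf_bin"].cat.remove_unused_categories()
print(list(trimmed.cat.categories))
```

Either variant keeps the grouping inside pandas, so the result can still be used with apply or agg rather than only as a Python list.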

The main advantages of using the categorical dtype are:

  • Memory efficiency. The data is stored as integer codes, which are smaller than strings, so the category type usually requires less memory to store the same amount of data than object or int type data.
  • Faster processing. Categorical operations such as groupby are generally faster than equivalent operations on object or int type data, because they can be performed on the integer codes, which are more efficient to work with than strings.
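The memory claim can be checked with a rough comparison like the one below. The labels and sizes are invented for illustration, and the actual savings depend on how many distinct values the column holds.

```python
import pandas as pd

# Hypothetical column with only three distinct string labels.
labels = pd.Series(["low", "medium", "high"] * 10_000, dtype=object)
as_category = labels.astype("category")

# deep=True also counts the memory held by the Python string objects.
obj_bytes = labels.memory_usage(deep=True)
cat_bytes = as_category.memory_usage(deep=True)
print(obj_bytes, cat_bytes)
```

With few distinct values repeated many times, the categorical version should come out considerably smaller; a column where nearly every value is unique gains little or nothing.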

The cons are:

  • groupby output: the output of a groupby can be very messy. A lot of NaN values are generated, depending on your category values.
  • The same problem applies to filtering.
  • Concatenation issues with the category type: the category type is linked to a dictionary of values, so when you concatenate or merge you can run into trouble and lose the category dtype.
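Two of these drawbacks can be seen in a small sketch. The data and variable names are hypothetical; union_categoricals from pandas.api.types is one possible way to keep the dtype when combining, not necessarily what the linked article recommends.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# NaN in groupby output: "z" is a declared category with no rows.
df = pd.DataFrame({
    "key": pd.Categorical(["x", "y"], categories=["x", "y", "z"]),
    "target": [1, 2],
})
means = df.groupby("key", observed=False)["target"].mean()
print(means)  # the entry for "z" is NaN

# Concatenation: two categorical Series with different categories.
a = pd.Series(["x", "y"], dtype="category")
b = pd.Series(["y", "z"], dtype="category")
combined = pd.concat([a, b], ignore_index=True)
print(combined.dtype)  # no longer category, since the categories differ

# union_categoricals merges the category sets and keeps the dtype.
merged = pd.Series(union_categoricals([a, b]))
print(merged.dtype)
```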

This article covers the topic in more depth: https://medium.com/gitconnected/pandas-category-type-pros-and-cons-1bcac1bdea71
