Pandas returns empty groups in groupby

Question

I have a Pandas DataFrame with 3 columns: target, pred and conf_bin. If I run groupby(by='conf_bin').apply(...), my apply function gets called with empty DataFrames for values that do not appear in the conf_bin column. How is this possible?

Details

The DataFrame looks like this:

        target  pred conf_bin
0            5     6     0.50
1            4     4     0.60
2            4     4     0.50
3            4     3     0.50
4            4     5     0.50
5            5     5     0.55
6            5     5     0.55
7            5     5     0.55

As you can see, conf_bin is a numeric bin whose values come from np.arange(0, 1, 0.05). However, not all of those values are present in the data:

In [224]: grp = tp.groupby(by='conf_bin')

In [225]: grp.groups.keys()
Out[225]: dict_keys([0.5, 0.60000000000000009, 0.35000000000000003, 0.75, 0.85000000000000009, 0.65000000000000002, 0.55000000000000004, 0.80000000000000004, 0.20000000000000001, 0.45000000000000001, 0.40000000000000002, 0.30000000000000004, 0.70000000000000007, 0.25])

So, for example, the values 0 and 0.05 do not appear. However, when I run apply on the groups, my function does get called for those values:

In [226]: grp.apply(lambda x: x.shape)
Out[226]:
conf_bin
0.00        (0, 3)
0.05        (0, 3)
0.10        (0, 3)
0.15        (0, 3)
0.20       (22, 3)
0.25       (75, 3)
0.30       (95, 3)
0.35      (870, 3)
0.40     (8505, 3)
0.45    (40068, 3)
0.50    (51238, 3)
0.55    (54305, 3)
0.60    (47191, 3)
0.65    (38977, 3)
0.70    (34444, 3)
0.75    (20435, 3)
0.80     (3352, 3)
0.85        (4, 3)
0.90        (0, 3)
dtype: object

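Questions:

How does Pandas know that the values 0.0 and 0.05 are "meaningful", given that they do not appear in my DataFrame?
Why does it call my apply function with empty DataFrame objects for values that do not appear in grp.groups?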

Answer 1

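I ran into this problem too. It came up when I tried to create a subplot for each category in my dataframe.

I came up with the following workaround (based on this SO post), which pulls the non-empty groups into a list.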

groups = df.groupby('conf_bin')
group_list = [(index, group) for index, group in groups if len(group) > 0]

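It does break the implicit contract that you wrangle your data inside pandas, and it could become hard to manage, but it works.

You can now iterate over the list of groups with the same interface as the groupby object, e.g.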

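import matplotlib.pyplot as plt

# squeeze=False keeps axes a 2-D array even when there is only one group
fig, axes = plt.subplots(nrows=len(group_list), ncols=1, squeeze=False)
for (index, group), ax in zip(group_list, axes.flatten()):
    group['target'].plot(ax=ax, title=index)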
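
As a possible alternative to filtering the groups by hand, here is a minimal sketch that assumes conf_bin is a categorical column (the question does not say so explicitly, but it would explain the empty groups). The data and the bins list are made up for illustration. The observed parameter of groupby controls whether categories that never occur in the data still get a group; passing it explicitly avoids depending on the default, which differs between pandas versions.

```python
import pandas as pd

# Hypothetical setup: 20 bins from 0.00 to 0.95, of which only three occur.
bins = [round(i * 0.05, 2) for i in range(20)]
df = pd.DataFrame({
    "target": [5, 4, 4, 4, 4, 5, 5, 5],
    "pred": [6, 4, 4, 3, 5, 5, 5, 5],
    "conf_bin": pd.Categorical(
        [0.50, 0.60, 0.50, 0.50, 0.50, 0.55, 0.55, 0.55], categories=bins
    ),
})

# observed=False reports one group per category, including the empty ones.
all_sizes = df.groupby("conf_bin", observed=False).size()

# observed=True reports only the categories that occur in the rows.
seen_sizes = df.groupby("conf_bin", observed=True).size()

print(len(all_sizes), len(seen_sizes))
```

With observed=True the groupby object can be iterated directly for plotting, so the intermediate list may not be needed.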

Answer 2

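The main advantages of using the categorical data type are:

Memory efficiency. The data is stored as integer codes, which are smaller than strings, so a category column needs less memory than an object column, and usually less than an int64 column when there are few distinct values, to store the same amount of data.
Faster processing. Operations on categorical data, such as group by, are often faster than the equivalent operations on object or int data, because they can work on the integer codes, which is more efficient than working on strings.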
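
The memory point can be checked with a rough sketch like the one below. The column contents are invented for illustration, and the exact byte counts depend on the pandas version and platform, so only the comparison matters.

```python
import pandas as pd

# Hypothetical columns: 150,000 values drawn from only three distinct ones.
labels = pd.Series(["low", "medium", "high"] * 50_000, dtype=object)
numbers = pd.Series([10, 20, 30] * 50_000, dtype="int64")

labels_cat = labels.astype("category")
numbers_cat = numbers.astype("category")

# deep=True also counts the Python string objects behind an object column.
object_bytes = labels.memory_usage(deep=True)
labels_cat_bytes = labels_cat.memory_usage(deep=True)
int_bytes = numbers.memory_usage(deep=True)
numbers_cat_bytes = numbers_cat.memory_usage(deep=True)

print(object_bytes, labels_cat_bytes)
print(int_bytes, numbers_cat_bytes)

# With so few categories the codes fit into a small integer type.
print(labels_cat.cat.codes.dtype)
```

The saving shrinks as the number of distinct values grows, because the codes need a wider integer type and the category table itself gets larger.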

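The disadvantages are:

groupby output: the output of groupby gets messy. Depending on your category values, it can contain a lot of NaN rows, because categories that never occur in the data can still get a group.
The same problem applies to filtering.
Joining problems with category types: a category type is tied to its dictionary of values, so when you concatenate or merge columns whose categories differ you will run into trouble and the category dtype is lost.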
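
The three drawbacks listed above can be reproduced on a small made-up frame. This is only a sketch: the column names grade and score are hypothetical, and observed=False is passed explicitly because the default differs between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "grade": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
    "score": [1.0, 2.0, 3.0],
})

# 1. The unobserved category "c" appears as a NaN row in the aggregate.
means = df.groupby("grade", observed=False)["score"].mean()

# 2. Filtering rows does not drop categories: "b" and "c" are still listed
#    until they are removed explicitly.
only_a = df[df["grade"] == "a"]
leftover = list(only_a["grade"].cat.categories)
trimmed = list(only_a["grade"].cat.remove_unused_categories().cat.categories)

# 3. Concatenating categoricals whose categories differ gives a result
#    that is no longer categorical.
left = pd.Series(["a", "b"], dtype="category")
right = pd.Series(["b", "c"], dtype="category")
combined = pd.concat([left, right], ignore_index=True)

print(means)
print(leftover, trimmed)
print(combined.dtype)
```

Calling remove_unused_categories after filtering, or grouping with observed=True, are two ways to keep the empty groups out of the result.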

You can find more in-depth information in this article: https://medium.com/gitconnected/pandas-category-type-pros-and-cons-1bcac1bdea71

Answer 1
2 2018-03-27 19:02:36

Answer 2
0 2023-01-27 22:23:38
