Pandas returning empty groups in groupby
I have a Pandas DataFrame with 3 columns: target, pred, and conf_bin. If I run groupby(by='conf_bin').apply(...), my apply function gets called with empty DataFrames for values that do not appear in the conf_bin column. How is this possible?
Details
The DataFrame looks something like this:
target pred conf_bin
0 5 6 0.50
1 4 4 0.60
2 4 4 0.50
3 4 3 0.50
4 4 5 0.50
5 5 5 0.55
6 5 5 0.55
7 5 5 0.55
Obviously conf_bin is a numeric bin with values in the range np.arange(0, 1, 0.05). However, not all values are present in the data:
In [224]: grp = tp.groupby(by='conf_bin')
In [225]: grp.groups.keys()
Out[225]: dict_keys([0.5, 0.60000000000000009, 0.35000000000000003, 0.75, 0.85000000000000009, 0.65000000000000002, 0.55000000000000004, 0.80000000000000004, 0.20000000000000001, 0.45000000000000001, 0.40000000000000002, 0.30000000000000004, 0.70000000000000007, 0.25])
So, for example, the values 0 and 0.05 do not appear. However, when I run an apply on the group, my function does get called for these values:
In [226]: grp.apply(lambda x: x.shape)
Out[226]:
conf_bin
0.00 (0, 3)
0.05 (0, 3)
0.10 (0, 3)
0.15 (0, 3)
0.20 (22, 3)
0.25 (75, 3)
0.30 (95, 3)
0.35 (870, 3)
0.40 (8505, 3)
0.45 (40068, 3)
0.50 (51238, 3)
0.55 (54305, 3)
0.60 (47191, 3)
0.65 (38977, 3)
0.70 (34444, 3)
0.75 (20435, 3)
0.80 (3352, 3)
0.85 (4, 3)
0.90 (0, 3)
dtype: object
Questions:
How does Pandas know that the values 0.0 and 0.05 "make sense", given that they do not appear in my DataFrame?
Why is it calling my apply function with empty DataFrame objects for values that do not appear in grp.groups?

I too was having this problem, which popped up when trying to create subplots for every category in my dataframe.
I came up with the following workaround (based on this SO post): pull the non-empty groups out into a list.
groups = df.groupby('conf_bin')
group_list = [(index, group) for index, group in groups if len(group) > 0]
It does break the implicit contract that "you wrangle your data in pandas", and probably mismanages memory, but it works.
Now you can iterate through your groupby list with the same interface as with a groupby object, e.g.:
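import matplotlib.pyplot as plt

# squeeze=False keeps axes a 2-D array even when only one group is left
fig, axes = plt.subplots(nrows=len(group_list), ncols=1, squeeze=False)
for (index, group), ax in zip(group_list, axes.flatten()):
    group['target'].plot(ax=ax, title=index)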
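If conf_bin is a categorical column (an assumption; the question does not show its dtype, but groups for values absent from the data are what a categorical grouper produces), the filtering may not need to happen in Python at all. A minimal sketch on a made-up miniature of the data, using the observed parameter of groupby:

```python
import pandas as pd

# Hypothetical miniature of the question's data. The categories are declared
# explicitly so that 0.0 and 0.05 exist as categories but never occur in a row.
bins = [0.0, 0.05, 0.5, 0.55, 0.6]
df = pd.DataFrame({
    "target": [5, 4, 4, 5],
    "pred": [6, 4, 4, 5],
    "conf_bin": pd.Categorical([0.5, 0.6, 0.5, 0.55], categories=bins),
})

# observed=False keeps one group per declared category, including empty ones.
all_sizes = df.groupby("conf_bin", observed=False).size()

# observed=True restricts the grouping to categories present in the data.
seen_sizes = df.groupby("conf_bin", observed=True).size()

print(all_sizes)
print(seen_sizes)
```

With observed=True the groupby object itself only contains the non-empty groups, so it can be iterated directly instead of building group_list first.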
The main advantages of using categorical dtype are:
The cons are:
You could have more in-depth information from this article: https://medium.com/gitconnected/pandas-category-type-pros-and-cons-1bcac1bdea71
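One property of the categorical dtype that seems relevant to the question: a categorical column stores its set of categories separately from the values, so a category can exist without any row using it. The sketch below assumes the bins were produced with something like pd.cut (the question does not say how conf_bin was built); the variable names and sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical confidence scores and bin edges, not taken from the question.
conf = pd.Series([0.52, 0.61, 0.50, 0.57])
edges = [0.0, 0.05, 0.5, 0.55, 0.6, 0.65]

# Label each half-open bin [left, right) by its left edge.
binned = pd.cut(conf, bins=edges, right=False, labels=edges[:-1])

# Every bin is a category, even the ones no value fell into.
print(binned.cat.categories)

# Dropping unused categories leaves only the bins that actually occur.
trimmed = binned.cat.remove_unused_categories()
print(trimmed.cat.categories)
```

Grouping by a column like trimmed cannot produce empty groups, because the unused categories are no longer part of the dtype.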