
Possible Bug in pandas.groupby.agg?

I might have found a bug in pandas.groupby.agg. Try the following code. What gets passed to the aggregate function fn() appears to be a DataFrame that includes the key column. In my understanding, the agg function is applied to each column separately, so only one column should be passed at a time. Since the 'year' column is the groupby key, it should be removed from the grouped data.

import pandas as pd
import numpy as np

df = pd.DataFrame({'year' : [2011,2011,2012,2012,2013], '5-1' : [1.2, 2.1,2.1,11., 13.]})

def fn(x):
    print(x)
    #return np.mean(x) will explode
    return 0


res = df.groupby('year').agg(fn)
print(res)

The above gives the following output, which clearly shows that x in fn(x) is passed as a DataFrame with two columns (year, 5-1).

   5-1  year
0  1.2  2011
1  2.1  2011
    5-1  year
2   2.1  2012
3  11.0  2012
   5-1  year
4   13  2013
      5-1
year     
2011    0
2012    0
2013    0

To answer your question: if you absolutely want the function applied to a Series, use the {column: aggfunc} syntax in .agg().
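As a minimal sketch of the {column: aggfunc} form, assuming a reasonably recent pandas, the snippet below records the type of every object handed to fn. The seen_types list is a hypothetical helper added only for illustration; it is not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({'year': [2011, 2011, 2012, 2012, 2013],
                   '5-1': [1.2, 2.1, 2.1, 11.0, 13.0]})

seen_types = []  # hypothetical helper: records what fn receives

def fn(x):
    # Remember the type of each object passed in, then aggregate.
    seen_types.append(type(x).__name__)
    return x.mean()

# Mapping a column name to the function targets that column only,
# so fn should be called with one Series per group.
res = df.groupby('year').agg({'5-1': fn})
print(res)
print(set(seen_types))
```

With the dict form the grouping column is never part of what fn sees, because the function is bound to the '5-1' column explicitly.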

That said, your code seems to work fine (at least on the current master). The function isn't actually being applied to the year column.


A bit of explanation. For this I'm assuming that you are on an older version of pandas, and that that version had a bug that has since been patched. To reproduce the behavior I think you were getting, let's redefine fn:

In [32]: def fn(x):
    print("Printing x+1 : {}".format(x + 1))
    print("Printing x: {}".format(x))
    return 0

And let's redefine df['year']:

In [33]: df['year'] = ['a', 'a', 'b', 'b', 'c']

All these objects are defined in pandas/core/groupby.py. The df.groupby('year') part returns a DataFrameGroupBy object, since df is a DataFrame. .agg() isn't actually defined on DataFrameGroupBy; it's defined on its parent class, NDFrameGroupBy.
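To check the class hierarchy on whatever pandas version is installed, a small inspection sketch like the following may help. The internals have been reorganized across releases, so the parent classes it prints (and the class that owns agg) are not guaranteed to match the names mentioned above.

```python
import pandas as pd

df = pd.DataFrame({'year': [2011, 2011, 2012, 2012, 2013],
                   '5-1': [1.2, 2.1, 2.1, 11.0, 13.0]})

grouped = df.groupby('year')

# Name of the object returned by df.groupby(...)
class_name = type(grouped).__name__

# Names of all classes in the method resolution order, most specific first
mro_names = [cls.__name__ for cls in type(grouped).__mro__]

# First class in the MRO whose own namespace defines `agg`
agg_owner = next(cls.__name__ for cls in type(grouped).__mro__
                 if 'agg' in vars(cls))

print(class_name)
print(mro_names)
print(agg_owner)
```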

Since this isn't a Cython function, things get handed off to NDFrameGroupBy._aggregate_generic(). That tries to execute the function, and if it fails, falls back to a separate section of code:

    try:
        for name, data in self:
            result[name] = self._try_cast(func(data, *args, **kwargs),
                                          data)
    except Exception:
        return self._aggregate_item_by_item(func, *args, **kwargs)

If the try part succeeds, the function is applied to the entire object (which is why printing x shows both columns), and the results are presented nicely with the grouper on the index and the values in the columns.

If the try part fails, things are handed off to _aggregate_item_by_item, which excludes the grouping column.

This means that by changing your code from return np.mean(x) to return 0, you actually changed the path the code follows. Before, when you tried to take the mean, I think it failed and fell back to _aggregate_item_by_item (that's why I had you redefine df['year'] and fn; that version will fail for sure). But when you switched to return 0, that succeeded, and so the code followed the try part.
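The two paths can be illustrated with a simplified stand-in written against public pandas only. This is a sketch of the dispatch logic described above, not the actual pandas implementation: aggregate_sketch and the path labels it returns are hypothetical names, and the real code also casts results and handles many more cases.

```python
import pandas as pd

def aggregate_sketch(df, key, func):
    """Hypothetical simplification of the try/except dispatch."""
    # Each sub-frame keeps the grouping column, as in the output shown earlier.
    groups = {name: df[df[key] == name] for name in df[key].unique()}
    try:
        # First attempt: func receives the whole sub-frame per group.
        result = {name: func(data) for name, data in groups.items()}
        return pd.Series(result), "whole-frame"
    except Exception:
        # Fallback: func receives one non-key column at a time.
        value_cols = [c for c in df.columns if c != key]
        result = {col: {name: func(data[col]) for name, data in groups.items()}
                  for col in value_cols}
        return pd.DataFrame(result), "item-by-item"

df = pd.DataFrame({'year': ['a', 'a', 'b', 'b', 'c'],
                   '5-1': [1.2, 2.1, 2.1, 11.0, 13.0]})

# Returning a constant never touches the data, so the first attempt succeeds.
out_const, path_const = aggregate_sketch(df, 'year', lambda x: 0)

# x + 1 fails on a frame containing the string-valued 'year' column,
# which pushes execution onto the per-column fallback.
out_plus, path_plus = aggregate_sketch(df, 'year', lambda x: (x + 1).mean())

print(path_const)
print(path_plus)
print(out_plus)
```

The same function body therefore decides which kind of object it will be called with, which is why swapping np.mean(x) for a constant changed what print showed.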

This is all just a bit of guesswork, but I think that's what's happening.

I'm actually working on the groupby code right now, and this issue has come up (see here). I don't think the function should ever be applied to the grouping column, but it sometimes is (R does the same). Post there if you have an opinion on the matter.

If year is not included in the aggregation, how do you know which group you are aggregating?
