pandas concat arrays on groupby
I have a DataFrame which was created by a groupby with:
agg_df = df.groupby(['X', 'Y', 'Z']).agg({
    'amount': np.sum,
    'ID': pd.Series.unique,
})
After applying some filtering on agg_df, I want to concat the IDs:
agg_df = agg_df.groupby(['X', 'Y']).agg({  # Z is not in the groupby now
    'amount': np.sum,
    'ID': pd.Series.unique,
})
But I get an error at the second 'ID': pd.Series.unique:
ValueError: Function does not reduce
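The root cause can be seen in miniature: pd.Series.unique returns a NumPy array rather than a scalar, and older pandas versions insist that an aggregation function reduce each group to a single value. A quick check on toy data (not from the question):

```python
import pandas as pd

# pd.Series.unique returns an ndarray (values in order of first
# appearance), not a scalar -- which is why older pandas rejects it
# as a reducer once the grouped cells are themselves arrays.
s = pd.Series([2, 1, 2])
u = pd.Series.unique(s)
print(type(u).__name__)  # ndarray
print(list(u))           # [2, 1]
```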
As an example, the dataframe before the second groupby is:
               |amount|  ID   |
-----+----+----+------+-------+
  X  | Y  | Z  |      |       |
-----+----+----+------+-------+
 a1  | b1 | c1 |  10  |   2   |
     |    | c2 |  11  |   1   |
 a3  | b2 | c3 |   2  | [5,7] |
     |    | c4 |   7  |   3   |
 a5  | b3 | c3 |  12  | [6,3] |
     |    | c5 |  17  | [3,4] |
 a7  | b4 | c6 |   2  | [8,9] |
And the expected outcome should be:
          |amount|    ID     |
-----+----+------+-----------+
  X  | Y  |      |           |
-----+----+------+-----------+
 a1  | b1 |  21  |  [2,1]    |
 a3  | b2 |   9  |  [5,7,3]  |
 a5  | b3 |  29  |  [6,3,4]  |
 a7  | b4 |   2  |  [8,9]    |
The order of the final IDs is not important.
Edit: I have come up with one solution, but it's not quite elegant:
import collections.abc

def combine_ids(x):
    def asarray(elem):
        if isinstance(elem, collections.abc.Iterable):
            return np.asarray(list(elem))
        return elem
    res = np.array([asarray(elem) for elem in x.values], dtype=object)
    res = np.unique(np.hstack(res))
    return set(res)
agg_df = agg_df.groupby(['X', 'Y']).agg({  # Z is not in the groupby now
    'amount': np.sum,
    'ID': combine_ids,
})
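For illustration, here is that helper exercised on a toy Series mixing scalar IDs with an array-valued cell (the data is made up; collections.abc is used because Iterable moved there in Python 3.10, and dtype=object keeps NumPy from rejecting the ragged input):

```python
import collections.abc

import numpy as np
import pandas as pd

# Same helper as above: normalize each cell to something np.hstack
# can flatten, then deduplicate and return a set.
def combine_ids(x):
    def asarray(elem):
        if isinstance(elem, collections.abc.Iterable):
            return np.asarray(list(elem))
        return elem
    res = np.array([asarray(elem) for elem in x.values], dtype=object)
    return set(np.unique(np.hstack(res)))

# Toy Series mixing scalar IDs with an array-valued cell
s = pd.Series([2, np.array([5, 7]), 3, 3], dtype=object)
result = combine_ids(s)  # equal to {2, 3, 5, 7}
```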
Edit2: Another solution which works in my case is:
combine_ids = lambda x: set(np.hstack(x.values))
Edit3: It seems that it is not possible to avoid set() as the resulting value, due to the implementation of Pandas' aggregation functions. Details in https://stackoverflow.com/a/16975602/3142459
If you're fine using sets as your type (which I probably would), then I would go with:
agg_df = df.groupby(['x', 'y', 'z']).agg({
    'amount': np.sum, 'id': lambda s: set(s)})
agg_df.reset_index().groupby(['x', 'y']).agg({
    'amount': np.sum, 'id': lambda s: set.union(*s)})
...which works for me. For some reason, the lambda s: set(s) works, but set doesn't (I'm guessing somewhere pandas isn't doing duck-typing correctly).
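On made-up data shaped like the question's, the two-step aggregation above runs like this (column names follow the answer; 'sum' stands in for np.sum, which newer pandas deprecates inside agg):

```python
import pandas as pd

# Toy frame: two Z-rows under (a1, b1), one under (a3, b2)
df = pd.DataFrame({
    'x': ['a1', 'a1', 'a3'],
    'y': ['b1', 'b1', 'b2'],
    'z': ['c1', 'c2', 'c3'],
    'amount': [10, 11, 2],
    'id': [2, 1, 5],
})

# Step 1: reduce each (x, y, z) group, packing the IDs into a set
step1 = df.groupby(['x', 'y', 'z']).agg(
    {'amount': 'sum', 'id': lambda s: set(s)})

# Step 2: drop z and union the per-z sets
final = step1.reset_index().groupby(['x', 'y']).agg(
    {'amount': 'sum', 'id': lambda s: set.union(*s)})
# final has one row per (x, y): summed amounts, unioned ID sets
```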
If your data is large, you'll probably want the following instead of lambda s: set.union(*s):
from functools import reduce

# can't partial b/c args are positional-only
def cheaper_set_union(s):
    return reduce(set.union, s, set())
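A quick sanity check of the reduce-based union (helper redefined here so the snippet stands alone): it folds set.union pairwise instead of expanding every group's sets into one giant argument list.

```python
from functools import reduce

def cheaper_set_union(s):
    # Fold pairwise; set() seeds the fold so empty input is safe
    return reduce(set.union, s, set())

assert cheaper_set_union([{1, 2}, {2, 3}, set()]) == {1, 2, 3}
assert cheaper_set_union([]) == set()  # initializer covers the empty case
```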
When your aggregation function returns a Series, pandas won't necessarily know you want it packed into a single cell. As a more general solution, just explicitly coerce the result to a list:
agg_df = df.groupby(['X', 'Y', 'Z']).agg({
    'amount': np.sum,
    'ID': lambda x: list(x.unique()),
})
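To carry the list-based approach through the second groupby as well, something like itertools.chain can flatten the per-Z lists. A sketch on made-up data (the chain step is an addition for illustration, not part of the answer above):

```python
from itertools import chain

import pandas as pd

# Made-up frame with the question's shape
df = pd.DataFrame({
    'X': ['a1', 'a1', 'a3'],
    'Y': ['b1', 'b1', 'b2'],
    'Z': ['c1', 'c2', 'c3'],
    'amount': [10, 11, 2],
    'ID': [2, 1, 5],
})

# First groupby: coerce each group's unique IDs to a plain list
agg_df = df.groupby(['X', 'Y', 'Z']).agg(
    {'amount': 'sum', 'ID': lambda x: list(x.unique())})

# Second groupby (by the X/Y index levels): flatten the per-Z lists
final = agg_df.groupby(['X', 'Y']).agg(
    {'amount': 'sum', 'ID': lambda x: list(chain.from_iterable(x))})
```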