
pandas concat arrays on groupby

I have a DataFrame which was created by a groupby:

import numpy as np
import pandas as pd

agg_df = df.groupby(['X', 'Y', 'Z']).agg({
    'amount':np.sum,
    'ID': pd.Series.unique,
})

After applying some filtering on agg_df, I want to concatenate the IDs:

agg_df = agg_df.groupby(['X', 'Y']).agg({ # Z is not in the groupby now
    'amount':np.sum,
    'ID': pd.Series.unique,
})

But I get an error at the second 'ID': pd.Series.unique:

ValueError: Function does not reduce

As an example, the dataframe before the second groupby is:

               |amount|  ID   |
-----+----+----+------+-------+
  X  | Y  | Z  |      |       |
-----+----+----+------+-------+
  a1 | b1 | c1 |  10  | 2     |
     |    | c2 |  11  | 1     |
  a3 | b2 | c3 |   2  | [5,7] |
     |    | c4 |   7  | 3     |
  a5 | b3 | c3 |  12  | [6,3] |
     |    | c5 |  17  | [3,4] |
  a7 | b4 | c6 |  2   | [8,9] |

And the expected outcome should be:

          |amount|  ID       |
-----+----+------+-----------+
  X  | Y  |      |           |
-----+----+------+-----------+
  a1 | b1 |  21  | [2,1]     |
  a3 | b2 |   9  | [5,7,3]   |
  a5 | b3 |  29  | [6,3,4]   |
  a7 | b4 |  2   | [8,9]     |

The order of the final IDs is not important.

Edit: I have come up with one solution, but it's not quite elegant:

import collections.abc

import numpy as np

def combine_ids(x):
    def asarray(elem):
        # Wrap iterables (already-aggregated ID arrays) as numpy arrays; leave scalar IDs as-is
        if isinstance(elem, collections.abc.Iterable):
            return np.asarray(list(elem))
        return elem

    res = np.array([asarray(elem) for elem in x.values], dtype=object)
    res = np.unique(np.hstack(res))
    return set(res)

agg_df = agg_df.groupby(['X', 'Y']).agg({ # Z is not in the groupby now
    'amount':np.sum,
    'ID': combine_ids,
})

Edit2: Another solution which works in my case is:

combine_ids = lambda x: set(np.hstack(x.values))
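
This works because np.hstack flattens the mix of scalar IDs and ID arrays that the first aggregation leaves in the ID column, and set() then de-duplicates. A minimal sketch of that behaviour (the values below are invented for illustration):

import numpy as np

# The ID column after the first groupby holds a mix of scalars and arrays
values = np.array([2, np.array([5, 7]), 3], dtype=object)

flat = np.hstack(values)   # -> array([2, 5, 7, 3])
unique_ids = set(flat)     # -> {2, 3, 5, 7} (as numpy integers)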

Edit3: It seems that it is not possible to avoid set() as the resulting value, due to the implementation of pandas' aggregation functions. Details in https://stackoverflow.com/a/16975602/3142459

If you're fine using sets as your type (which I probably would be), then I would go with:

agg_df = df.groupby(['X', 'Y', 'Z']).agg({
    'amount': np.sum, 'ID': lambda s: set(s)})
agg_df.reset_index().groupby(['X', 'Y']).agg({
    'amount': np.sum, 'ID': lambda s: set.union(*s)})

...which works for me. For some reason, lambda s: set(s) works, but set doesn't (I'm guessing somewhere pandas isn't doing duck-typing correctly).
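
A self-contained sketch of this two-step aggregation, using invented sample data that mirrors the question's example:

import numpy as np
import pandas as pd

# Invented sample data reproducing part of the question's example
df = pd.DataFrame({
    'X':      ['a1', 'a1', 'a1', 'a3', 'a3', 'a3'],
    'Y':      ['b1', 'b1', 'b1', 'b2', 'b2', 'b2'],
    'Z':      ['c1', 'c1', 'c2', 'c3', 'c3', 'c4'],
    'amount': [4,    6,    11,   1,    1,    7],
    'ID':     [2,    2,    1,    5,    7,    3],
})

# First aggregation: sum amounts and collect unique IDs per (X, Y, Z) as sets
agg_df = df.groupby(['X', 'Y', 'Z']).agg({
    'amount': np.sum, 'ID': lambda s: set(s)})

# Second aggregation: sum again and union the ID sets per (X, Y)
result = agg_df.reset_index().groupby(['X', 'Y']).agg({
    'amount': np.sum, 'ID': lambda s: set.union(*s)})

print(result)
#        amount         ID
# X  Y
# a1 b1     21     {1, 2}
# a3 b2      9  {3, 5, 7}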

If your data is large, you'll probably want the following instead of lambda s: set.union(*s):

from functools import reduce
# can't partial b/c args are positional-only
def cheaper_set_union(s):
    return reduce(set.union, s, set())
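
Plugged into the second aggregation, that would look roughly like this (reusing the agg_df from the sketch above):

# Same union per (X, Y), built incrementally instead of unpacking all sets at once
agg_df.reset_index().groupby(['X', 'Y']).agg({
    'amount': np.sum, 'ID': cheaper_set_union})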

When your aggregation function returns an array or Series, pandas won't necessarily know that you want it packed into a single cell. As a more general solution, just explicitly coerce the result to a list:

agg_df = df.groupby(['X', 'Y', 'Z']).agg({
    'amount':np.sum,
    'ID': lambda x: list(x.unique()),
})
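
With this approach the ID cells then hold Python lists, so the second groupby from the question needs to flatten them. A sketch of that step, assuming the agg_df produced just above (depending on your pandas version, returning a list from agg may still raise the "Function does not reduce" error, in which case returning a set instead avoids it):

from itertools import chain

# Second-level aggregation: flatten the per-group lists and drop duplicate IDs
agg_df = agg_df.groupby(['X', 'Y']).agg({
    'amount':np.sum,
    'ID': lambda x: list(set(chain.from_iterable(x))),
})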
