Efficient way to union non-set iterables within groups
I have this df:
import pandas as pd

df = pd.DataFrame(dict(
    A=['b', 'a', 'b', 'c', 'a', 'c', 'a', 'c', 'a', 'a'],
    B=[[0, 2, 3, 1],
       [9, 6, 7, 2],
       [6, 0, 1, 4],
       [9, 2, 5, 1],
       [5, 1, 4, 8],
       [8, 5, 6, 6],
       [0, 9, 0, 0],
       [2, 6, 1, 8],
       [7, 3, 2, 6],
       [8, 7, 1, 9]]
))
I want to group by 'A' and union all the lists in 'B'.
Neither df.groupby('A').B.union() nor df.groupby('A').B.apply(set.union) works.
I want the result to be:
A
a {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
b {0, 1, 2, 3, 4, 6}
c {1, 2, 5, 6, 8, 9}
Name: B, dtype: object
The problem is that the lists have to be cast to sets before a union can be applied. One solution is to use sum to concatenate the lists within each group, then cast the result to a set using map:
In [28]: df.groupby('A').B.sum().map(set)
Out[28]:
A
a {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
b {0, 1, 2, 3, 4, 6}
c {1, 2, 5, 6, 8, 9}
dtype: object
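To see what the sum step does inside a single group, here is a minimal pure-Python sketch (no pandas) using the two lists of group 'b' from the question. It assumes, as a simplification, that summing a group's lists behaves like repeated list concatenation; the variable names are made up for illustration.

```python
# Lists belonging to group 'b' in the question's DataFrame.
group_b = [[0, 2, 3, 1], [6, 0, 1, 4]]

# Adding lists concatenates them; the builtin sum needs [] as its start value.
concatenated = sum(group_b, [])

# Casting to a set removes the duplicates (0 and 1 appear twice).
result = set(concatenated)

print(concatenated)
print(result)
```

The concatenated list keeps every duplicate until the final cast, which is the memory cost the next answer refers to.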
maxymoo's answer is nice, but since it first concatenates all the lists, it may use more memory than necessary (especially if there are many duplicates). Instead, you can first convert column B to sets and then reduce each group to a single set, which is more efficient. Like this:
df['B'] = df['B'].map(set)
A B
0 b {0, 1, 2, 3}
1 a {9, 2, 6, 7}
2 b {0, 1, 4, 6}
3 c {9, 2, 5, 1}
4 a {8, 1, 4, 5}
5 c {8, 5, 6}
6 a {0, 9}
7 c {8, 1, 2, 6}
8 a {2, 3, 6, 7}
9 a {8, 1, 9, 7}
from functools import reduce  # reduce is not a builtin in Python 3

df.groupby('A').B.apply(lambda x: reduce(set.union, x))
A
a {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
b {0, 1, 2, 3, 4, 6}
c {1, 2, 5, 6, 8, 9}
Name: B, dtype: object
Or, as a one-liner, as maxymoo points out:
df.groupby('A').B.apply(lambda x: reduce(set.union, x.map(set)))
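As a rough illustration of the memory argument, the following pure-Python sketch (no pandas) compares the two intermediate results for group 'a' from the question. The names group_a, concatenated, and unioned are illustrative assumptions, not part of either answer.

```python
from functools import reduce

# Lists belonging to group 'a' in the question's DataFrame.
group_a = [[9, 6, 7, 2], [5, 1, 4, 8], [0, 9, 0, 0], [7, 3, 2, 6], [8, 7, 1, 9]]

# Concatenation keeps every element, duplicates included.
concatenated = sum(group_a, [])

# Reducing over sets never holds more than the distinct values seen so far.
unioned = reduce(set.union, map(set, group_a))

# Both routes end at the same set; only the intermediate size differs.
assert set(concatenated) == unioned
print(len(concatenated), len(unioned))  # prints: 20 10
```

With only ten distinct values the difference is small here; it grows with the number of duplicates per group.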
I'd write a function and pass it to apply:
def f(x):
    # grabbing first one so I can
    # make a set out of it
    first, *rest = x.values.tolist()
    # union won't work unless it's on
    # a set, it doesn't care about the rest
    return set(first).union(*rest)
df.groupby('A').B.apply(f)
A
a {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
b {0, 1, 2, 3, 4, 6}
c {1, 2, 5, 6, 8, 9}
Name: B, dtype: object
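The point made in the comments of f can be checked without pandas. This small sketch uses the lists of group 'c' from the question and shows that union accepts arbitrary iterables as arguments as long as it is called on a real set, whereas calling set.union directly on lists raises a TypeError. The variable names are illustrative only.

```python
# Lists belonging to group 'c' in the question's DataFrame.
group_c = [[9, 2, 5, 1], [8, 5, 6, 6], [2, 6, 1, 8]]

# Only the first list has to become a set; the rest can stay lists.
first, *rest = group_c
result = set(first).union(*rest)

# Calling the unbound method with a list as its first argument fails.
try:
    set.union(*group_c)
    raised = False
except TypeError:
    raised = True

print(result)
print(raised)
```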