Pandas groupby: How to get a union of strings
I have a dataframe like this:
A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !
Calling
In [10]: print(df.groupby("A")["B"].sum())
will return
A
1 1.615586
2 0.421821
3 0.463468
4 0.643961
Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
I have been trying to find ways to do this.
Series.unique() ( http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html ) doesn't work, although
df.groupby("A")["B"]
is a
pandas.core.groupby.SeriesGroupBy object
so I was hoping any Series method would work. Any ideas?
In [4]: df = read_csv(StringIO(data), sep=r'\s+')
In [5]: df
Out[5]:
A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !
In [6]: df.dtypes
Out[6]:
A int64
B float64
C object
dtype: object
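The session above does not show how data was defined or which imports were in scope. A minimal self-contained sketch, assuming data is simply the question's table pasted as whitespace-separated text (the contents of data below are reconstructed for illustration, not taken from the original answer):

```python
from io import StringIO

import pandas as pd

# Assumed contents of `data`: the question's table as whitespace-separated text.
data = """A B C
1 0.749065 This
2 0.301084 is
3 0.463468 a
4 0.643961 random
1 0.866521 string
2 0.120737 !
"""

# A raw string keeps the regex separator free of escape-sequence warnings.
df = pd.read_csv(StringIO(data), sep=r"\s+")
print(df)
```

With the frame built this way, column A is parsed as integers, B as floats, and C as strings.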
When you apply your own function, non-numeric columns are not automatically excluded. This is slower, though, than applying .sum() directly to the groupby.
In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]:
A B C
A
1 2 1.615586 Thisstring
2 4 0.421821 is!
3 3 0.463468 a
4 4 0.643961 random
sum by default concatenates
In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]:
A
1 Thisstring
2 is!
3 a
4 random
dtype: object
You can do pretty much what you want
In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]:
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
dtype: object
Doing this on a whole frame, one group at a time. The key is to return a Series
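from pandas import Series

def f(x):
    # Build one row per group: numeric columns are summed, strings are joined.
    return Series(dict(A = x['A'].sum(),
                       B = x['B'].sum(),
                       C = "{%s}" % ', '.join(x['C'])))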
In [14]: df.groupby('A').apply(f)
Out[14]:
A B C
A
1 2 1.615586 {This, string}
2 4 0.421821 {is, !}
3 3 0.463468 {a}
4 4 0.643961 {random}
You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.
>>> d
A B
0 1 This
1 2 is
2 3 a
3 4 random
4 1 string
5 2 !
>>> d.groupby('A')['B'].apply(list)
A
1 [This, string]
2 [is, !]
3 [a]
4 [random]
dtype: object
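The set variant mentioned above can be sketched the same way. This is a small illustration that rebuilds the frame d from the session by hand, so the construction code is an assumption rather than part of the original answer:

```python
import pandas as pd

# Hand-built copy of the frame `d` shown in the session above.
d = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": ["This", "is", "a", "random", "string", "!"],
})

# apply(set) collects each group's strings into a Python set
# (unordered, duplicates dropped); apply(list) keeps order and duplicates.
sets = d.groupby("A")["B"].apply(set)
lists = d.groupby("A")["B"].apply(list)

print(sets)
print(lists)
```

A set is the closer match to the "union of strings" asked for in the question, while a list is preferable if the original row order matters.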
If you want something else, just write a function that does what you want and then apply that.
You may be able to use the aggregate (or agg) function to concatenate the values. (Untested code)
df.groupby('A')['B'].agg(lambda col: ''.join(col))
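Since the line above is marked as untested, here is a small runnable sketch of the same call. It assumes column B holds the strings (as in the two-column frame used in the previous answer); the frame construction is illustrative:

```python
import pandas as pd

# Illustrative frame: A is the grouping key, B holds the strings.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": ["This", "is", "a", "random", "string", "!"],
})

# agg calls the lambda once per group with that group's values as a Series,
# and str.join concatenates them with no separator.
joined = df.groupby("A")["B"].agg(lambda col: "".join(col))
print(joined)
```

Replacing "" with another separator such as ", " changes only how the strings are glued together.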
You can try this:
df.groupby('A').agg({'B':'sum','C':'-'.join})
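As a rough check of the one-liner above, the sketch below rebuilds the question's three-column frame by hand (the construction is an assumption for illustration) and applies a different aggregation to each column through the dict:

```python
import pandas as pd

# Hand-built copy of the question's frame.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# The dict maps each column to its own aggregation:
# B is summed numerically, C is joined with "-" as the separator.
result = df.groupby("A").agg({"B": "sum", "C": "-".join})
print(result)
```

Here "-".join is a bound string method, so agg can call it directly on each group's values without a lambda.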
pandas >= 0.25.0
Since pandas version 0.25.0 we have named aggregations, where we can group by, aggregate, and at the same time assign new names to our columns. This way we won't get MultiIndex columns, and the column names make more sense given the data they contain:
Aggregate and get a list of strings:
grp = df.groupby('A').agg(B_sum=('B', 'sum'),
                          C=('C', list)).reset_index()
print(grp)
A B_sum C
0 1 1.615586 [This, string]
1 2 0.421821 [is, !]
2 3 0.463468 [a]
3 4 0.643961 [random]
Aggregate and join the strings:
grp = df.groupby('A').agg(B_sum=('B', 'sum'),
                          C=('C', ', '.join)).reset_index()
print(grp)
A B_sum C
0 1 1.615586 This, string
1 2 0.421821 is, !
2 3 0.463468 a
3 4 0.643961 random
A simple solution would be:
>>> df.groupby(['A','B'])['C'].unique().reset_index()
If you'd like to overwrite column B in the dataframe, this should work:
df = df.groupby('A',as_index=False).agg(lambda x:'\n'.join(x))
Following @Erfan's good answer: most of the time, when analysing aggregated values, you want only the unique combinations of the existing string values:
unique_chars = lambda x: ', '.join(x.unique())
(df
.groupby(['A'])
.agg({'C': unique_chars}))
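The difference from a plain join only shows up when a group contains repeated strings. A minimal sketch, using a small made-up frame (the values are hypothetical and chosen only to contain duplicates):

```python
import pandas as pd

# Hypothetical data with repeated strings inside each group.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2],
    "C": ["x", "y", "x", "z", "z"],
})

# Series.unique() keeps the first occurrence of each value, in order.
unique_chars = lambda x: ", ".join(x.unique())

plain = df.groupby("A").agg({"C": ", ".join})      # keeps duplicates
deduped = df.groupby("A").agg({"C": unique_chars})  # drops duplicates

print(plain)
print(deduped)
```

For group 1 the plain join yields "x, y, x", while the unique version yields "x, y".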