Pandas groupby: How to get a union of strings

I have a dataframe like this:

   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !

Calling

In [10]: print(df.groupby("A")["B"].sum())

will return

A
1    1.615586
2    0.421821
3    0.463468
4    0.643961

Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.

A
1    {This, string}
2    {is, !}
3    {a}
4    {random}

I have been trying to find ways to do this.

Series.unique() ( http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html ) doesn't work, although

df.groupby("A")["B"]

is a

pandas.core.groupby.SeriesGroupBy object

so I was hoping any Series method would work. Any ideas?

In [4]: df = read_csv(StringIO(data),sep='\s+')

In [5]: df
Out[5]: 
   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !

In [6]: df.dtypes
Out[6]: 
A      int64
B    float64
C     object
dtype: object
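
The transcript above reads the frame from a `data` string that is not shown. As a minimal sketch, assuming `data` holds the question's table as whitespace-separated text (the variable contents here are reconstructed from the printed frame, not taken from the original session), the setup could look like this:

```python
from io import StringIO

import pandas as pd

# Assumed contents of `data`: the question's table as plain text.
data = """\
A B C
1 0.749065 This
2 0.301084 is
3 0.463468 a
4 0.643961 random
1 0.866521 string
2 0.120737 !
"""

# A raw string keeps the regex separator free of escape warnings.
df = pd.read_csv(StringIO(data), sep=r"\s+")
print(df)
```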

When you apply your own function, non-numeric columns are not automatically excluded. This is slower, though, than applying .sum() directly to the groupby.

In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]: 
   A         B           C
A                         
1  2  1.615586  Thisstring
2  4  0.421821         is!
3  3  0.463468           a
4  4  0.643961      random

sum by default concatenates

In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]: 
A
1    Thisstring
2           is!
3             a
4        random
dtype: object

You can do pretty much what you want

In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]: 
A
1    {This, string}
2           {is, !}
3               {a}
4          {random}
dtype: object

Doing this on a whole frame, one group at a time. The key is to return a Series.

def f(x):
     return Series(dict(A = x['A'].sum(), 
                        B = x['B'].sum(), 
                        C = "{%s}" % ', '.join(x['C'])))

In [14]: df.groupby('A').apply(f)
Out[14]: 
   A         B               C
A                             
1  2  1.615586  {This, string}
2  4  0.421821         {is, !}
3  3  0.463468             {a}
4  4  0.643961        {random}

You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.

>>> d
   A       B
0  1    This
1  2      is
2  3       a
3  4  random
4  1  string
5  2       !
>>> d.groupby('A')['B'].apply(list)
A
1    [This, string]
2           [is, !]
3               [a]
4          [random]
dtype: object

If you want something else, just write a function that does what you want and then apply that.
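
As a rough illustration of both suggestions, the sketch below applies `set` directly and then a small custom function. The helper name `sorted_unique` is made up for this example, and the frame uses the question's column names (A for the key, C for the strings).

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# Built-in callables can be passed straight to apply: one set per group.
as_sets = df.groupby("A")["C"].apply(set)

# Hypothetical helper: distinct strings of a group in sorted order.
def sorted_unique(values):
    return sorted(set(values))

as_sorted = df.groupby("A")["C"].apply(sorted_unique)

print(as_sets)
print(as_sorted)
```

Sets have no defined order, so the sorted variant may be preferable when the output needs to be reproducible.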

You may be able to use the aggregate (or agg) function to concatenate the values. (Untested code)

df.groupby('A')['C'].agg(lambda col: ''.join(col))

You can try this:

df.groupby('A').agg({'B':'sum','C':'-'.join})
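
A self-contained sketch of that dictionary form, assuming the question's frame (numeric B, string C), might look like the following. Each column gets its own aggregation: the name of a built-in reduction for B and an ordinary callable for C.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# Sum the numbers in B, and join the strings in C with a dash.
result = df.groupby("A").agg({"B": "sum", "C": "-".join})
print(result)
```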

Named aggregations with pandas >= 0.25.0

Since pandas version 0.25.0 we have named aggregations, which let us group by, aggregate, and assign new names to our columns at the same time. This way we won't get MultiIndex columns, and the column names make more sense given the data they contain:


Aggregate and get a list of strings

grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', list)).reset_index()

print(grp)
   A     B_sum               C
0  1  1.615586  [This, string]
1  2  0.421821         [is, !]
2  3  0.463468             [a]
3  4  0.643961        [random]

Aggregate and join the strings

grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', ', '.join)).reset_index()

print(grp)
   A     B_sum             C
0  1  1.615586  This, string
1  2  0.421821         is, !
2  3  0.463468             a
3  4  0.643961        random

A simple solution would be:

>>> df.groupby(['A','B'])['C'].unique().reset_index()

If you want to overwrite column B in the dataframe, this should work:

    df = df.groupby('A',as_index=False).agg(lambda x:'\n'.join(x))

Following @Erfan's good answer, most of the time when analysing aggregate values you want only the unique combinations of the existing string values:

unique_chars = lambda x: ', '.join(x.unique())
(df
 .groupby(['A'])
 .agg({'C': unique_chars}))
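
The effect of taking unique values only shows up when a group contains repeats, which the question's data does not. Here is a small sketch on a made-up frame with duplicated strings (the values are illustrative only):

```python
import pandas as pd

# Illustrative data: group 1 repeats "x", group 2 repeats "z".
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2],
    "C": ["x", "y", "x", "z", "z"],
})

def unique_chars(values):
    # Series.unique() keeps the order of first appearance.
    return ", ".join(values.unique())

result = df.groupby(["A"]).agg({"C": unique_chars})
print(result)
```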
