
Grouping data in a dataframe to produce lists against unique ids in Pandas/Python

Hi, I am using pandas/Python and have a dataframe along the following lines:

21627   red
21627   green
21627   red
21627   blue
21627   purple
21628   yellow
21628   red
21628   green
21629   red
21629   red

Which I want to reduce to:

21627   red, green, blue, purple
21628   yellow, red, green
21629   red

What's the best way of doing this (and collapsing all values in lists to unique values)?

Also, if I wanted to keep the redundancy:

21627   red, green, red, blue, purple
21628   yellow, red, green
21629   red, red

What's the best way of achieving this?

Thanks in advance for any help.

If you really wanted to do this you could use a groupby apply:

In [11]: df.groupby('id').apply(lambda x: list(set(x['colours'])))
Out[11]: 
id
21627    [blue, purple, green, red]
21628          [green, red, yellow]
21629                         [red]
dtype: object

In [12]: df.groupby('id').apply(lambda x: list(x['colours']))
Out[12]: 
id
21627    [red, green, red, blue, purple]
21628               [yellow, red, green]
21629                         [red, red]
dtype: object
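
Note that `set` does not keep the order in which the colours first appear. A minimal sketch of an order-preserving variant follows, assuming the columns are named `id` and `colours` as in the snippets above and rebuilding the sample frame from the question (the variable names are only illustrative):

```python
import pandas as pd

# Sample data reconstructed from the question; the column names
# 'id' and 'colours' are an assumption taken from the answer's code.
df = pd.DataFrame({
    "id": [21627] * 5 + [21628] * 3 + [21629] * 2,
    "colours": ["red", "green", "red", "blue", "purple",
                "yellow", "red", "green", "red", "red"],
})

# Unique colours per id, in order of first appearance:
# drop repeated (id, colour) rows first, then collect each group into a list.
unique_lists = df.drop_duplicates().groupby("id")["colours"].apply(list)

# All colours per id, duplicates kept.
full_lists = df.groupby("id")["colours"].apply(list)

# Comma-separated strings, matching the layout asked for in the question.
unique_strings = unique_lists.map(", ".join)
full_strings = full_lists.map(", ".join)

print(unique_strings)
print(full_strings)
```

Joining to strings is only worthwhile for display; for further processing the list (or the count table further down) is probably easier to work with.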

However, DataFrames containing lists are not particularly efficient.

Pivot table gets you a more useful DataFrame:

In [21]: df.pivot_table(index='id', columns='colours', aggfunc='size', fill_value=0)
Out[21]: 
colours  blue  green  purple  red  yellow
id                                       
21627       1      1       1    2       0
21628       0      1       0    1       1
21629       0      0       0    2       0
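
If all you need is this count table, `pd.crosstab` is another option. The following is a sketch on the question's sample data, again assuming the column names `id` and `colours`; the `presence` table is an illustrative extra, not part of the original answer:

```python
import pandas as pd

# Sample data reconstructed from the question; column names are assumed.
df = pd.DataFrame({
    "id": [21627] * 5 + [21628] * 3 + [21629] * 2,
    "colours": ["red", "green", "red", "blue", "purple",
                "yellow", "red", "green", "red", "red"],
})

# Cross-tabulate ids against colours: each cell is the number of rows
# with that (id, colour) pair, and missing pairs come out as 0.
counts = pd.crosstab(df["id"], df["colours"])

# Boolean "does this id have this colour at all" table,
# i.e. the unique-values view of the same data.
presence = counts > 0

print(counts)
```

The redundant version of the question corresponds to `counts`, the collapsed version to `presence`.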

My favourite function get_dummies lets you do it but not as elegantly or efficiently (but I'll keep this original, if crazy, suggestion):

In [22]: pd.get_dummies(df.set_index('id')['colours']).reset_index().groupby('id').sum()
Out[22]: 
       blue  green  purple  red  yellow
id                                     
21627     1      1       1    2       0
21628     0      1       0    1       1
21629     0      0       0    2       0

Here's another way, though @Andy's is a bit more intuitive:

In [24]: df.groupby('id').apply(
             lambda x: x['colours'].value_counts()).unstack().fillna(0)
Out[24]: 
       blue  green  purple  red  yellow
id                                     
21627     1      1       1    2       0
21628     0      1       0    1       1
21629     0      0       0    2       0
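
The same table can probably be had without `apply`, by grouping on both columns and counting group sizes. A minimal sketch, assuming the column names `id` and `colours` and the sample data from the question:

```python
import pandas as pd

# Sample data reconstructed from the question; column names are assumed.
df = pd.DataFrame({
    "id": [21627] * 5 + [21628] * 3 + [21629] * 2,
    "colours": ["red", "green", "red", "blue", "purple",
                "yellow", "red", "green", "red", "red"],
})

# Count rows per (id, colour) pair, then pivot the colour level into
# columns. fill_value=0 fills absent pairs directly, so the counts stay
# integers instead of becoming floats via NaN and fillna.
counts = df.groupby(["id", "colours"]).size().unstack(fill_value=0)

print(counts)
```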
