Efficient way to repeatedly extract groups of rows in pandas

I have a pandas DataFrame that looks, in essence, like the following:

Group   Date    Value   etc.
1       01/01   10
1       05/01   10
1       08/01   5
1       15/01   5
1       18/01   2
1       21/01   10
...
2       02/01   3
2       15/01   4
2       25/01   1
...
3       01/01   6
....

I would like to extract each Group into a separate pandas DataFrame containing all the rows in that group (e.g. into a dictionary with keys 1, 2, 3, etc.). The obvious way to do this is to loop through the groups and take a slice for each one (like df[df.Group == 1]).

However, with a fairly large data set (700k rows in 30k groups), the slice technique is quite slow, because all 700k rows must be scanned for each of the 30k groups.
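For reference, here is a minimal sketch of the slice-per-group loop described above. The small DataFrame and the name frames_by_mask are made up for illustration; the point is only that every comparison df["Group"] == g touches the whole column.

```python
import pandas as pd

# Toy data in the same shape as the question (values are illustrative).
df = pd.DataFrame({
    "Group": [1, 1, 1, 2, 2, 3],
    "Date": ["01/01", "05/01", "08/01", "02/01", "15/01", "01/01"],
    "Value": [10, 10, 5, 3, 4, 6],
})

# One boolean mask per group: each comparison scans the entire Group column,
# so the total work grows roughly as (number of rows) x (number of groups).
frames_by_mask = {g: df[df["Group"] == g] for g in df["Group"].unique()}

print(frames_by_mask[2])
```

With 700k rows and 30k groups, that is 30k full-column comparisons, which is where the time goes.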

Any suggestions for a faster method, where each of the 700k rows only has to be accessed once to perform the grouping? Thanks!

I don't know why you'd want a separate DataFrame for each group. I would just call groupby on the 'Group' column and then either use the groups attribute to index back into the original df, or use get_group:

In [79]:
groups = df.groupby('Group')
groups.groups

Out[79]:
{1: [0, 1, 2, 3, 4, 5], 2: [6, 7, 8], 3: [9]}

In [81]:    
groups.get_group(1)

Out[81]:
   Group   Date  Value
0      1  01/01     10
1      1  05/01     10
2      1  08/01      5
3      1  15/01      5
4      1  18/01      2
5      1  21/01     10

In [82]:    
df.loc[groups.groups[1]]

Out[82]:
   Group   Date  Value
0      1  01/01     10
1      1  05/01     10
2      1  08/01      5
3      1  15/01      5
4      1  18/01      2
5      1  21/01     10
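If a dictionary of per-group DataFrames really is what you need, one possible way to build it is to iterate over the GroupBy object, which yields (key, sub-DataFrame) pairs. This is only a sketch on made-up data, and the name frames is just for illustration:

```python
import pandas as pd

# Toy data in the same shape as the question (values are illustrative).
df = pd.DataFrame({
    "Group": [1, 1, 1, 2, 2, 3],
    "Date": ["01/01", "05/01", "08/01", "02/01", "15/01", "01/01"],
    "Value": [10, 10, 5, 3, 4, 6],
})

# The grouping is computed once; iterating then hands back each group's rows
# together with its key, so no per-group boolean mask is needed.
frames = {key: sub for key, sub in df.groupby("Group")}

print(frames[1])
```

Each value keeps the original row index, so frames[1] here has the same rows as groups.get_group(1) would on this data.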

You could use groupby on the Group column. This gives you all the groups, and you can then process each group with a function -

df.groupby('Group').<apply function here>

For example -

In [13]: df
Out[13]: 
    Group   Date  Value
0       1  01/01     10
1       1  05/01     10
2       1  08/01      5
3       1  15/01      5
4       1  18/01      2
5       1  21/01     10
6       2  15/01      5
7       2  18/01      2
8       1  21/01     10
9       1  15/01      5
10      5  18/01      2
11      5  21/01     10

In [14]: df.groupby('Group').groups
Out[14]: {1: [0, 1, 2, 3, 4, 5, 8, 9], 2: [6, 7], 5: [10, 11]}

In [15]: grp = df.groupby('Group')

This gets you group 1:

In [16]: grp.get_group(1)
Out[16]: 
   Group   Date  Value
0      1  01/01     10
1      1  05/01     10
2      1  08/01      5
3      1  15/01      5
4      1  18/01      2
5      1  21/01     10
8      1  21/01     10
9      1  15/01      5
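As one possible way to fill in the "apply function here" placeholder mentioned earlier, the sketch below runs a built-in aggregation and a custom function per group. The data, the choice of the Value column, and the names totals and spread are assumptions made for illustration only.

```python
import pandas as pd

# Toy data in the same shape as the question (values are illustrative).
df = pd.DataFrame({
    "Group": [1, 1, 1, 2, 2, 3],
    "Date": ["01/01", "05/01", "08/01", "02/01", "15/01", "01/01"],
    "Value": [10, 10, 5, 3, 4, 6],
})

grp = df.groupby("Group")

# Built-in aggregation: sum of Value within each group.
totals = grp["Value"].sum()

# Custom function: applied once per group to that group's Value column.
spread = grp["Value"].apply(lambda s: s.max() - s.min())

print(totals)
print(spread)
```

Both results are Series indexed by the group key, so nothing has to be split into separate DataFrames first if a per-group summary is all you are after.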

The documentation here will help you further - http://pandas.pydata.org/pandas-docs/dev/groupby.html
