Get top 5 matching rows from pandas dataframe for many criteria?

I have a dataframe containing many rows of the following form.

> all_rel = pandas.read_csv('../data/sv_abundances.csv')
> all_rel.head()
    name                    day sample  count   tax_id  rel
0   seq00000079;size=189384 204 37      1060    CYCL    0.122275
1   seq00000102;size=143633 204 37      639     SPLEN   0.073711
2   seq00000123;size=118889 204 37      813     723171  0.093782
3   seq00000326;size=50743  204 13      470     553239  0.097571
4   seq00000332;size=49099  204 13      468     TAS     0.097156

My goal is to get the top 5 rows sorted by the rel column for each unique combination of day, tax_id, and sample. I have the unique combinations in a dataframe:

#get combinations of days, tax_ids, and samples present in dataset
> t = all_rel.drop_duplicates(['day', 'tax_id', 'sample'])[['day', 'tax_id', 'sample']]
> t.head()

   day  tax_id  sample
0  204    CYCL      37
1  204   SPLEN      37
2  204  723171      37
3  204  553239      13
4  204     TAS      13

The only way I know to accomplish the goal is to use a for loop to iterate over the unique combinations and build up a dataframe.

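# slow approach: filter, sort, and take the top 5 rows one combination at a time
top_rows = []
for (day, tax_id, sample) in t.values:
    match = all_rel[(all_rel['tax_id']==tax_id) & (all_rel['day']==day) & (all_rel['sample']==sample)]
    top_5 = match.sort_values('rel', ascending=False).head(5)
    top_rows.append(top_5)
# concatenate once after the loop instead of growing a dataframe row block by row block
hacky_df = pandas.concat(top_rows)
hacky_df.head()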

But this takes a long time (it still hasn't finished) and doesn't take advantage of the fact that these are numpy arrays under the hood. Is there a way to accomplish my goal with a pandas.df.apply call instead of using a for loop?

The following code gave the intended results:

top_5_df = all_rel.sort_values('rel', ascending=False).groupby(['day', 'tax_id', 'sample']).head(5).sort_values(['day', 'sample', 'tax_id'])
print(top_5_df.head(20))
                        name  day  sample  count  tax_id       rel
136     seq00025622;size=605  204      13     28  188144  0.005813
2596      seq07169587;size=2  204      13      2  188144  0.000415
2438      seq05675680;size=2  204      13      2  188144  0.000415
2419      seq05517001;size=2  204      13      2  188144  0.000415
2123      seq03049127;size=3  204      13      1  188144  0.000208
4448      seq42562010;size=1  204      13      1   28173  0.000208
60     seq00008910;size=1787  204      13     15  335972  0.003114
1074     seq00182900;size=72  204      13      2  335972  0.000415
2151      seq03232487;size=3  204      13      1  335972  0.000208
3302      seq20519515;size=1  204      13      1  335972  0.000208
2451      seq05760125;size=2  204      13      1  335972  0.000208
750     seq00099976;size=139  204      13     23  428643  0.004775
2546      seq06674971;size=2  204      13      2  428643  0.000415
2207      seq03714229;size=3  204      13      1  428643  0.000208
3234      seq19173942;size=1  204      13      1  428643  0.000208
3201      seq18402810;size=1  204      13      1  428643  0.000208
3     seq00000326;size=50743  204      13    470  553239  0.097571
531     seq00066543;size=216  204      13     45  553239  0.009342
72     seq00010509;size=1528  204      13     17  553239  0.003529
117     seq00021191;size=745  204      13     11  553239  0.002284

Because the rows are sorted by rel in descending order first, the first rows of each group are the ones with the largest rel. df.groupby().head() will call head() on each group independently and return a dataframe of the resulting rows.
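The sort-then-group pattern can be tried on a small frame. The following is a minimal sketch: the data is made up for illustration, and the helper name top_n_per_group is hypothetical, not part of pandas. Only the column names mirror the question.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the question, values are invented.
df = pd.DataFrame({
    "day":    [204, 204, 204, 204, 204, 204, 204],
    "sample": [13, 13, 13, 13, 37, 37, 37],
    "tax_id": ["A", "A", "A", "B", "A", "A", "A"],
    "rel":    [0.10, 0.30, 0.20, 0.05, 0.50, 0.40, 0.60],
})

def top_n_per_group(frame, keys, by, n):
    """Return the n rows with the largest `by` value within each group of `keys`."""
    ordered = frame.sort_values(by, ascending=False)
    # head(n) on a GroupBy keeps the first n rows of every group, in the
    # order of `ordered`, and returns an ordinary DataFrame.
    return ordered.groupby(keys).head(n)

top_2 = top_n_per_group(df, ["day", "tax_id", "sample"], "rel", 2)

# Re-sort for display; `rel` is included explicitly so the within-group
# order does not depend on the stability of the sort.
top_2 = top_2.sort_values(
    ["day", "sample", "tax_id", "rel"],
    ascending=[True, True, True, False],
)
print(top_2)
```

With n=2, the group (204, A, 13) has three rows and loses its smallest one, while the single-row group (204, B, 13) is returned whole, so groups with fewer than n rows need no special handling.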

Here are the docs: http://pandas.pydata.org/pandas-docs/stable/groupby.html#filtration
