Get top 5 matching rows from pandas dataframe for many criteria?
I have a dataframe containing many rows of the following form.
> import pandas
> all_rel = pandas.read_csv('../data/sv_abundances.csv')
> all_rel.head()
name day sample count tax_id rel
0 seq00000079;size=189384 204 37 1060 CYCL 0.122275
1 seq00000102;size=143633 204 37 639 SPLEN 0.073711
2 seq00000123;size=118889 204 37 813 723171 0.093782
3 seq00000326;size=50743 204 13 470 553239 0.097571
4 seq00000332;size=49099 204 13 468 TAS 0.097156
My goal is to get the top 5 rows sorted by the rel column for each unique combination of day, tax_id, and sample. I have the unique combinations in a dataframe:
#get combinations of days, tax_ids, and samples present in dataset
> t = all_rel.drop_duplicates(['day', 'tax_id', 'sample'])[['day', 'tax_id', 'sample']]
> t.head()
day tax_id sample
0 204 CYCL 37
1 204 SPLEN 37
2 204 723171 37
3 204 553239 13
4 204 TAS 13
The only way I know to accomplish the goal is to use a for loop to iterate over the unique combinations and build up a dataframe.
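# collect each group's top rows in a list, then concatenate once;
# DataFrame.append returned a new frame (it never modified hacky_df in place)
# and was removed in pandas 2.0
top_5_frames = []
for (day, tax_id, sample) in t.values:
    match = all_rel[(all_rel['tax_id']==tax_id) & (all_rel['day']==day) & (all_rel['sample']==sample)]
    # DataFrame.sort was replaced by sort_values
    top_5 = match.sort_values('rel', ascending=False).head()
    top_5_frames.append(top_5)
hacky_df = pandas.concat(top_5_frames)
hacky_df.head()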
But this takes a long time (still hasn't finished) and doesn't take advantage of the fact that these are numpy arrays under the hood. Is there a way to accomplish my goal with a pandas.df.apply call instead of using a for loop?
The following code gave the intended results:
# DataFrame.sort was removed in pandas 0.20; sort_values is its replacement
top_5_df = all_rel.sort_values('rel', ascending=False).groupby(['day', 'tax_id', 'sample']).head(5).sort_values(['day', 'sample', 'tax_id'])
print(top_5_df.head(20))
name day sample count tax_id rel
136 seq00025622;size=605 204 13 28 188144 0.005813
2596 seq07169587;size=2 204 13 2 188144 0.000415
2438 seq05675680;size=2 204 13 2 188144 0.000415
2419 seq05517001;size=2 204 13 2 188144 0.000415
2123 seq03049127;size=3 204 13 1 188144 0.000208
4448 seq42562010;size=1 204 13 1 28173 0.000208
60 seq00008910;size=1787 204 13 15 335972 0.003114
1074 seq00182900;size=72 204 13 2 335972 0.000415
2151 seq03232487;size=3 204 13 1 335972 0.000208
3302 seq20519515;size=1 204 13 1 335972 0.000208
2451 seq05760125;size=2 204 13 1 335972 0.000208
750 seq00099976;size=139 204 13 23 428643 0.004775
2546 seq06674971;size=2 204 13 2 428643 0.000415
2207 seq03714229;size=3 204 13 1 428643 0.000208
3234 seq19173942;size=1 204 13 1 428643 0.000208
3201 seq18402810;size=1 204 13 1 428643 0.000208
3 seq00000326;size=50743 204 13 470 553239 0.097571
531 seq00066543;size=216 204 13 45 553239 0.009342
72 seq00010509;size=1528 204 13 17 553239 0.003529
117 seq00021191;size=745 204 13 11 553239 0.002284
df.groupby().head() will call head() on each group independently and return a dataframe of the resulting rows.
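To see the per-group behaviour of head() in isolation, here is a minimal sketch on a small made-up frame. The toy values and the names toy and top_2 are hypothetical and not from the original dataset; only the column names mirror the question. It assumes a pandas version that has sort_values (0.17 or later).

```python
import pandas as pd

# Hypothetical toy data: two (day, tax_id, sample) groups of three rows each.
toy = pd.DataFrame({
    'day':    [204, 204, 204, 204, 204, 204],
    'sample': [13, 13, 13, 37, 37, 37],
    'tax_id': ['TAS', 'TAS', 'TAS', 'CYCL', 'CYCL', 'CYCL'],
    'rel':    [0.10, 0.30, 0.20, 0.05, 0.50, 0.25],
})

# Sort the whole frame once by rel (largest first), then keep the
# first 2 rows of every group; head(n) is applied per group.
top_2 = (toy.sort_values('rel', ascending=False)
            .groupby(['day', 'tax_id', 'sample'])
            .head(2))

print(top_2)
```

Because the frame is sorted before grouping, the first rows of each group are the ones with the largest rel, so no explicit loop over the unique combinations is needed.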
Here are the docs: http://pandas.pydata.org/pandas-docs/stable/groupby.html#filtration