How to get top n records from each category in a Python dataframe?
The following dataframe is sorted in descending order on column 'id':
id Name version copies price
6 MSFT 10.0 5 100
6 TSLA 10.0 10 200
6 ORCL 10.0 15 300
5 MSFT 10.0 20 400
5 TSLA 10.0 25 500
5 ORCL 10.0 30 600
4 MSFT 10.0 35 700
4 TSLA 10.0 40 800
4 ORCL 10.0 45 900
3 MSFT 5.0 50 1000
3 TSLA 5.0 55 1100
3 ORCL 5.0 60 1200
2 MSFT 5.0 65 1300
2 TSLA 5.0 70 1400
2 ORCL 5.0 75 1500
1 MSFT 15.0 80 1600
1 TSLA 15.0 85 1700
1 ORCL 15.0 90 1800
...
Based on an input 'n', I would like to filter the above data so that, if the input is '2', the resulting dataframe looks like this:
Name version copies price
MSFT 10.0 5 100
TSLA 10.0 10 200
ORCL 10.0 15 300
MSFT 10.0 20 400
TSLA 10.0 25 500
ORCL 10.0 30 600
MSFT 5.0 50 1000
TSLA 5.0 55 1100
ORCL 5.0 60 1200
MSFT 5.0 65 1300
TSLA 5.0 70 1400
ORCL 5.0 75 1500
MSFT 15.0 80 1600
TSLA 15.0 85 1700
ORCL 15.0 90 1800
Basically, only the top 'n' groups of 'id' for a specific version should be present in the resulting dataframe. If a version has fewer than n id groups (e.g. version 15.0 has only one group, with id = 1), then all of its id groups should be present.
I tried using groupby and head, but it didn't work for me. I have no other clue how to get this to work.
I would really appreciate any help with this, thank you.
You can use groupby.transform on the column version and factorize the column id, which gives each id an incremental value (from 0 to ...) within its group. Then compare that value to your n and use loc with this mask to select the wanted rows.
n = 2
print(df.loc[df.groupby('version')['id'].transform(lambda x: pd.factorize(x)[0])<n])
id Name version copies price
0 6 MSFT 10.0 5 100
1 6 TSLA 10.0 10 200
2 6 ORCL 10.0 15 300
3 5 MSFT 10.0 20 400
4 5 TSLA 10.0 25 500
5 5 ORCL 10.0 30 600
9 3 MSFT 5.0 50 1000
10 3 TSLA 5.0 55 1100
11 3 ORCL 5.0 60 1200
12 2 MSFT 5.0 65 1300
13 2 TSLA 5.0 70 1400
14 2 ORCL 5.0 75 1500
15 1 MSFT 15.0 80 1600
16 1 TSLA 15.0 85 1700
17 1 ORCL 15.0 90 1800
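For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the table in the question; the helper names (build_sample_frame, top_n_by_factorize, top_n_by_dense_rank) are made up for illustration. Note that pd.factorize numbers values in order of first appearance, so the first helper relies on the frame already being sorted by id in descending order. The second helper is a swapped-in alternative (a dense rank per group), not part of the answer above, which should not depend on row order.

```python
import pandas as pd


def build_sample_frame() -> pd.DataFrame:
    """Rebuild the 18-row frame from the question (hypothetical helper)."""
    version_of_id = {6: 10.0, 5: 10.0, 4: 10.0, 3: 5.0, 2: 5.0, 1: 15.0}
    rows = []
    copies = 5
    for id_, version in version_of_id.items():
        for name in ["MSFT", "TSLA", "ORCL"]:
            rows.append(
                {"id": id_, "Name": name, "version": version,
                 "copies": copies, "price": copies * 20}
            )
            copies += 5
    return pd.DataFrame(rows)


def top_n_by_factorize(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep the first n distinct ids per version, in order of appearance."""
    # factorize labels each distinct id 0, 1, 2, ... within its version group
    codes = df.groupby("version")["id"].transform(lambda s: pd.factorize(s)[0])
    return df.loc[codes < n]


def top_n_by_dense_rank(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Alternative sketch: keep the n largest ids per version via a dense rank."""
    # the largest id in each version gets rank 1, the next distinct id rank 2, ...
    ranks = df.groupby("version")["id"].rank(method="dense", ascending=False)
    return df.loc[ranks <= n]


df = build_sample_frame()
result = top_n_by_factorize(df, 2)
print(result)
```

On the sample data both helpers select the same rows. The dense-rank variant may be the safer choice if the input is not guaranteed to be sorted.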
Another option is to use groupby.head once you have applied drop_duplicates to keep the unique version-id pairs, then use the selected version-id pairs in a merge.
df.merge(df[['version','id']].drop_duplicates().groupby('version').head(n))
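The one-liner can be unpacked into its three steps, as in the sketch below. The sample frame is rebuilt from the question and the intermediate names (pairs, top_pairs, naive) are only illustrative. The last lines also show what calling groupby.head directly on the full frame does, which is presumably why the attempt in the question did not give the wanted result: head counts rows, not distinct ids.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question
copies = list(range(5, 95, 5))  # 5, 10, ..., 90
df = pd.DataFrame({
    "id": [6] * 3 + [5] * 3 + [4] * 3 + [3] * 3 + [2] * 3 + [1] * 3,
    "Name": ["MSFT", "TSLA", "ORCL"] * 6,
    "version": [10.0] * 9 + [5.0] * 6 + [15.0] * 3,
    "copies": copies,
    "price": [c * 20 for c in copies],
})

n = 2

# Step 1: one row per distinct (version, id) pair, in order of appearance
pairs = df[["version", "id"]].drop_duplicates()

# Step 2: the first n pairs of each version (relies on df being sorted by id, descending)
top_pairs = pairs.groupby("version").head(n)

# Step 3: inner merge keeps only the rows of df whose (version, id) pair was selected
result = df.merge(top_pairs, on=["version", "id"], how="inner")
print(result)

# For contrast: head on the full frame keeps the first n ROWS per version
naive = df.groupby("version").head(n)
print(naive)
```

Unlike the loc version, the merge returns a frame with a fresh 0-based index, so the original row labels are not preserved.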