How do I get the top n records from each category in a Python dataframe?

Question

The following dataframe is sorted in descending order on column 'id':

id   Name     version     copies   price
6    MSFT       10.0        5       100   
6    TSLA       10.0        10      200
6    ORCL       10.0        15      300

5    MSFT       10.0        20      400
5    TSLA       10.0        25      500
5    ORCL       10.0        30      600

4    MSFT       10.0        35      700
4    TSLA       10.0        40      800
4    ORCL       10.0        45      900

3    MSFT       5.0         50      1000 
3    TSLA       5.0         55      1100
3    ORCL       5.0         60      1200

2    MSFT       5.0         65      1300
2    TSLA       5.0         70      1400
2    ORCL       5.0         75      1500

1    MSFT       15.0        80      1600
1    TSLA       15.0        85      1700
1    ORCL       15.0        90      1800
...

Based on an input 'n', I would like to filter the data above so that, for n = 2, the resulting dataframe looks like this:

Name     version     copies   price
MSFT       10.0        5       100   
TSLA       10.0        10      200
ORCL       10.0        15      300

MSFT       10.0        20      400
TSLA       10.0        25      500
ORCL       10.0        30      600

MSFT       5.0         50      1000 
TSLA       5.0         55      1100
ORCL       5.0         60      1200

MSFT       5.0         65      1300
TSLA       5.0         70      1400
ORCL       5.0         75      1500

MSFT       15.0        80      1600
TSLA       15.0        85      1700
ORCL       15.0        90      1800

Basically, only the top 'n' groups of 'id' for each version should be present in the resulting dataframe. If a version has fewer than n id groups (e.g. version 15.0 has only one group, with id = 1), then all of its id groups should be present.

I tried using groupby and head, but it didn't work for me. I have no other idea how to get this to work.

I would really appreciate any help with this, thank you.

Answer 1

You can use groupby.transform on the column version and factorize the column id, which gives each distinct id an incremental value (from 0 to ...) within its group. Then compare that value to your n and use loc with the resulting mask to select the wanted rows.

n = 2
# number the distinct ids within each version (0, 1, ...) and keep those below n
mask = df.groupby('version')['id'].transform(lambda x: pd.factorize(x)[0]) < n
print(df.loc[mask])
    id  Name  version  copies  price
0    6  MSFT     10.0       5    100
1    6  TSLA     10.0      10    200
2    6  ORCL     10.0      15    300
3    5  MSFT     10.0      20    400
4    5  TSLA     10.0      25    500
5    5  ORCL     10.0      30    600
9    3  MSFT      5.0      50   1000
10   3  TSLA      5.0      55   1100
11   3  ORCL      5.0      60   1200
12   2  MSFT      5.0      65   1300
13   2  TSLA      5.0      70   1400
14   2  ORCL      5.0      75   1500
15   1  MSFT     15.0      80   1600
16   1  TSLA     15.0      85   1700
17   1  ORCL     15.0      90   1800
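
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same factorize approach. The way the sample frame is rebuilt (the loop and the `versions` mapping) and the intermediate name `group_number` are my own assumptions for illustration, not part of the original post.

```python
import pandas as pd

# Rebuild the question's sample data: ids 6..1, three names per id.
# (Construction is illustrative; any frame with these columns would do.)
versions = {6: 10.0, 5: 10.0, 4: 10.0, 3: 5.0, 2: 5.0, 1: 15.0}
names = ["MSFT", "TSLA", "ORCL"]
rows = []
for i, id_ in enumerate([6, 5, 4, 3, 2, 1]):
    for j, name in enumerate(names):
        k = i * 3 + j
        rows.append({"id": id_, "Name": name, "version": versions[id_],
                     "copies": 5 * (k + 1), "price": 100 * (k + 1)})
df = pd.DataFrame(rows)

n = 2
# pd.factorize labels distinct values in order of first appearance, so
# within each version the first id seen gets 0, the next one 1, and so on.
group_number = df.groupby("version")["id"].transform(
    lambda x: pd.factorize(x)[0]
)
# Keep only rows whose id is among the first n distinct ids of its version.
result = df.loc[group_number < n]
print(result)
```

Because the labels follow the order of appearance, this relies on the frame already being sorted by id in descending order, as stated in the question.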

Another option is to use drop_duplicates to keep the unique version-id pairs, then groupby.head to take the first n pairs per version. Then use the selected version-id pairs in a merge.

df.merge(df[['version','id']].drop_duplicates().groupby('version').head(n))
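
The merge variant can be sketched step by step as follows. Again, the sample frame construction and the names `keys` and `merged` are assumptions made for illustration only.

```python
import pandas as pd

# Rebuild the question's sample data: ids 6..1, three names per id.
# (Construction is illustrative; any frame with these columns would do.)
versions = {6: 10.0, 5: 10.0, 4: 10.0, 3: 5.0, 2: 5.0, 1: 15.0}
names = ["MSFT", "TSLA", "ORCL"]
rows = []
for i, id_ in enumerate([6, 5, 4, 3, 2, 1]):
    for j, name in enumerate(names):
        k = i * 3 + j
        rows.append({"id": id_, "Name": name, "version": versions[id_],
                     "copies": 5 * (k + 1), "price": 100 * (k + 1)})
df = pd.DataFrame(rows)

n = 2
# One row per distinct (version, id) pair, then the first n pairs per version.
keys = df[["version", "id"]].drop_duplicates().groupby("version").head(n)
# Default merge is an inner join on the shared columns (version, id),
# so only rows matching one of the selected pairs survive.
merged = df.merge(keys)
print(merged)
```

One difference from the mask approach is that merge returns a fresh 0..14 index instead of keeping the original row labels, which may or may not matter for your use case.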

Answer 1: score 2, accepted, 2021-07-23 19:59:31
