
Retrieving a specific number of rows per group in pandas

I have this dataframe.

import pandas as pd

df = pd.DataFrame({'userId': [10,20,10,20,10,20,60,90,60,90,60,90,30,40,30,40,30,40,50,60,50,60,50,60],
                   'movieId': [500,500,800,800,700,700,1100,1100,1900,1900,2000,2000,1600,1600,1901,1901,3000,3000,3025,3025,4000,4000,500,500],  
                   'ratings': [3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5]})



df
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
4       10      700      4.0
5       20      700      1.5
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
10      60     2000      2.0
11      90     2000      5.0
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
16      30     3000      3.5
17      40     3000      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
22      50      500      3.5
23      60      500      4.5
  1. In this dataframe, each pair of users has movies in common.
  2. For illustration, the userId values can be read as pairs, e.g. [(10,20),(60,90),(30,40),(50,60)].
  3. Every one of these pairs has common movies, and a new pair starts after every 6 rows.
  4. A user can appear in more than one pair; in this dataframe, userId = 60 appears in two.
  5. I want to pick, for example, the first 4 rows from each pair.
**Expected Outcome**

    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0

6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5

12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5

18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5


You can map each movieId to the tuple of userId values that rated it with Series.map, then group by that tuple and call GroupBy.head:

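# tuple of userId values per movieId, mapped back onto every row
s = df['movieId'].map(df.groupby('movieId')['userId'].apply(tuple))

# keep the first 6 rows of each tuple group
df = df.groupby(s).head(6)
print(df)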
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
4       10      700      4.0
5       20      700      1.5
8       30     1900      3.5
9       40     1900      4.5
10      30     2000      2.0
11      40     2000      5.0
12      30     1600      4.0
13      40     1600      1.5
16      50     3000      3.5
17      60     3000      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
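
The printed result above has different userId values in some rows than the frame in the question (row 8, for example), so it was presumably produced from another version of the data. On the frame as posted, movieId 500 is rated by both the (10, 20) pair and the (50, 60) pair, so the key for those rows becomes a four-element tuple and they end up in a group of their own. A minimal sketch for inspecting the key; the names `users_per_movie` and `key` are mine, and the ratings column is left out because it plays no role here:

```python
import pandas as pd

df = pd.DataFrame({
    'userId': [10, 20, 10, 20, 10, 20, 60, 90, 60, 90, 60, 90,
               30, 40, 30, 40, 30, 40, 50, 60, 50, 60, 50, 60],
    'movieId': [500, 500, 800, 800, 700, 700, 1100, 1100, 1900, 1900, 2000, 2000,
                1600, 1600, 1901, 1901, 3000, 3000, 3025, 3025, 4000, 4000, 500, 500],
})

# One tuple of userId values per movieId, in row order.
users_per_movie = df.groupby('movieId')['userId'].apply(tuple)

# Map that tuple back onto every row; this is the grouping key.
key = df['movieId'].map(users_per_movie)

# movieId 800 belongs to one pair, movieId 500 to two pairs.
print(users_per_movie.loc[800])
print(users_per_movie.loc[500])

# Number of rows that fall into each key.
group_sizes = key.value_counts()
print(group_sizes)

# With this key, head(4) does not give 4 rows per pair.
first_four = df.groupby(key).head(4)
print(first_four.index.tolist())
```

Because rows 0, 1, 22 and 23 share one key, grouping by movieId alone does not reproduce the expected outcome on this data, which is what the consecutive-run variant in the EDIT addresses.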

EDIT:

If you need to filter by consecutive movieId values:

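# label each run of consecutive equal movieId values
tmp = df['movieId'].ne(df['movieId'].shift()).cumsum()
# tuple of userId values per run, mapped back onto every row
s = tmp.map(df.groupby(tmp)['userId'].apply(tuple))
# keep the first 4 rows of each tuple group
df = df.groupby(s).head(4)
print(df)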
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
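
The `ne(shift()).cumsum()` step is what separates the two occurrences of movieId 500. A small sketch of that idiom on a toy Series (the names `movie`, `starts` and `run_id` are illustrative, not from the answer):

```python
import pandas as pd

# 500 appears in two separate runs, as in the question's data.
movie = pd.Series([500, 500, 800, 800, 500, 500], name='movieId')

# True wherever the value differs from the previous row.
# The first row is always True because shift() puts NaN there.
starts = movie.ne(movie.shift())

# Counting the run starts gives each consecutive run its own label.
run_id = starts.cumsum()

print(run_id.tolist())  # [1, 1, 2, 2, 3, 3]
```

Grouping by `run_id` instead of the raw movieId keeps the second run of 500 apart from the first, so each run only ever sees the two users of its own pair.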

EDIT:

Would it be better to drop 2 rows after each 4 picked? That would do the job. Any suggestions? I mean: pick 4 rows, drop the next 2, pick another 4, drop the next 2, and so on.

You can take the index values modulo 6, then filter on that condition with boolean indexing:

# requires the default RangeIndex; if the index differs, reset it first:
# df = df.reset_index(drop=True)
df = df[df.index % 6 < 4]
print(df)
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
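
If resetting the index is not desirable, the same filter can be built from row positions instead of index labels. This is a sketch of a swapped-in variant, not the answer's code. It assumes, as in the question, that every pair occupies exactly 6 consecutive rows; the frame `toy` and the names `pos` and `kept` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame with a non-default index; 12 rows = two blocks of 6.
toy = pd.DataFrame({'value': range(12)}, index=list('abcdefghijkl'))

# Row positions 0..n-1, independent of the index labels.
pos = np.arange(len(toy))

# Keep positions 0-3 of each block of 6, drop positions 4-5.
kept = toy[pos % 6 < 4]

print(kept['value'].tolist())  # [0, 1, 2, 3, 6, 7, 8, 9]
```

The original index labels survive the filter, so the result can still be aligned with the source frame.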
