Retrieving specific number of rows in group by pandas
I have this dataframe.
from pandas import DataFrame
import pandas as pd
df = pd.DataFrame({'userId': [10,20,10,20,10,20,60,90,60,90,60,90,30,40,30,40,30,40,50,60,50,60,50,60],
'movieId': [500,500,800,800,700,700,1100,1100,1900,1900,2000,2000,1600,1600,1901,1901,3000,3000,3025,3025,4000,4000,500,500],
'ratings': [3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5]})
df
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
4 10 700 4.0
5 20 700 1.5
6 60 1100 3.5
7 90 1100 4.5
8 60 1900 3.5
9 90 1900 4.5
10 60 2000 2.0
11 90 2000 5.0
12 30 1600 4.0
13 40 1600 1.5
14 30 1901 3.5
15 40 1901 4.5
16 30 3000 3.5
17 40 3000 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
22 50 500 3.5
23 60 500 4.5
For the purpose of understanding, userId can be taken as pairs, e.g. [(10,20),(60,90),(30,40),(50,60)]. A user can also appear in more than one pair; in this dataframe, for example, userId = 60 appears in two pairs. I want to select a fixed number of entries, e.g. the first 4, from each pair.
**Expected Outcome**
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
6 60 1100 3.5
7 90 1100 4.5
8 60 1900 3.5
9 90 1900 4.5
12 30 1600 4.0
13 40 1600 1.5
14 30 1901 3.5
15 40 1901 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
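You can build one tuple of userId values per movieId group, map it back onto the rows with Series.map, and then call GroupBy.head on that key. Note that the sample output below does not match the dataframe in the question (row 8, for instance, shows userId 30 rather than 60); it was presumably produced from an earlier version of the data.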
s = df['movieId'].map(df.groupby('movieId')['userId'].apply(tuple))
df = df.groupby(s).head(6)
print (df)
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
4 10 700 4.0
5 20 700 1.5
8 30 1900 3.5
9 40 1900 4.5
10 30 2000 2.0
11 40 2000 5.0
12 30 1600 4.0
13 40 1600 1.5
16 50 3000 3.5
17 60 3000 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
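To make the intermediate grouping key easier to see, here is a minimal sketch of the same Series.map plus GroupBy.head idea. The frame `small` and the names `pairs`, `key` and `first_rows` are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical 6-row frame, much smaller than the one in the question.
small = pd.DataFrame({
    "userId":  [10, 20, 10, 20, 60, 90],
    "movieId": [500, 500, 800, 800, 1100, 1100],
})

# One tuple of userIds per movieId, indexed by movieId.
pairs = small.groupby("movieId")["userId"].apply(tuple)

# Broadcast each tuple back onto its rows so it can serve as a grouping key.
key = small["movieId"].map(pairs)

# Keep at most the first 2 rows for each distinct pair of users.
first_rows = small.groupby(key).head(2)
print(first_rows)
```

Because `head` preserves the original row order and index, the result can be compared directly against the input rows.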
EDIT:
If it is necessary to filter by consecutive movieId values:
tmp = df['movieId'].ne(df['movieId'].shift()).cumsum()
s = tmp.map(df.groupby(tmp)['userId'].apply(tuple))
df = df.groupby(s).head(4)
print (df)
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
6 60 1100 3.5
7 90 1100 4.5
8 60 1900 3.5
9 90 1900 4.5
12 30 1600 4.0
13 40 1600 1.5
14 30 1901 3.5
15 40 1901 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
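The `ne(shift()).cumsum()` step is what separates repeated movieId values that are not adjacent, such as movieId 500, which occurs at rows 0-1 and again at rows 22-23 in the question. A small sketch of that labelling, using a made-up Series `movie_ids`:

```python
import pandas as pd

# Hypothetical ids: 500 occurs in two separate runs.
movie_ids = pd.Series([500, 500, 800, 800, 500, 500])

# True wherever the value differs from the previous row (always True for the first row).
starts_new_run = movie_ids.ne(movie_ids.shift())

# The running count of run starts gives each consecutive block its own label.
run_id = starts_new_run.cumsum()
print(run_id.tolist())
```

The two runs of 500 receive different labels, so grouping by `run_id` treats them as separate blocks, whereas grouping by movieId alone would merge them.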
EDIT:
Is it better to exclude every 2 rows after picking the first 4? It will do the job. Any suggestions? I mean it will pick 4 rows, then remove the next 2, then pick another 4 and remove the next 2, and so on.
You can take the index values modulo 6, then filter by that condition with boolean indexing:
#for default RangeIndex
#df = df.reset_index(drop=True)
df = df[df.index % 6 < 4]
print (df)
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
6 60 1100 3.5
7 90 1100 4.5
8 60 1900 3.5
9 90 1900 4.5
12 30 1600 4.0
13 40 1600 1.5
14 30 1901 3.5
15 40 1901 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
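If the index is not a default RangeIndex and you would rather not reset it, one alternative is to apply the same modulo test to row positions instead of index labels. This is a sketch under that assumption; `frame`, `pos` and `kept` are illustrative names, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-numeric index, so `frame.index % 6` would not work.
frame = pd.DataFrame({"value": range(12)}, index=list("abcdefghijkl"))

# Row positions 0..n-1, independent of the index labels.
pos = np.arange(len(frame))

# Keep positions 0-3 of every block of 6 and drop positions 4-5.
kept = frame[pos % 6 < 4]
print(kept)
```

This keeps the original index labels on the surviving rows.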