Retrieving specific number of rows in group by pandas
I have this DataFrame.
import pandas as pd
df = pd.DataFrame({'userId': [10,20,10,20,10,20,60,90,60,90,60,90,30,40,30,40,30,40,50,60,50,60,50,60],
'movieId': [500,500,800,800,700,700,1100,1100,1900,1900,2000,2000,1600,1600,1901,1901,3000,3000,3025,3025,4000,4000,500,500],
'ratings': [3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5]})
df
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
4 10 700 4.0
5 20 700 1.5
6 60 1100 3.5
7 90 1100 4.5
8 60 1900 3.5
9 90 1900 4.5
10 60 2000 2.0
11 90 2000 5.0
12 30 1600 4.0
13 40 1600 1.5
14 30 1901 3.5
15 40 1901 4.5
16 30 3000 3.5
17 40 3000 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
22 50 500 3.5
23 60 500 4.5
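The userId values occur in pairs, e.g. [(10,20),(60,90),(30,40),(50,60)]. Note that userId = 60 appears in two pairs. I want to retrieve only a fixed number of rows for each pair, e.g. the first 4 entries.
**Expected Outcome**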
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
6 60 1100 3.5
7 90 1100 4.5
8 60 1900 3.5
9 90 1900 4.5
12 30 1600 4.0
13 40 1600 1.5
14 30 1901 3.5
15 40 1901 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
You can use Series.map to map each row to the tuple of userId values in its movieId group, then call GroupBy.head:
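# Map each movieId to the tuple of userIds that rated it
s = df['movieId'].map(df.groupby('movieId')['userId'].apply(tuple))
# Keep at most the first 6 rows of each tuple group
df = df.groupby(s).head(6)
print(df)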
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
4 10 700 4.0
5 20 700 1.5
8 30 1900 3.5
9 40 1900 4.5
10 30 2000 2.0
11 40 2000 5.0
12 30 1600 4.0
13 40 1600 1.5
16 50 3000 3.5
17 60 3000 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
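To make the grouping key concrete, here is a minimal sketch on a small hypothetical frame (the `toy` data and the names `pairs_by_movie`, `pair_key`, and `first_four` are illustrative assumptions, not part of the question). It shows how the tuple built per movieId is broadcast back onto the rows and then used as the group key.

```python
import pandas as pd

# Hypothetical toy data: two user pairs, each rating three movies.
toy = pd.DataFrame({
    "userId":  [1, 2, 1, 2, 1, 2, 3, 4, 3, 4, 3, 4],
    "movieId": [10, 10, 11, 11, 12, 12, 20, 20, 21, 21, 22, 22],
})

# One tuple of userIds per movieId, e.g. movieId 10 -> (1, 2).
pairs_by_movie = toy.groupby("movieId")["userId"].apply(tuple)

# Broadcast the tuple back onto every row so it can serve as a group key.
pair_key = toy["movieId"].map(pairs_by_movie)

# Keep the first 4 rows of each pair; the original row order is preserved.
first_four = toy.groupby(pair_key).head(4)
kept_index = first_four.index.tolist()
print(first_four)
```

Note that in the question's DataFrame movieId 500 is rated by both the (10, 20) and the (50, 60) pair, so grouping by movieId alone yields the tuple (10, 20, 50, 60) for those rows rather than one tuple per pair.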
Edit:
If you need to filter by consecutive movieId values:
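# Label each run of consecutive equal movieId values
tmp = df['movieId'].ne(df['movieId'].shift()).cumsum()
# Map each run to the tuple of userIds in it
s = tmp.map(df.groupby(tmp)['userId'].apply(tuple))
# Keep the first 4 rows of each tuple group
df = df.groupby(s).head(4)
print(df)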
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
6 60 1100 3.5
7 90 1100 4.5
8 60 1900 3.5
9 90 1900 4.5
12 30 1600 4.0
13 40 1600 1.5
14 30 1901 3.5
15 40 1901 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
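The run-labelling step can be inspected on its own. The sketch below rebuilds only the userId and movieId columns of the question's DataFrame (ratings are omitted, and the names `user_pairs`, `movies`, `run_id`, and `pair_key` are assumptions made for illustration) and shows that the two separate blocks of movieId 500 receive different run labels.

```python
import pandas as pd

# Rebuild the question's userId/movieId columns: each pair rates three movies.
user_pairs = [(10, 20), (60, 90), (30, 40), (50, 60)]
movies = [[500, 800, 700], [1100, 1900, 2000], [1600, 1901, 3000], [3025, 4000, 500]]
rows = [
    {"userId": user, "movieId": movie}
    for pair, pair_movies in zip(user_pairs, movies)
    for movie in pair_movies
    for user in pair
]
df = pd.DataFrame(rows)

# A new run starts wherever movieId differs from the previous row;
# the cumulative sum of those change flags numbers the runs 1, 2, 3, ...
run_id = df["movieId"].ne(df["movieId"].shift()).cumsum()

# movieId 500 occurs in two separate runs, so it gets two different labels.
runs_for_500 = run_id[df["movieId"] == 500].unique().tolist()

# Tuple of userIds per run, broadcast back to the rows and used as group key.
pair_key = run_id.map(df.groupby(run_id)["userId"].apply(tuple))
result = df.groupby(pair_key).head(4)
print(result.index.tolist())
```

Because each run contains exactly one pair here, every row ends up keyed by its own pair, and head(4) keeps the first four rows of each pair.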
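Edit:
Would it be better to select the first 4 rows and then exclude the next 2? That would do the job. Any suggestions? I mean select 4, drop the next 2, select another 4, drop the next 2, and so on.
You can take the index values modulo 6, then filter on that condition with boolean indexing: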
# Assumes a default RangeIndex; otherwise reset it first:
# df = df.reset_index(drop=True)
df = df[df.index % 6 < 4]
print(df)
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
6 60 1100 3.5
7 90 1100 4.5
8 60 1900 3.5
9 90 1900 4.5
12 30 1600 4.0
13 40 1600 1.5
14 30 1901 3.5
15 40 1901 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
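If the index is not a default RangeIndex and you would rather keep its labels than reset it, one possible variant is to build the mask from row positions instead of index values. This is a sketch under that assumption; the frame and the names `block_size`, `keep`, and `positions` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-default index (labels 100, 101, ...).
frame = pd.DataFrame({"value": range(12)}, index=range(100, 112))

block_size = 6  # rows per block
keep = 4        # rows to keep at the start of each block

# Row positions 0..n-1, independent of the index labels.
positions = np.arange(len(frame))
mask = positions % block_size < keep

# Boolean indexing keeps positions 0-3 and 6-9 and preserves the labels.
kept = frame[mask]
print(kept)
```

With a default RangeIndex the positions equal the index values, so this gives the same rows as the modulo filter on df.index.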