I have this dataframe.
import pandas as pd

df = pd.DataFrame({
    'userId': [10,20,10,20,10,20,60,90,60,90,60,90,30,40,30,40,30,40,50,60,50,60,50,60],
    'movieId': [500,500,800,800,700,700,1100,1100,1900,1900,2000,2000,1600,1600,1901,1901,3000,3000,3025,3025,4000,4000,500,500],
    'ratings': [3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5],
})
df
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
4 10 700 4.0
5 20 700 1.5
6 60 1100 3.5
7 90 1100 4.5
8 60 1900 3.5
9 90 1900 4.5
10 60 2000 2.0
11 90 2000 5.0
12 30 1600 4.0
13 40 1600 1.5
14 30 1901 3.5
15 40 1901 4.5
16 30 3000 3.5
17 40 3000 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
22 50 500 3.5
23 60 500 4.5
The userId values can be read as pairs, for example [(10,20), (60,90), (30,40), (50,60)]. Note that userId = 60 appears in two different pairs. I want to keep only the first 4 entries from each pair.
**Expected Outcome**
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
6 60 1100 3.5
7 90 1100 4.5
8 60 1900 3.5
9 90 1900 4.5
12 30 1600 4.0
13 40 1600 1.5
14 30 1901 3.5
15 40 1901 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
You can map each row to the tuple of userId values in its movieId group with Series.map, and then call GroupBy.head:
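# Tuple of userIds per movieId, mapped back onto every row
s = df['movieId'].map(df.groupby('movieId')['userId'].apply(tuple))
# Keep the first 6 rows per tuple (the question asks for 4)
df = df.groupby(s).head(6)
# The rows printed below do not match the sample frame above (e.g. row 8 has userId 30, not 60)
print(df)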
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
4 10 700 4.0
5 20 700 1.5
8 30 1900 3.5
9 40 1900 4.5
10 30 2000 2.0
11 40 2000 5.0
12 30 1600 4.0
13 40 1600 1.5
16 50 3000 3.5
17 60 3000 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
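To see what the intermediate Series s holds, here is a minimal sketch on a made-up frame (toy and pairs are illustrative names, not part of the original post). It also shows a caveat: the grouping by movieId is global, so a movieId that reappears later under a different pair of users ends up with one longer tuple.

```python
import pandas as pd

# Hypothetical toy data (not the frame from the question)
toy = pd.DataFrame({
    'userId':  [10, 20, 10, 20, 50, 60],
    'movieId': [500, 500, 800, 800, 500, 500],
})

# One tuple of userIds per movieId, collected across the whole frame
pairs = toy.groupby('movieId')['userId'].apply(tuple)

# Broadcast each tuple back onto the rows of its movieId
s = toy['movieId'].map(pairs)
print(s.tolist())
```

In this sketch movieId 500 maps to (10, 20, 50, 60) instead of two separate pairs, so rows that belong to different pairs share one group key.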
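EDIT:
If you need to group by consecutive runs of movieId instead (in the sample data, movieId 500 appears again in rows 22 and 23 under the pair (50, 60)):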
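# Give each run of consecutive equal movieId values its own id
tmp = df['movieId'].ne(df['movieId'].shift()).cumsum()
# Map each row to the tuple of userIds in its run
s = tmp.map(df.groupby(tmp)['userId'].apply(tuple))
# Keep the first 4 rows of each distinct tuple
df = df.groupby(s).head(4)
print(df)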
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
6 60 1100 3.5
7 90 1100 4.5
8 60 1900 3.5
9 90 1900 4.5
12 30 1600 4.0
13 40 1600 1.5
14 30 1901 3.5
15 40 1901 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
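The shift/ne/cumsum idiom used for tmp can be checked in isolation. The sketch below uses made-up values (movie and run_id are illustrative names) and shows that a value which comes back later gets a new run id rather than rejoining its earlier run.

```python
import pandas as pd

# Hypothetical movieId values; 500 appears in two separate runs
movie = pd.Series([500, 500, 800, 800, 500, 500])

# True wherever the value differs from the previous row (the first row is always True)
starts = movie.ne(movie.shift())

# The running count of run starts labels each consecutive run
run_id = starts.cumsum()
print(run_id.tolist())  # [1, 1, 2, 2, 3, 3]
```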
EDIT:
Would it be better to drop every 2 rows after picking the first 4? That would do the job. Any suggestions? I mean: pick 4 rows, drop the next 2, pick another 4, drop the next 2, and so on.
You can take the index values modulo 6, then filter on that condition with boolean indexing:
# This relies on a default RangeIndex (0, 1, 2, ...); if the index differs, reset it first:
# df = df.reset_index(drop=True)
df = df[df.index % 6 < 4]
print(df)
userId movieId ratings
0 10 500 3.5
1 20 500 4.5
2 10 800 2.0
3 20 800 5.0
6 60 1100 3.5
7 90 1100 4.5
8 60 1900 3.5
9 90 1900 4.5
12 30 1600 4.0
13 40 1600 1.5
14 30 1901 3.5
15 40 1901 4.5
18 50 3025 2.0
19 60 3025 5.0
20 50 4000 4.0
21 60 4000 1.5
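If resetting the index is not desirable, one possible variation is to apply the modulo to row positions instead of index labels. This is only a sketch under that assumption; toy, pos and kept are illustrative names, and np.arange is used here in place of df.index.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-numeric index
toy = pd.DataFrame({'value': range(12)}, index=list('abcdefghijkl'))

# Row positions 0..n-1, independent of the index labels
pos = np.arange(len(toy))

# Keep positions 0-3 of every block of 6 rows, drop positions 4-5
kept = toy[pos % 6 < 4]
print(kept)
```

The original index labels are preserved in kept, which can help when the rows need to be matched back to the source frame later.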