Retrieving specific number of rows in group by pandas

Question

I have this dataframe.

import pandas as pd

df = pd.DataFrame({'userId': [10,20,10,20,10,20,60,90,60,90,60,90,30,40,30,40,30,40,50,60,50,60,50,60],
                   'movieId': [500,500,800,800,700,700,1100,1100,1900,1900,2000,2000,1600,1600,1901,1901,3000,3000,3025,3025,4000,4000,500,500],  
                   'ratings': [3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5]})

df
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
4       10      700      4.0
5       20      700      1.5
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
10      60     2000      2.0
11      90     2000      5.0
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
16      30     3000      3.5
17      40     3000      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
22      50      500      3.5
23      60      500      4.5

In this dataframe, the users come in pairs that have rated the same movies.
For illustration, the userId pairs are [(10, 20), (60, 90), (30, 40), (50, 60)].
Each pair occupies 6 consecutive rows, so a new pair starts after every 6 entries.
A user can also appear in more than one pair; here userId = 60 appears in two of them.
I want to pick a fixed number of entries from each pair, for example the first 4.

**Expected Outcome**

    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0

6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5

12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5

18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5

Answer 1

You can map each movieId to the tuple of userId values that rated it with Series.map, then group by those tuples and call GroupBy.head:

s = df['movieId'].map(df.groupby('movieId')['userId'].apply(tuple))

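# Note: the output below does not match the dataframe posted in the question
# (rows 8-11 and 16-17 show different userId values), so it appears to come from different data.
df = df.groupby(s).head(6)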
print (df)
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
4       10      700      4.0
5       20      700      1.5
8       30     1900      3.5
9       40     1900      4.5
10      30     2000      2.0
11      40     2000      5.0
12      30     1600      4.0
13      40     1600      1.5
16      50     3000      3.5
17      60     3000      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
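
To make the grouping key `s` easier to picture, here is a minimal sketch on a small made-up frame. The names `demo`, `users_per_movie` and `pair_key` and all the values are hypothetical, chosen only for illustration; they are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy data: two pairs of users, two movies per pair.
demo = pd.DataFrame({
    'userId':  [1, 2, 1, 2, 3, 4, 3, 4],
    'movieId': [10, 10, 11, 11, 20, 20, 21, 21],
})

# For every movieId, collect the users who rated it into a tuple.
users_per_movie = demo.groupby('movieId')['userId'].apply(tuple)

# Map that tuple back onto each row; rows sharing a tuple belong to one pair.
pair_key = demo['movieId'].map(users_per_movie)

# Keep the first 2 rows of each pair.
first_rows = demo.groupby(pair_key).head(2)

print(pair_key.tolist())
print(first_rows.index.tolist())
```

One caveat with this key: if the same movieId is rated by users from more than one pair (as movieId 500 is in the question's dataframe, at rows 0, 1, 22 and 23), its tuple contains all of those users and the rows end up in a separate group.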

EDIT:

If it is necessary to group by runs of consecutive movieId values:

tmp = df['movieId'].ne(df['movieId'].shift()).cumsum()
s = tmp.map(df.groupby(tmp)['userId'].apply(tuple))
df = df.groupby(s).head(4)
print (df)
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
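
The `ne(shift()).cumsum()` idiom used for `tmp` labels each run of consecutive equal values. A minimal sketch, assuming a made-up `movie` Series (hypothetical values, not the question's data):

```python
import pandas as pd

# Hypothetical movieId column in which 500 appears in two separate runs.
movie = pd.Series([500, 500, 800, 800, 500, 500])

# True wherever the value differs from the previous row. The first row is
# always True, because shift() puts NaN there and NaN never compares equal.
starts_new_run = movie.ne(movie.shift())

# The cumulative sum of those flags gives one id per consecutive run.
run_id = starts_new_run.cumsum()

print(run_id.tolist())  # [1, 1, 2, 2, 3, 3]
```

Because the two runs of 500 receive different ids, grouping by the run id keeps them apart, whereas grouping by the movieId value itself would merge them.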

EDIT:

Would it be better to drop 2 rows after every 4 picked? That would do the job: pick 4 rows, skip the next 2, pick another 4, skip the next 2, and so on. Any suggestions?

You can take the index values modulo 6 and keep the rows where the remainder is less than 4, using boolean indexing:

# this assumes a default RangeIndex; if the index is different, reset it first:
#df = df.reset_index(drop=True)
df = df[df.index % 6 < 4]
print (df)
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
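
If resetting the index is not desirable, the same idea can be expressed on row positions instead of index labels. This is a sketch under the assumption that the data really is laid out in fixed blocks of 6 rows; `demo`, `block_size` and `keep` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-default index: 12 rows = 2 blocks of 6.
demo = pd.DataFrame({'value': range(12)}, index=list('abcdefghijkl'))

block_size, keep = 6, 4  # assumed block layout, as described in the question

# Position of each row, independent of the index labels.
pos = np.arange(len(demo))

# Option 1: boolean mask on the position within each block.
by_mask = demo[pos % block_size < keep]

# Option 2: group by block number and take the head of each block.
by_group = demo.groupby(pos // block_size).head(keep)

print(by_mask['value'].tolist())  # [0, 1, 2, 3, 6, 7, 8, 9]
```

Both options select the same rows here; the mask is simpler, while the groupby form reads closer to "first 4 rows of each block".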

1 answer

solution1
2 ACCEPTED 2019-12-30 09:13:00
