Retrieving a specific number of rows per group in pandas

I have this dataframe.

import pandas as pd

df = pd.DataFrame({'userId': [10,20,10,20,10,20,60,90,60,90,60,90,30,40,30,40,30,40,50,60,50,60,50,60],
                   'movieId': [500,500,800,800,700,700,1100,1100,1900,1900,2000,2000,1600,1600,1901,1901,3000,3000,3025,3025,4000,4000,500,500],  
                   'ratings': [3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5]})



df
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
4       10      700      4.0
5       20      700      1.5
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
10      60     2000      2.0
11      90     2000      5.0
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
16      30     3000      3.5
17      40     3000      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
22      50      500      3.5
23      60      500      4.5
  1. In this dataframe, users are paired, and the two users in each pair have rated the same movies.
  2. For clarity, the userId values can be read as pairs, e.g. [(10,20),(60,90),(30,40),(50,60)].
  3. Each pair occupies 6 consecutive rows, so a new pair starts every 6 rows.
  4. A user can appear in more than one pair; here userId = 60 appears in two of them.
  5. I want to pick a fixed number of rows, e.g. the first 4, from each pair.
**Expected Outcome**

    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0

6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5

12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5

18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5


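You can map each movieId to the tuple of userIds that rated it with Series.map and then call GroupBy.head. Note that the printed rows below do not match the sample dataframe in the question (from row 8 onward the userId values differ), so they appear to come from an earlier version of the data: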

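# Map each movieId to the tuple of userIds that rated it
s = df['movieId'].map(df.groupby('movieId')['userId'].apply(tuple))

# Keep the first 6 rows of each tuple group
df = df.groupby(s).head(6)
print(df)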
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
4       10      700      4.0
5       20      700      1.5
8       30     1900      3.5
9       40     1900      4.5
10      30     2000      2.0
11      40     2000      5.0
12      30     1600      4.0
13      40     1600      1.5
16      50     3000      3.5
17      60     3000      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
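
To see what the movieId-based key looks like on the sample dataframe from the question, it can help to inspect the tuples it builds. The following is only a sketch under that assumption; the names `pairs`, `key` and `rows_per_key` are illustrative and not part of the original answer.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'userId': [10,20,10,20,10,20,60,90,60,90,60,90,
               30,40,30,40,30,40,50,60,50,60,50,60],
    'movieId': [500,500,800,800,700,700,1100,1100,1900,1900,2000,2000,
                1600,1600,1901,1901,3000,3000,3025,3025,4000,4000,500,500],
    'ratings': [3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,
                4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5]})

# One tuple of userIds per movieId, in row order
pairs = df.groupby('movieId')['userId'].apply(tuple)
print(pairs.loc[500])  # movieId 500 is rated by two different pairs
print(pairs.loc[800])

# Broadcast the tuple back to every row, then count rows per tuple
key = df['movieId'].map(pairs)
rows_per_key = Counter(key)
print(rows_per_key)
```

Because movieId 500 is rated by both (10, 20) and (50, 60), it maps to the 4-tuple (10, 20, 50, 60), and rows 0, 1, 22 and 23 end up in a group of their own instead of staying with their pairs. That is the situation the consecutive-movieId variant in the EDIT addresses.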

EDIT:

If you need to group by runs of consecutive movieId values instead:

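# Number each run of consecutive equal movieId values
tmp = df['movieId'].ne(df['movieId'].shift()).cumsum()
# Label every row with the tuple of userIds in its run
s = tmp.map(df.groupby(tmp)['userId'].apply(tuple))
# Keep the first 4 rows of each pair
df = df.groupby(s).head(4)
print(df)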
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
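
The intermediate values of the consecutive-run approach can be printed to check each step. This is a minimal sketch, assuming the sample dataframe from the question; `run_id`, `pair` and `first4` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'userId': [10,20,10,20,10,20,60,90,60,90,60,90,
               30,40,30,40,30,40,50,60,50,60,50,60],
    'movieId': [500,500,800,800,700,700,1100,1100,1900,1900,2000,2000,
                1600,1600,1901,1901,3000,3000,3025,3025,4000,4000,500,500],
    'ratings': [3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,
                4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5]})

# A new run starts whenever movieId differs from the previous row
run_id = df['movieId'].ne(df['movieId'].shift()).cumsum()

# Each run holds one movie rated by one pair, so its tuple is the pair itself
pair = run_id.map(df.groupby(run_id)['userId'].apply(tuple))

# First 4 rows of every pair, in the original row order
first4 = df.groupby(pair).head(4)

print(run_id.tolist())
print(first4.index.tolist())
```

The sketch relies on the two users of a pair appearing in the same order in every run; a run listed as (20, 10) would form a different key than (10, 20).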

EDIT:

Would it be better to drop 2 rows after every 4 picked? That would do the job. Any suggestions? I mean: keep 4 rows, skip the next 2, keep another 4, skip the next 2, and so on.

You can take the index values modulo 6 and keep the rows where the remainder is less than 4, using boolean indexing:

# This relies on a default RangeIndex (0, 1, 2, ...); if the index differs, reset it first:
# df = df.reset_index(drop=True)
df = df[df.index % 6 < 4]
print(df)
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
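
If the original index labels should be kept rather than reset, the same filter can be built from row positions instead. The sketch below swaps `df.index` for `np.arange(len(df))`, which is not in the original answer; `block_size`, `keep`, `relabelled` and the shifted index are hypothetical, used only to simulate a non-default index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'userId': [10,20,10,20,10,20,60,90,60,90,60,90,
               30,40,30,40,30,40,50,60,50,60,50,60],
    'movieId': [500,500,800,800,700,700,1100,1100,1900,1900,2000,2000,
                1600,1600,1901,1901,3000,3000,3025,3025,4000,4000,500,500],
    'ratings': [3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,
                4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5]})

# Hypothetical non-default index: labels start at 100 instead of 0
relabelled = df.copy()
relabelled.index = range(100, 100 + len(relabelled))

block_size, keep = 6, 4                 # 6 rows per pair, keep the first 4
pos = np.arange(len(relabelled))        # 0-based row positions

# Boolean mask on positions, independent of the index labels
kept = relabelled[pos % block_size < keep]

# Equivalent formulation: group rows into blocks of 6 and take head(4)
by_block = relabelled.groupby(pos // block_size).head(keep)

print(kept.index.tolist())
```

Both variants assume every pair occupies exactly `block_size` consecutive rows, as in the sample data; if block lengths vary, the tuple-based grouping is the safer choice.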

