Retrieve a specific number of rows from groups with pandas

Question

I have this DataFrame.

import pandas as pd

df = pd.DataFrame({'userId': [10,20,10,20,10,20,60,90,60,90,60,90,30,40,30,40,30,40,50,60,50,60,50,60],
                   'movieId': [500,500,800,800,700,700,1100,1100,1900,1900,2000,2000,1600,1600,1901,1901,3000,3000,3025,3025,4000,4000,500,500],  
                   'ratings': [3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5]})

df
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
4       10      700      4.0
5       20      700      1.5
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
10      60     2000      2.0
11      90     2000      5.0
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
16      30     3000      3.5
17      40     3000      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
22      50      500      3.5
23      60      500      4.5

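In this DataFrame, pairs of users have movies in common.
For illustration, the userIds can be treated as pairs, e.g. [(10,20),(60,90),(30,40),(50,60)],
since each of these pairs has movies in common. A new pair starts every 6 rows.
A user can also appear in more than one pair; in this DataFrame, for example, userId = 60 appears in two pairs.
I want to select, say, the first 4 rows from each pair.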

**Expected Outcome**

    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0

6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5

12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5

18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5

Answer 1

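You can use Series.map to convert each group to a tuple, then call GroupBy.head:

# map each movieId to the tuple of all userIds that rated it
s = df['movieId'].map(df.groupby('movieId')['userId'].apply(tuple))

# keep the first 6 rows of each tuple-keyed group
df = df.groupby(s).head(6)
print(df)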
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
4       10      700      4.0
5       20      700      1.5
8       30     1900      3.5
9       40     1900      4.5
10      30     2000      2.0
11      40     2000      5.0
12      30     1600      4.0
13      40     1600      1.5
16      50     3000      3.5
17      60     3000      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
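To see what the grouping key looks like, the intermediate Series can be inspected directly. The following is only a sketch run against the DataFrame from the question; the name users_per_movie is introduced here for illustration and is not part of the answer. With that data, movieId 500 is rated by users 10 and 20 and again by users 50 and 60, so its key becomes (10, 20, 50, 60) rather than a pair. Note also that the output printed above does not match the question's DataFrame (rows 8 to 11 show userId 30 and 40, for instance), so it was presumably produced from different data.

```python
import pandas as pd

# The userId and movieId columns from the question (ratings are not needed here).
df = pd.DataFrame({
    'userId': [10, 20, 10, 20, 10, 20, 60, 90, 60, 90, 60, 90,
               30, 40, 30, 40, 30, 40, 50, 60, 50, 60, 50, 60],
    'movieId': [500, 500, 800, 800, 700, 700, 1100, 1100, 1900, 1900, 2000, 2000,
                1600, 1600, 1901, 1901, 3000, 3000, 3025, 3025, 4000, 4000, 500, 500],
})

# One tuple of userIds per movieId, collected over the whole DataFrame.
users_per_movie = df.groupby('movieId')['userId'].apply(tuple)

# Broadcast that tuple back onto every row; this is the grouping key `s`.
s = df['movieId'].map(users_per_movie)

print(users_per_movie.loc[500])  # key for a movie shared by two pairs
print(users_per_movie.loc[800])  # key for a movie rated by one pair only
print(s.iloc[0], s.iloc[2])      # rows 0 and 2 end up with different keys
```

Because rows 0, 1, 22 and 23 get their own key, grouping by movieId alone does not reproduce the pairs from the question. That is the case the consecutive-run variant in the edit that follows addresses.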

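Edit:

If you need to filter by consecutive movieId values:

# give each run of consecutive equal movieId values its own integer label
tmp = df['movieId'].ne(df['movieId'].shift()).cumsum()
# map each run to the tuple of its userIds, so runs of the same pair share a key
s = tmp.map(df.groupby(tmp)['userId'].apply(tuple))
# keep the first 4 rows of each pair
df = df.groupby(s).head(4)
print(df)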
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
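The ne(shift()).cumsum() step is what separates the two occurrences of movieId 500. A minimal sketch on a short, made-up Series (the values and the names movie_ids and run_id are illustrative assumptions) shows how each run of equal values gets its own label:

```python
import pandas as pd

# Hypothetical sample: 500 appears in two separate runs.
movie_ids = pd.Series([500, 500, 800, 800, 500, 500])

# True wherever the value differs from the previous row (the first row always differs).
starts_new_run = movie_ids.ne(movie_ids.shift())

# The cumulative sum of those flags gives a counter that increases at every run start.
run_id = starts_new_run.cumsum()

print(run_id.tolist())  # [1, 1, 2, 2, 3, 3]
```

The second run of 500 receives label 3 rather than 1, so grouping by run_id keeps it apart from the first run.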

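Edit:

Would it be better to exclude every 2 rows after selecting the first 4? That would do the job. Any suggestions? I mean it would select 4, then drop the next 2, then select 4 more, then drop the next 2, and so on.

You can take the index values modulo 6, then filter by that condition with boolean indexing:

# assumes the default RangeIndex; if the index differs, reset it first:
# df = df.reset_index(drop=True)
df = df[df.index % 6 < 4]
print(df)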
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
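If the index is not a default RangeIndex and you would rather not reset it, the same idea can be applied to row positions instead of index labels. This is a sketch, not part of the answer; the helper name first_k_of_every_n and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def first_k_of_every_n(frame, k=4, n=6):
    """Keep the first k rows of every consecutive block of n rows, by position."""
    positions = np.arange(len(frame))   # 0, 1, 2, ... regardless of the index labels
    return frame[positions % n < k]     # boolean mask selects rows within each block

# Usage on a frame with a non-numeric index (made-up data).
sample = pd.DataFrame({'x': range(12)}, index=list('abcdefghijkl'))
result = first_k_of_every_n(sample)
print(result['x'].tolist())  # [0, 1, 2, 3, 6, 7, 8, 9]
```

Like the modulo filter above, this relies on every pair occupying exactly n consecutive rows; if block sizes vary, the groupby-based approach is the safer choice.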

Solution 1
2 Accepted 2019-12-30 09:13:00
