How to filter rows by group in a pandas dataframe
Suppose I have some grouped data like this:
GroupID | ID | Rank | target |
---|---|---|---|
A | 1 | 1 | 0 |
A | 2 | 3 | 0 |
A | 3 | 2 | 1 |
B | 1 | 1 | 0 |
B | 2 | 4 | 0 |
B | 3 | 3 | 1 |
B | 4 | 2 | 0 |
C | 1 | 1 | 1 |
C | 2 | 4 | 0 |
C | 3 | 3 | 1 |
C | 4 | 2 | 0 |
D | 1 | 1 | 0 |
D | 2 | 4 | 0 |
D | 3 | 3 | 0 |
D | 4 | 2 | 0 |
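For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: the variable name `df` and the integer dtypes are assumptions, since the question does not show how the frame was created.

```python
import pandas as pd

# Sample data transcribed from the table above (dtypes are assumed to be integers)
df = pd.DataFrame({
    "GroupID": list("AAABBBBCCCCDDDD"),
    "ID":      [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    "Rank":    [1, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2],
    "target":  [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
})

# Number of target == 1 rows per group; group D has none
targets_per_group = df.groupby("GroupID")["target"].sum()
print(targets_per_group)
```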
For each group, I want to drop the group entirely if it has no row with target == 1.
Then I want to keep the row with target == 1 and the rows that are ranked higher than it (that is, rows with a smaller Rank value). Some groups may have several rows with target == 1; in that case we take the lowest-ranked one (the largest Rank value) as the target.
For example, in group C both ID=1 and ID=3 have target == 1, so we keep the rows with Rank <= 3.
So we will get:
GroupID | ID | Rank | target |
---|---|---|---|
A | 1 | 1 | 0 |
A | 3 | 2 | 1 |
B | 1 | 1 | 0 |
B | 3 | 3 | 1 |
B | 4 | 2 | 0 |
C | 1 | 1 | 1 |
C | 3 | 3 | 1 |
C | 4 | 2 | 0 |
IIUC, make a first pass to slice the rows with target == 1 (using `eq`), then get the max rank per group using `GroupBy.max`, and select the rows whose rank is less than or equal to this per-group maximum with classical boolean indexing using `le`:
thresh = df[df['target'].eq(1)].groupby('GroupID')['Rank'].max()
out = df[df['Rank'].le(df['GroupID'].map(thresh))]
output:
GroupID ID Rank target
0 A 1 1 0
2 A 3 2 1
3 B 1 1 0
5 B 3 3 1
6 B 4 2 0
7 C 1 1 1
9 C 3 3 1
10 C 4 2 0
thresholds:
>>> thresh
GroupID
A 2
B 3
C 3
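Note that a group without any target == 1 row (D in the question's data) never gets an entry in `thresh`. A small sketch of why such a group drops out, using only the C and D rows from the question; the names `per_row` and `out` are illustrative, not part of the answer:

```python
import pandas as pd

# Only groups C and D from the question's table, to keep the example short
df = pd.DataFrame({
    "GroupID": list("CCCCDDDD"),
    "ID":      [1, 2, 3, 4, 1, 2, 3, 4],
    "Rank":    [1, 4, 3, 2, 1, 4, 3, 2],
    "target":  [1, 0, 1, 0, 0, 0, 0, 0],
})

# Largest Rank among target == 1 rows; D is absent because it has no such row
thresh = df[df["target"].eq(1)].groupby("GroupID")["Rank"].max()

# Mapping a GroupID that is missing from thresh yields NaN
per_row = df["GroupID"].map(thresh)

# Any comparison against NaN is False, so every D row is filtered out
out = df[df["Rank"].le(per_row)]
print(out)
```

So the first requirement (dropping groups with no target == 1 row) appears to be handled by the same comparison, without a separate filtering step.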
Replace `Rank` with NaN via `Series.where` wherever target is not 1, then use `GroupBy.transform` to get the maximal `Rank` per group, so the `Rank` column can be compared against it with `Series.le` (less than or equal) in boolean indexing:
s = df['Rank'].where(df['target'].eq(1)).groupby(df['GroupID']).transform('max')
df = df[df['Rank'].le(s)]
print(df)
GroupID ID Rank target
0 A 1 1 0
2 A 3 2 1
3 B 1 1 0
5 B 3 3 1
6 B 4 2 0
7 C 1 1 1
9 C 3 3 1
10 C 4 2 0
Details:
print(df['Rank'].where(df['target'].eq(1)))
0 NaN
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
6 NaN
7 1.0
8 NaN
9 3.0
10 NaN
Name: Rank, dtype: float64
print(s)
0 2.0
1 2.0
2 2.0
3 3.0
4 3.0
5 3.0
6 3.0
7 3.0
8 3.0
9 3.0
10 3.0
Name: Rank, dtype: float64
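As a sanity check, the two approaches can be run side by side on the question's full table, including group D. This is a sketch only; the names `out_map`, `out_transform` and `same` are made up for the comparison and the frame construction is an assumption about how the data was loaded:

```python
import pandas as pd

# Full sample data from the question, including group D
df = pd.DataFrame({
    "GroupID": list("AAABBBBCCCCDDDD"),
    "ID":      [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    "Rank":    [1, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2],
    "target":  [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
})

# Approach 1: one threshold per group, mapped back onto the rows
thresh = df[df["target"].eq(1)].groupby("GroupID")["Rank"].max()
out_map = df[df["Rank"].le(df["GroupID"].map(thresh))]

# Approach 2: mask non-target ranks, then broadcast the group max with transform
s = df["Rank"].where(df["target"].eq(1)).groupby(df["GroupID"]).transform("max")
out_transform = df[df["Rank"].le(s)]

# Both filters select the same rows; the original integer dtypes are kept
# because s and the mapped thresholds are only used to build the boolean mask
same = out_map.equals(out_transform)
print(same)
print(out_map.index.tolist())
```

The `transform` version avoids the intermediate `map` step, while the `map` version leaves a compact per-group `thresh` Series that can be inspected on its own.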