
How to select all rows if no conditions are met within a group and select a subset of rows if certain conditions within a group are met in pandas

Given the DataFrame below:

import pandas as pd

# sample data: one row per customer and year
df = pd.DataFrame({'customer': ['cust1', 'cust1', 'cust1', 'cust2', 'cust2', 'cust3', 'cust3', 'cust4', 'cust4'],
                   'year': [2017, 2018, 2019, 2018, 2019, 2017, 2018, 2018, 2019],
                   'score': [0.10, 0.59, 0.3, 0.44, 0.2, 0.78, 0.6, 0.37, .023]})

    customer    year    score
0   cust1   2017    0.100
1   cust1   2018    0.590
2   cust1   2019    0.300
3   cust2   2018    0.440
4   cust2   2019    0.200
5   cust3   2017    0.780
6   cust3   2018    0.600
7   cust4   2018    0.370
8   cust4   2019    0.023

I want to filter the data within each customer group. The conditions are:

if any score in a group is >= 0.5: return only the rows with score >= 0.5 in that group
if no score in a group is >= 0.5: return all the rows of that group

The result should look like this:

    customer    year    score
0   cust1   2018    0.590
1   cust2   2018    0.440
2   cust2   2019    0.200
3   cust3   2017    0.780
4   cust3   2018    0.600
5   cust4   2018    0.370
6   cust4   2019    0.023

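Chain two conditions: the first mask tests for greater than or equal with Series.ge, and the second mask selects every customer that has no rows matching condition m: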

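# rows whose score is >= 0.5
m = df['score'].ge(0.5)
# keep those rows, plus all rows of customers with no such row
# (assigned to `out` so that `df` stays intact for the checks below)
out = df[m | ~df['customer'].isin(df.loc[m, 'customer'])]
print (out)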
  customer  year  score
1    cust1  2018  0.590
3    cust2  2018  0.440
4    cust2  2019  0.200
5    cust3  2017  0.780
6    cust3  2018  0.600
7    cust4  2018  0.370
8    cust4  2019  0.023

Details:

print (df.loc[m, 'customer'])
1    cust1
5    cust3
6    cust3
Name: customer, dtype: object

print (~df['customer'].isin(df.loc[m, 'customer']))
0    False
1    False
2    False
3     True
4     True
5    False
6    False
7     True
8     True
Name: customer, dtype: bool
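
The two masks above can be put together into a self-contained script. This is only a sketch: the names `qualifying`, `keep` and `result` are illustrative and not part of the original answer, and the final `reset_index(drop=True)` is an optional extra step, added on the assumption that the 0..6 index shown in the question's expected result is wanted.

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['cust1', 'cust1', 'cust1', 'cust2', 'cust2',
                 'cust3', 'cust3', 'cust4', 'cust4'],
    'year': [2017, 2018, 2019, 2018, 2019, 2017, 2018, 2018, 2019],
    'score': [0.10, 0.59, 0.3, 0.44, 0.2, 0.78, 0.6, 0.37, .023],
})

# first mask: rows whose score is >= 0.5
m = df['score'].ge(0.5)

# customers that own at least one such row (illustrative name)
qualifying = df.loc[m, 'customer'].unique()

# keep matching rows, plus every row of customers that have no matching row
keep = m | ~df['customer'].isin(qualifying)

# optional: renumber the rows 0..n-1, as in the question's expected result
result = df[keep].reset_index(drop=True)
print(result)
```

Without the `reset_index` call the original row labels (1, 3, 4, ...) are kept, as in the output printed in the answer.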

Alternatively, if performance is not important, build the second mask with GroupBy.transform and GroupBy.any; this is likely to be slower on large DataFrames:

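# rows whose score is >= 0.5
m = df['score'].ge(0.5)
# transform('any') broadcasts "does this group have any match?" back to every row
out = df[m | ~m.groupby(df['customer']).transform('any')]
print (out)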
  customer  year  score
1    cust1  2018  0.590
3    cust2  2018  0.440
4    cust2  2019  0.200
5    cust3  2017  0.780
6    cust3  2018  0.600
7    cust4  2018  0.370
8    cust4  2019  0.023

Details:

print (~m.groupby(df['customer']).transform('any'))
0    False
1    False
2    False
3     True
4     True
5    False
6    False
7     True
8     True
Name: score, dtype: bool
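
The two "Details" outputs show the same True/False pattern, differing only in the Series name (customer versus score). A quick way to check this on the sample data is sketched below; the variable names `mask_isin`, `mask_transform` and `same` are made up for the illustration, and the masks are compared as plain lists so that the differing names play no role.

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['cust1', 'cust1', 'cust1', 'cust2', 'cust2',
                 'cust3', 'cust3', 'cust4', 'cust4'],
    'year': [2017, 2018, 2019, 2018, 2019, 2017, 2018, 2018, 2019],
    'score': [0.10, 0.59, 0.3, 0.44, 0.2, 0.78, 0.6, 0.37, .023],
})

m = df['score'].ge(0.5)

# second mask, variant 1: customers not found among the matching rows
mask_isin = ~df['customer'].isin(df.loc[m, 'customer'])

# second mask, variant 2: groups in which no row matches
mask_transform = ~m.groupby(df['customer']).transform('any')

# compare the values only, ignoring the Series names
same = mask_isin.tolist() == mask_transform.tolist()
print(same)
```

Because both masks agree row by row here, `df[m | mask_isin]` and `df[m | mask_transform]` select the same rows.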

You can use two masks for boolean indexing:

# is the score ≥ 0.5?
m1 = df['score'].ge(0.5)
# are none of the values in the group ≥ 0.5?
m2 = ~m1.groupby(df['customer']).transform('any')

# select if any condition matches
out = df[m1|m2]

Output:

  customer  year  score
1    cust1  2018  0.590
3    cust2  2018  0.440
4    cust2  2019  0.200
5    cust3  2017  0.780
6    cust3  2018  0.600
7    cust4  2018  0.370
8    cust4  2019  0.023
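
If the same filter is needed for other columns or cut-offs, the two masks can be wrapped in a small helper. The sketch below is an assumption-laden illustration, not part of either answer: the function name `filter_by_group_threshold` and its parameters are hypothetical, and the 0.7 threshold in the usage lines is chosen only to show a different outcome from the 0.5 case.

```python
import pandas as pd


def filter_by_group_threshold(df, group_col, value_col, threshold):
    """Hypothetical helper: per group, keep only rows with value >= threshold;
    if a group has no such row, keep all of its rows."""
    # is the value >= threshold?
    meets = df[value_col].ge(threshold)
    # does the row's group contain at least one value >= threshold?
    group_has_match = meets.groupby(df[group_col]).transform('any')
    # keep matching rows, or every row of a group without matches
    return df[meets | ~group_has_match]


df = pd.DataFrame({
    'customer': ['cust1', 'cust1', 'cust1', 'cust2', 'cust2',
                 'cust3', 'cust3', 'cust4', 'cust4'],
    'year': [2017, 2018, 2019, 2018, 2019, 2017, 2018, 2018, 2019],
    'score': [0.10, 0.59, 0.3, 0.44, 0.2, 0.78, 0.6, 0.37, .023],
})

out_05 = filter_by_group_threshold(df, 'customer', 'score', 0.5)
out_07 = filter_by_group_threshold(df, 'customer', 'score', 0.7)
print(out_07)
```

With a threshold of 0.7 only cust3 has a qualifying row (0.78), so cust3 is reduced to that row while every other customer keeps all of its rows.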

