How to filter a pandas dataframe by string values and matching integers in rows?
Thanks in advance for your help. I'm fairly new to pandas and couldn't find this particular kind of query in my search results.
I have a pandas dataframe:
+-----+---------+----------+
| id | value | match_id |
+-----+---------+----------+
| A10 | grass | 1 |
| B45 | cow | 3 |
| B98 | bird | 6 |
| B17 | grass | 1 |
| A20 | tree | 2 |
| A87 | farmer | 5 |
| B11 | grass | 1 |
| A33 | chicken | 4 |
| B56 | tree | 2 |
| A23 | farmer | 5 |
| B65 | cow | 3 |
+-----+---------+----------+
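I need to filter this dataframe to find the rows that share a match_id value, with the condition that the id column across those matching rows must contain both the string A and the string B.
Here is the expected output: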
+-----+-------+----------+
| id | value | match_id |
+-----+-------+----------+
| A10 | grass | 1 |
| B17 | grass | 1 |
| A20 | tree | 2 |
| B11 | grass | 1 |
| B56 | tree | 2 |
+-----+-------+----------+
How can I do this, for example, in one line of pandas code? A reproducible program is below:
import pandas as pd
data_example = {'id': ['A10', 'B45', 'B98', 'B17', 'A20', 'A87', 'B11', 'A33', 'B56', 'A23', 'B65'],
                'value': ['grass', 'cow', 'bird', 'grass', 'tree', 'farmer', 'grass', 'chicken', 'tree', 'farmer', 'cow'],
                'match_id': [1, 3, 6, 1, 2, 5, 1, 4, 2, 5, 3]}
df_example = pd.DataFrame(data=data_example)
data_expected = {'id': ['A10', 'B17', 'A20', 'B11', 'B56'],
                 'value': ['grass', 'grass', 'tree', 'grass', 'tree'],
                 'match_id': [1, 1, 2, 1, 2]}
df_expected = pd.DataFrame(data=data_expected)
Thanks!
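A one-liner seems hard, but you can use str.extract to pull the two strings you want out of id, then groupby match_id and use any to check whether at least one row per match_id has each string. Applying all along axis 1 then gives True for the match_id values that have both strings. Finally, map the match_id column through the Series you just created and use the result to select only the rows whose match_id is True.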
s = (df_example['id'].str.extract(r'(A)|(B)').notna()
     .groupby(df_example['match_id']).any().all(axis=1))
df_expected = df_example.loc[df_example['match_id'].map(s), :]
print(df_expected)
id value match_id
0 A10 grass 1
3 B17 grass 1
4 A20 tree 2
6 B11 grass 1
8 B56 tree 2
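To make the chained expression above easier to follow, here is a minimal sketch that splits it into its intermediate steps. The variable names has_letter, per_group, both and result are only illustrative and are not part of the original answer.

```python
import pandas as pd

df_example = pd.DataFrame({
    'id': ['A10', 'B45', 'B98', 'B17', 'A20', 'A87', 'B11', 'A33', 'B56', 'A23', 'B65'],
    'value': ['grass', 'cow', 'bird', 'grass', 'tree', 'farmer', 'grass', 'chicken', 'tree', 'farmer', 'cow'],
    'match_id': [1, 3, 6, 1, 2, 5, 1, 4, 2, 5, 3],
})

# Step 1: one boolean column per capture group
# (column 0 is True where "A" matched, column 1 where "B" matched).
has_letter = df_example['id'].str.extract(r'(A)|(B)').notna()

# Step 2: per match_id, does at least one row contain each letter?
per_group = has_letter.groupby(df_example['match_id']).any()

# Step 3: a match_id qualifies only if both columns are True.
both = per_group.all(axis=1)

# Step 4: map the per-group flag back onto the rows and filter.
result = df_example[df_example['match_id'].map(both)]
print(result)
```

Only match_id 1 and 2 have rows with both letters, so those are the groups that survive the filter.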
A different take on @Ben.T's solution:
# create a helper column that combines the letters per group
res = (df_example
       # the id column starts with a letter
       .assign(letter=lambda x: x.id.str[0])
       .groupby('match_id')
       .letter.transform(','.join)
      )
# attach the helper column to a new frame named df; df_example stays unchanged
df = df_example.assign(grp=res)
df
id value match_id grp
0 A10 grass 1 A,B,B
1 B45 cow 3 B,B
2 B98 bird 6 B
3 B17 grass 1 A,B,B
4 A20 tree 2 A,B
5 A87 farmer 5 A,A
6 B11 grass 1 A,B,B
7 A33 chicken 4 A
8 B56 tree 2 A,B
9 A23 farmer 5 A,A
10 B65 cow 3 B,B
# filter for groups that contain A and B, and keep only the relevant columns
df.loc[df.grp.str.contains('A,B'), "id":"match_id"]
id value match_id
0 A10 grass 1
3 B17 grass 1
4 A20 tree 2
6 B11 grass 1
8 B56 tree 2
# or you could use a list comprehension that checks for both A and B anywhere in the group
# (not just an A immediately followed by a B)
filtered = [("A" in ent) and ("B" in ent) for ent in df.grp.array]
df.loc[filtered, "id":"match_id"]
id value match_id
0 A10 grass 1
3 B17 grass 1
4 A20 tree 2
6 B11 grass 1
8 B56 tree 2
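If the helper column is not needed, the same idea can probably be written more directly. The sketch below shows two possible variants, one with groupby.filter and one with a boolean mask built from transform('any'). Both assume, as above, that every id starts with a single letter; the names wanted, letters, filtered and masked are made up for illustration.

```python
import pandas as pd

df_example = pd.DataFrame({
    'id': ['A10', 'B45', 'B98', 'B17', 'A20', 'A87', 'B11', 'A33', 'B56', 'A23', 'B65'],
    'value': ['grass', 'cow', 'bird', 'grass', 'tree', 'farmer', 'grass', 'chicken', 'tree', 'farmer', 'cow'],
    'match_id': [1, 3, 6, 1, 2, 5, 1, 4, 2, 5, 3],
})

wanted = {'A', 'B'}  # letters that must all appear within a match_id group

# Variant 1: groupby.filter keeps every row of each group for which
# the function returns True (close to a one-liner).
filtered = df_example.groupby('match_id').filter(
    lambda g: wanted <= set(g['id'].str[0])
)

# Variant 2: build a row-level boolean mask with transform('any'),
# which avoids calling a Python lambda once per group.
letters = df_example['id'].str[0]
mask = (
    letters.eq('A').groupby(df_example['match_id']).transform('any')
    & letters.eq('B').groupby(df_example['match_id']).transform('any')
)
masked = df_example[mask]

print(filtered)
```

Unlike the str.contains('A,B') check, both variants are independent of the order in which A and B appear within a group.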