Check if String in List of Strings is in DataFrame Pandas
I have a question about matching the strings in a list against a column in a df.
I read the question "Check if String in List of Strings is in Pandas DataFrame Column" and understand it, but my need is a little different.
Code:
import numpy as np
import pandas as pd

Cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4', np.nan],
        'Price': [22000, 25000, 27000, 35000, 29000],
        'Liscence Plate': ['ABC 123', 'XYZ 789', 'CBA 321', 'ZYX 987', 'DEF 456']}
df = pd.DataFrame(Cars, columns=['Brand', 'Price', 'Liscence Plate'])

search_for_these_values = ['Honda', 'Toy', 'Ford Focus', 'Audi A4 2019']
# Join the search terms into one regex alternation: 'Honda|Toy|Ford Focus|Audi A4 2019'
pattern = '|'.join(search_for_these_values)
# na=False marks rows with a missing Brand as False instead of NaN
df['Match'] = df["Brand"].str.contains(pattern, na=False)
print(df)
Output I get:
            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789   True
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987  False
4             NaN  29000        DEF 456  False
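As far as I can tell, this result comes from `str.contains` treating the pattern as a regular expression and searching for it anywhere inside each string. A minimal sketch of that behaviour using only the standard `re` module (the names `toyota_hit` and `audi_hit` are just for illustration):

```python
import re

pattern = "Honda|Toy|Ford Focus|Audi A4 2019"

# 'Toy' occurs inside 'Toyota', so a substring search reports a match.
toyota_hit = re.search(pattern, "Toyota Corolla") is not None

# The only Audi alternative is the longer 'Audi A4 2019',
# which does not occur inside 'Audi A4'.
audi_hit = re.search(pattern, "Audi A4") is not None

print(toyota_hit, audi_hit)  # True False
```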
Output I want:
            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
One way, using whole-word matching:
pat = "|".join(search_for_these_values).replace(" ", "|")
match = df["Brand"].str.findall(r"\b(%s)\b" % pat)
Output:
0          [Honda]
1               []
2    [Ford, Focus]
3       [Audi, A4]
4              NaN
Name: Brand, dtype: object
You can then assign it back:
df["match"] = match.str.len().ge(1)
Final output:
            Brand  Price Liscence Plate  match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
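One caveat with building the pattern by hand: a search term containing regex metacharacters (for example `.` or `+`) would change the meaning of the pattern. A self-contained sketch of the same word-match idea that escapes each word with `re.escape` and uses `str.contains` with a non-capturing group; this is a variation on the approach above under that assumption, not a replacement for it:

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({"Brand": ["Honda Civic", "Toyota Corolla", "Ford Focus", "Audi A4", np.nan]})
search_for_these_values = ["Honda", "Toy", "Ford Focus", "Audi A4 2019"]

# Split every search phrase into words and escape each one so that
# characters such as '.' or '+' are matched literally.
words = [re.escape(w) for phrase in search_for_these_values for w in phrase.split()]

# \b ... \b requires whole-word matches, so 'Toy' no longer matches 'Toyota'.
pat = r"\b(?:%s)\b" % "|".join(words)

# na=False turns the missing Brand into False instead of NaN.
df["match"] = df["Brand"].str.contains(pat, na=False)
print(df["match"].tolist())  # [True, False, True, True, False]
```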
If we use the rule you outlined, "if one word is true, then true", then any row in the Brand column that contains '2019' would return True, which I believe is not what we want.
Having said that, you can use a list comprehension to create a new list that is the split() version of your search_for_these_values with the years excluded, and then use isin together with any:
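# list comprehension: split each search phrase into words and drop tokens containing four digits (years)
import re
s = [word for cars in search_for_these_values for word in cars.split() if not re.search(r'\d{4}', word)]
# Assign True / False: True if any word of Brand is in s (axis must be passed by keyword in pandas 2.0+)
df['Match'] = df['Brand'].str.split(expand=True).isin(s).any(axis=1)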
Prints back:
            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
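To see what the year filter buys you, here is a small sketch comparing the word list with and without it. The `BMW 2019` row is a hypothetical example that is not in the original data, and `with_filter` / `without_filter` are illustrative names. Note that `\d{4}` drops any word containing four consecutive digits, not only valid years.

```python
import re

import pandas as pd

search_for_these_values = ["Honda", "Toy", "Ford Focus", "Audi A4 2019"]

# All words from the search phrases, years included.
all_words = [w for phrase in search_for_these_values for w in phrase.split()]
# Same list comprehension as above: drop words containing four digits.
s = [w for w in all_words if not re.search(r"\d{4}", w)]

# Hypothetical rows: the second one overlaps with the search list only via the year.
brands = pd.Series(["Audi A4", "BMW 2019"])
tokens = brands.str.split(expand=True)

with_filter = tokens.isin(s).any(axis=1)
without_filter = tokens.isin(all_words).any(axis=1)

print(with_filter.tolist())     # [True, False]
print(without_filter.tolist())  # [True, True]
```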