
Check if String in List of Strings is in DataFrame Pandas

I have a question about matching strings in a list against a column in a DataFrame.

I read the question "Check if String in List of Strings is in Pandas DataFrame Column" and understand it, but my need is a little different.

Code:

import numpy as np
import pandas as pd

Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4', np.nan],
    'Price': [22000,25000,27000,35000, 29000],
    'Liscence Plate': ['ABC 123', 'XYZ 789', 'CBA 321', 'ZYX 987', 'DEF 456']}

df = pd.DataFrame(Cars,columns= ['Brand', 'Price', 'Liscence Plate'])

search_for_these_values = ['Honda', 'Toy', 'Ford Focus', 'Audi A4 2019']
# Join the search values into a single regex alternation
pattern = '|'.join(search_for_these_values)

# True where any search value occurs as a substring of Brand; NaN counts as False
df['Match'] = df["Brand"].str.contains(pattern, na=False)
print(df)

Output I get:

            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789   True
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987  False
4             NaN  29000        DEF 456  False

Output I want:

            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False

One way is to match on whole words:

pat = "|".join(search_for_these_values).replace(" ", "|")
match = df["Brand"].str.findall(r"\b(%s)\b" % pat)

Output:

0          [Honda]
1               []
2    [Ford, Focus]
3       [Audi, A4]
4              NaN
Name: Brand, dtype: object

You can then assign it back:

df["match"] = match.str.len().ge(1)

Final output:

            Brand  Price Liscence Plate  match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
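The same word-boundary idea can be sketched as a self-contained snippet that produces the boolean column directly. This is only a sketch under a few assumptions not in the original answer: each word is passed through re.escape in case a search value contains regex metacharacters, a non-capturing group is used so that str.contains can be called instead of findall, and the result name word_match is made up for illustration.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Brand": ["Honda Civic", "Toyota Corolla", "Ford Focus", "Audi A4", np.nan]}
)
search_for_these_values = ["Honda", "Toy", "Ford Focus", "Audi A4 2019"]

# Split every search phrase into single words and escape each one
# (escaping is an added precaution, not part of the original answer).
words = [re.escape(w) for phrase in search_for_these_values for w in phrase.split()]

# \b anchors force whole-word matches, so "Toy" does not match "Toyota".
pattern = r"\b(?:%s)\b" % "|".join(words)

# na=False turns the missing Brand into False instead of NaN.
word_match = df["Brand"].str.contains(pattern, na=False)
print(word_match.tolist())
```

Like the findall version, this would also flag any Brand that contains the bare word "2019", because the year is still part of the pattern.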

If we use the rule you outlined, 'If one word is true, then true', then any row in the Brand column that contains '2019' would also return True, which I believe is not what we want.

That said, you can use a list comprehension to build a new list holding the individual words of search_for_these_values (each entry run through split()), excluding years, and then use isin with any:

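# list comprehension: split each search value into words, dropping words that contain a 4-digit year
import re
s = [word for cars in search_for_these_values for word in cars.split() if not re.search(r'\d{4}',word)]

# Assign True / False: True if any word of Brand is in s
# (axis must be passed by keyword in recent pandas versions)
df['Match'] = df['Brand'].str.split(expand = True).isin(s).any(axis=1)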

This prints:

            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
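To see what filtering out the year changes, here is a small sketch that compares the word list with and without it. The brands Series, and in particular the "Kia Rio 2019" row, is hypothetical data made up for this illustration; it is not part of the question.

```python
import re

import pandas as pd

search_for_these_values = ["Honda", "Toy", "Ford Focus", "Audi A4 2019"]

# Hypothetical brands: the second one shares only the year with the search list.
brands = pd.Series(["Audi A4", "Kia Rio 2019"])

# All words from the search phrases, then the same list without 4-digit years.
all_words = [w for phrase in search_for_these_values for w in phrase.split()]
no_years = [w for w in all_words if not re.search(r"\d{4}", w)]

# One column per word of each brand.
tokens = brands.str.split(expand=True)

with_years = tokens.isin(all_words).any(axis=1)
without_years = tokens.isin(no_years).any(axis=1)

print(with_years.tolist())     # the year alone is enough to match
print(without_years.tolist())  # only real brand/model words count
```

With the year kept in the list, the hypothetical "Kia Rio 2019" row matches purely because of "2019"; with the year removed, only "Audi A4" matches.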


 