
Check if String in List of Strings is in DataFrame Pandas

I have a question about matching strings from a list against a column in a DataFrame.

I read the question "Check if String in List of Strings is in Pandas DataFrame Column" and understand it, but my need is a little different.

Code:

import numpy as np
import pandas as pd

Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4', np.nan],
    'Price': [22000,25000,27000,35000, 29000],
    'Liscence Plate': ['ABC 123', 'XYZ 789', 'CBA 321', 'ZYX 987', 'DEF 456']}

df = pd.DataFrame(Cars,columns= ['Brand', 'Price', 'Liscence Plate'])

search_for_these_values = ['Honda', 'Toy', 'Ford Focus', 'Audi A4 2019']
# Regex alternation: 'Honda|Toy|Ford Focus|Audi A4 2019'
pattern = '|'.join(search_for_these_values)

# Substring match; na=False makes missing brands evaluate to False
df['Match'] = df["Brand"].str.contains(pattern, na=False)
print(df)

Output I get:

            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789   True
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987  False
4             NaN  29000        DEF 456  False

Output I want:

            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False

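One way, using whole-word matching:

# Split the phrases into single words: 'Honda|Toy|Ford|Focus|Audi|A4|2019'
pat = "|".join(search_for_these_values).replace(" ", "|")
# \b requires word boundaries, so 'Toy' does not match inside 'Toyota'
match = df["Brand"].str.findall(r"\b(%s)\b" % pat)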

Output:

0          [Honda]
1               []
2    [Ford, Focus]
3       [Audi, A4]
4              NaN
Name: Brand, dtype: object

You can then assign it back as a boolean column:

df["match"] = match.str.len().ge(1)

Final output:

            Brand  Price Liscence Plate  match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
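
The same idea can be written as a single str.contains call. The following is only a sketch, not part of the answer above: it assumes the search words should be treated as literal text (hence re.escape), and it uses a non-capturing group because str.contains emits a warning when the pattern contains capture groups. The name words is introduced here for illustration.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Brand": ["Honda Civic", "Toyota Corolla", "Ford Focus", "Audi A4", np.nan]}
)
search_for_these_values = ["Honda", "Toy", "Ford Focus", "Audi A4 2019"]

# Split every search phrase into single words and escape each one,
# so that characters such as "+" or "." would be matched literally.
words = sorted(
    {re.escape(word) for phrase in search_for_these_values for word in phrase.split()}
)

# \b ... \b requires whole-word matches; (?: ... ) is a non-capturing group.
pattern = r"\b(?:%s)\b" % "|".join(words)

# na=False turns the missing brand into False instead of NaN.
df["match"] = df["Brand"].str.contains(pattern, na=False)
print(df)
```

This skips the intermediate list of matches, which is fine when only the True/False flag is needed; keep findall if you also want to see which words matched.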

If we apply the rule you outlined, 'if one word matches, then True', then any row whose Brand contains '2019' would also return True, which I believe is not what you want.

Instead, you can build a new list by splitting each entry of search_for_these_values into words and excluding the years, using a list comprehension, and then use isin with any:

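# List comprehension: single words from the search phrases, without 4-digit years
import re
s = [word for cars in search_for_these_values for word in cars.split() if not re.search(r'\d{4}', word)]

# Assign True / False: split each brand into words and check whether any word is in s
# (axis must be passed by keyword in recent pandas versions)
df['Match'] = df['Brand'].str.split(expand=True).isin(s).any(axis=1)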

Prints back:

            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
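
For comparison, here is a minimal sketch of the same rule written with a plain Python set instead of an expanded DataFrame. The helper name shares_word and the set name allowed are hypothetical and only used for illustration; the sketch assumes that brands are split on whitespace and that any token containing four consecutive digits counts as a year.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Brand": ["Honda Civic", "Toyota Corolla", "Ford Focus", "Audi A4", np.nan]}
)
search_for_these_values = ["Honda", "Toy", "Ford Focus", "Audi A4 2019"]

# Single words from the search phrases, skipping tokens with a 4-digit run ("2019").
allowed = {
    word
    for phrase in search_for_these_values
    for word in phrase.split()
    if not re.search(r"\d{4}", word)
}


def shares_word(brand):
    """Hypothetical helper: True if brand is a string sharing a whole word with allowed."""
    if not isinstance(brand, str):  # NaN and other missing values
        return False
    return not allowed.isdisjoint(brand.split())


df["Match"] = df["Brand"].apply(shares_word)
print(df)
```

Because the year tokens are dropped from allowed, a brand such as 'Tesla 2019' would not be flagged merely for containing '2019'.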
