
How to test if a string contains one of the substrings stored in a list column in pandas?

My question is very similar to "How to test if a string contains one of the substrings in a list, in pandas?", except that the list of substrings to check varies by observation and is stored in a list column. Is there a way to access that list in a vectorized way by referring to the Series?

Example dataset

import pandas as pd

df = pd.DataFrame([{'a': 'Bob Smith is great.', 'b': ['Smith', 'foo']},
                   {'a': 'The Sun is a mass of incandescent gas.', 'b': ['Jones', 'bar']}])
print(df)

I'd like to generate a third column, 'c', that equals 1 if any of the 'b' strings is a substring of 'a' for its respective row, and zero otherwise. That is, I'd expect in this case:

                                        a             b  c
0                     Bob Smith is great.  [Smith, foo]  1
1  The Sun is a mass of incandescent gas.  [Jones, bar]  0

My attempt:

df['c'] = df.a.str.contains('|'.join(df.b))  # Does not work.


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_4092606/761645043.py in <module>
----> 1 df['c'] = df.a.str.contains('|'.join(df.b))  # Does not work.

TypeError: sequence item 0: expected str instance, list found
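
The traceback can be reproduced in isolation. Iterating over df.b yields one list per row, so str.join receives lists where it expects strings. The sketch below only illustrates that behaviour; the names items, error_message and row_patterns are made up for the example.

```python
import pandas as pd

df = pd.DataFrame([{'a': 'Bob Smith is great.', 'b': ['Smith', 'foo']},
                   {'a': 'The Sun is a mass of incandescent gas.', 'b': ['Jones', 'bar']}])

# Iterating over the column yields one list per row, not one string per row.
items = list(df.b)

try:
    '|'.join(items)          # same call as in the failing attempt
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Joining a single row's list does work, so a pattern has to be built per row.
row_patterns = ['|'.join(words) for words in df.b]
print(error_message)
print(row_patterns)
```

Even with one pattern per row, str.contains takes a single pattern for the whole Series, so the per-row patterns cannot be passed to it directly.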

You can just use zip and a list comprehension:

df['c'] = [int(any(w in a for w in b)) for a, b in zip(df.a, df.b)]

df
#                                        a             b  c
#0                     Bob Smith is great.  [Smith, foo]  1
#1  The Sun is a mass of incandescent gas.  [Jones, bar]  0
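
Because the substrings differ from row to row, this is a plain Python loop over the rows rather than a truly vectorized operation. For comparison, here is a sketch of the same logic written with DataFrame.apply; the helper name has_any_substring and the column name c_apply are made up for illustration. Both versions should give the same result, and the comprehension is typically the faster of the two.

```python
import pandas as pd

df = pd.DataFrame([{'a': 'Bob Smith is great.', 'b': ['Smith', 'foo']},
                   {'a': 'The Sun is a mass of incandescent gas.', 'b': ['Jones', 'bar']}])

def has_any_substring(row):
    # Hypothetical helper: 1 if any string in row['b'] occurs in row['a'], else 0.
    return int(any(w in row['a'] for w in row['b']))

# Row-wise apply: calls the helper once per row.
df['c_apply'] = df.apply(has_any_substring, axis=1)

# The zip-based comprehension from the answer, for comparison.
df['c'] = [int(any(w in a for w in b)) for a, b in zip(df.a, df.b)]

print(df[['c', 'c_apply']])
```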

If you don't care about case:

df['c'] = [int(any(w.lower() in a for w in b)) for a, b in zip(df.a.str.lower(), df.b)]
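
If you prefer to stay close to the '|'.join idea from the question, a case-insensitive match can also be sketched with one regular expression per row. This is only an illustration: the helper name matches_any is made up, and the example data uses 'SMITH' in upper case to show the effect. re.escape keeps characters such as '.' or '+' in the substrings from being read as regex syntax, and the empty-list check is needed because an empty pattern would match every string.

```python
import re
import pandas as pd

df = pd.DataFrame([{'a': 'Bob Smith is great.', 'b': ['SMITH', 'foo']},
                   {'a': 'The Sun is a mass of incandescent gas.', 'b': ['Jones', 'bar']}])

def matches_any(text, words):
    # Hypothetical helper: 1 if any of `words` occurs in `text`, ignoring case.
    if not words:
        return 0  # an empty alternation would otherwise match everything
    pattern = '|'.join(re.escape(w) for w in words)
    return int(re.search(pattern, text, flags=re.IGNORECASE) is not None)

df['c'] = [matches_any(a, b) for a, b in zip(df.a, df.b)]
print(df)
```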
