Check if a list of strings is in a pandas DataFrame column

I need to check whether every string in a list appears in a column. This is my code:

import pandas as pd
frame=["foo", "abc", "story"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']}) 
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x))]
print(df2)

This is the output:

Empty DataFrame
Columns: [col1, col2, Concat]
Index: []

But if I work with single letters, as shown in the code below, I get the desired output:

import pandas as pd
frame=["f", "a", "s"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']}) 
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x))]
print('\n',df2)

Desired output:

       col1   col2         Concat
0  foo abc  story  foo abc story

How can I work with whole strings instead of letters and still get the desired output?

The easiest way to do this is to change x to x.split() in the lambda on line 5 of your code:

import pandas as pd
frame=["foo", "abc", "story"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']}) 
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x.split()))]
print(df2)

Right now you're testing whether a set of words is a subset of a string. That is ambiguous, because a string could be read as a set of characters or as a set of words. Python iterates over a string character by character, since it has no knowledge of natural-language conventions like "words are separated by spaces". x.split() resolves the ambiguity by splitting the string into words on whitespace, which I assume is what you want.
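To make the difference concrete, here is a minimal sketch without pandas. The variable names text, chars_result, words_result and letters_result are only illustrative; the sample string is the first row of the Concat column from the question.

```python
frame = ["foo", "abc", "story"]
text = "foo abc story"  # first row of the Concat column

# issubset accepts any iterable; iterating a string yields single
# characters, so a whole word like "foo" is never found.
chars_result = set(frame).issubset(text)

# Splitting on whitespace yields whole words, which is what frame holds.
words_result = set(frame).issubset(text.split())

# Single letters do match, which is why the second sample in the
# question appeared to work.
letters_result = set(["f", "a", "s"]).issubset(text)

print(chars_result, words_result, letters_result)  # False True True
```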

In the first code sample you are comparing frame, a set of 3 words, against the string x in each row. issubset treats x as an iterable of single characters, so it returns False for every row: a multi-character word such as "foo" can never equal a single character.

If you split each string before calling apply, you test whether frame is a subset of a list of words:

df['Concat'].str.split().apply(lambda x: set(frame).issubset(x))
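Put together with the data from the question, the expression above can be used as a boolean mask. This is only a sketch: the names required and mask are mine, and passing the bound method required.issubset straight to apply is just a shorthand for the lambda.

```python
import pandas as pd

frame = ["foo", "abc", "story"]
df = pd.DataFrame({
    "col1": ["foo abc", "foobar abc", "bar32", "abc 45"],
    "col2": ["story", "epic", "story", "baz"],
})
df["Concat"] = df["col1"] + " " + df["col2"]

# Build the set once instead of once per row.
required = set(frame)

# str.split() turns each string into a list of words; issubset then
# checks that every required word is present in that list.
mask = df["Concat"].str.split().apply(required.issubset)

df2 = df[mask]
print(df2)
```

Only row 0 is kept. Note that this matches whole words only, so "foobar" in row 1 does not count as "foo".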
