Check if a list of strings is in a pandas DataFrame column

Question

I need to check whether every string in a list appears in a column. This is my code:

import pandas as pd
frame=["foo", "abc", "story"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']}) 
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x))]
print(df2)

This is the output:

 Empty DataFrame
Columns: [col1, col2, Concat]
Index: []

But if I work with single letters, as shown in the code below, I get the desired output:

import pandas as pd
frame=["f", "a", "s"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']}) 
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x))]
print('\n',df2)

Desired output:

       col1   col2         Concat
0  foo abc  story  foo abc story

How can I work with whole strings, not letters, and still get the desired output?

Answer 1

The easiest way to do this is to change x to x.split() on line 5:

import pandas as pd
frame = ["foo", "abc", "story"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']})
df["Concat"] = df["col1"] + ' ' + df["col2"]
df2 = df[df['Concat'].apply(lambda x: set(frame).issubset(x.split()))]  # split each row into words first
print(df2)

Right now you're testing whether a set of words is a subset of a string. That is ambiguous: should the string be treated as a collection of characters or as a collection of words? Python iterates over a string character by character, because it has no knowledge of natural-language conventions like 'words are separated by spaces', so issubset compares your words against individual characters and never finds them. x.split() resolves the ambiguity by splitting the string on whitespace, which I assume is what you want.
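To see the difference concretely, here is a minimal sketch using the text of the first row from the question. The variable names (words, text, as_characters, as_words, letters) are only for illustration.

```python
words = {"foo", "abc", "story"}
text = "foo abc story"

# Iterating a string yields single characters, so the words are compared
# against 'f', 'o', 'a', ... and none of them is found.
as_characters = words.issubset(text)

# Splitting on whitespace yields whole words, which is the intended comparison.
as_words = words.issubset(text.split())

# Single letters are found among the characters, which is why the
# second snippet in the question appeared to work.
letters = {"f", "a", "s"}.issubset(text)

print(as_characters, as_words, letters)  # False True True
```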

Answer 2

In the first code sample you are comparing frame, a set of 3 words, with the string x of each row. issubset treats a string as an iterable of its individual characters, so it returns False for every row: a multi-character word such as "foo" can never equal a single character.

If you split the column before calling apply, you test whether frame is a subset of each row's list of words:

df['Concat'].str.split().apply(lambda x: set(frame).issubset(x))
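That expression evaluates to a boolean Series, so it can be used directly as a row filter. A minimal sketch with the data from the question follows; the name mask is an illustrative choice, the rest follows the question's code.

```python
import pandas as pd

frame = ["foo", "abc", "story"]
df = pd.DataFrame({
    "col1": ["foo abc", "foobar abc", "bar32", "abc 45"],
    "col2": ["story", "epic", "story", "baz"],
})
df["Concat"] = df["col1"] + " " + df["col2"]

# One boolean per row: True when every word in frame appears
# among the whitespace-separated words of that row.
mask = df["Concat"].str.split().apply(lambda x: set(frame).issubset(x))

# Keep only the rows where the mask is True.
df2 = df[mask]
print(df2)
```

Note that this matches whole whitespace-separated words only, so a row containing "foobar" does not count as containing "foo".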

solution1
1 ACCEPTED 2020-04-16 09:58:48

solution2
1 2020-04-16 10:04:58
