I need to check whether every string in a list appears in a column. This is my code:
import pandas as pd
frame=["foo", "abc", "story"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']})
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x))]
print(df2)
This is the output:
Empty DataFrame
Columns: [col1, col2, Concat]
Index: []
But if I work with single letters, as shown in the code below, I get the desired output:
import pandas as pd
frame=["f", "a", "s"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']})
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x))]
print('\n',df2)
Desired output:
      col1   col2         Concat
0  foo abc  story  foo abc story
How can I work with whole strings instead of letters and still get the desired output?
The easiest way to do this is to change x to x.split() on line 5:
import pandas as pd
frame=["foo", "abc", "story"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']})
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x.split()))]
print(df2)
Right now you're testing whether a set of words is a subset of a string. That is ambiguous: should the string be read as a collection of characters or as a collection of words? Python iterates over a string character by character, because it has no knowledge of natural-language conventions like 'words are separated by spaces', so issubset ends up comparing each word in frame against single characters. x.split() resolves the ambiguity by splitting the string on whitespace, which I assume is what you want.
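As a rough illustration of the difference, here is a minimal sketch that runs the same subset test against one row's text, first as a raw string and then after splitting. The variable names text, chars, and words are only for illustration and are not part of the original code.

```python
frame = ["foo", "abc", "story"]
text = "foo abc story"  # the Concat value of the first row

# Iterating over a string yields single characters.
chars = set(text)
# Splitting on whitespace yields whole words.
words = set(text.split())

# Whole words are never equal to single characters, so this is False.
print(set(frame).issubset(text))
# After splitting, every word in frame is found, so this is True.
print(set(frame).issubset(text.split()))
# Single letters do match characters, which is why the second sample worked.
print(set(["f", "a", "s"]).issubset(text))
```

Note that the single-letter version only checks that each letter occurs somewhere in the row, not that the words do.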
In the first code sample you are comparing frame, a set of 3 words, with the string x in each row. issubset iterates over x, so the string is treated as a collection of single characters rather than a list of words. It returns False for every row, because a multi-character word such as "foo" can never equal a single character.
If you split x before calling apply, you will test whether frame is a subset of a list of words:
df['Concat'].str.split().apply(lambda x: set(frame).issubset(x))
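For completeness, here is a sketch of that expression used as a boolean mask on the DataFrame from the question. The names required and mask are introduced here for readability and are not in the original code.

```python
import pandas as pd

frame = ["foo", "abc", "story"]
df = pd.DataFrame({
    "col1": ["foo abc", "foobar abc", "bar32", "abc 45"],
    "col2": ["story", "epic", "story", "baz"],
})
df["Concat"] = df["col1"] + " " + df["col2"]

# Build the set once instead of once per row.
required = set(frame)

# Split each row into a list of words, then test the subset relation.
mask = df["Concat"].str.split().apply(lambda words: required.issubset(words))

df2 = df[mask]
print(df2)
```

Only row 0 is kept. Matching is on whole words, so 'foobar abc epic' does not count as containing 'foo'.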