Check if a list of strings is in a pandas DataFrame column

I need to check whether every string in a list appears in a column. This is my code:

import pandas as pd
frame=["foo", "abc", "story"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']}) 
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x))]
print(df2)

This is the output:

Empty DataFrame
Columns: [col1, col2, Concat]
Index: []

But if I use single letters instead, as in the code below, I get the desired output:

import pandas as pd
frame=["f", "a", "s"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']}) 
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x))]
print('\n',df2)

Desired output:

       col1   col2         Concat
0  foo abc  story  foo abc story

How can I work with whole strings rather than letters and still get the desired output?

The easiest way to do this is to change x to x.split() on line 5 of your code:

import pandas as pd
frame=["foo", "abc", "story"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']}) 
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x.split()))]
print(df2)

Right now you are testing whether a set of words is a subset of a string. That is ambiguous: should the string be treated as a set of characters or as a set of words? Python treats it as characters, because issubset accepts any iterable and iterating over a string yields one character at a time; Python knows nothing about natural-language conventions such as "words are separated by spaces". x.split() resolves the ambiguity by splitting the string into words on whitespace, which I assume is what you want.
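To see the difference concretely, here is a small sketch in plain Python, without pandas. The names row, as_chars and as_words are only illustrative and do not appear in the question:

```python
frame = ["foo", "abc", "story"]
row = "foo abc story"

# issubset accepts any iterable; iterating over a str yields single characters.
as_chars = set(row)          # the characters of the string, including the space
as_words = set(row.split())  # the whitespace-separated words

print(set(frame).issubset(row))             # False: "foo" is not a single character
print(set(frame).issubset(row.split()))     # True: every word is present
print(set(["f", "a", "s"]).issubset(row))   # True: each letter occurs in the string
```

This is why the single-letter version of frame appeared to work while the whole-word version matched nothing.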

In the first code sample you are comparing frame, a set of 3 words, with the string x in each row. issubset iterates over that string one character at a time, so each row is effectively treated as a set of single characters. A multi-character word such as "foo" can never be one of those characters, so issubset returns False for every row.

If you split the strings before calling apply, you test whether frame is a subset of each row's list of words:

df['Concat'].str.split().apply(lambda x: set(frame).issubset(x))
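For completeness, a minimal runnable sketch that puts this expression back into the question's example. The names wanted and mask are my own additions, and building the set once outside the lambda is just a small convenience, not something the answers require:

```python
import pandas as pd

frame = ["foo", "abc", "story"]
df = pd.DataFrame({
    "col1": ["foo abc", "foobar abc", "bar32", "abc 45"],
    "col2": ["story", "epic", "story", "baz"],
})
df["Concat"] = df["col1"] + " " + df["col2"]

# Build the set of required words once instead of once per row.
wanted = set(frame)

# Split each string into a list of words, then test whole-word membership.
mask = df["Concat"].str.split().apply(lambda words: wanted.issubset(words))
df2 = df[mask]
print(df2)
```

Only the first row should remain, matching the desired output in the question. Note that the match is on exact, case-sensitive whole words, so "foobar" does not count as "foo".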
