Check whether a list of strings is in a pandas DataFrame column

Question

I need to check whether every string in a list appears in a column. This is my code:

import pandas as pd
frame=["foo", "abc", "story"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']}) 
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x))]
print(df2)

This is the output:

Empty DataFrame
Columns: [col1, col2, Concat]
Index: []

But if I work with single letters, as shown in the code below, I get the desired output:

import pandas as pd
frame=["f", "a", "s"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']}) 
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x))]
print('\n',df2)

Desired output:

       col1   col2         Concat
0  foo abc  story  foo abc story

How can I work with whole strings instead of letters and still get the desired output?

Answer 1

The easiest way to do this is to change x to x.split() on line 5:

import pandas as pd
frame=["foo", "abc", "story"]
df = pd.DataFrame({'col1': ['foo abc', 'foobar abc', 'bar32', 'abc 45'], 'col2': ['story', 'epic', 'story', 'baz']}) 
df["Concat"] = df["col1"] +' '+ df["col2"]
df2=df[df['Concat'].apply(lambda x: set(frame).issubset(x.split()))]
print(df2)

Right now you're testing whether a set of words is a subset of a string. That operation is ambiguous, because a string can be read either as a collection of characters or as a collection of words. Python iterates over a string character by character, since it has no knowledge of natural-language conventions like 'words are separated by spaces', so issubset compares each word in frame against single characters and never finds a match. x.split() resolves the ambiguity by splitting the string on whitespace, which I assume is what you want.
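To make the difference concrete, here is a minimal sketch in plain Python (no pandas needed). It uses the Concat value of the first row from the question as the sample string; the names text, as_chars, and as_words are illustrative and do not appear in the original code.

```python
frame = ["foo", "abc", "story"]
text = "foo abc story"  # Concat value of row 0 in the question

# issubset accepts any iterable; iterating a string yields single characters.
as_chars = set(text)
# split() with no arguments breaks the string on runs of whitespace.
as_words = set(text.split())

# Words compared against characters: "foo" is not a single character.
print(set(frame).issubset(text))
# Words compared against words.
print(set(frame).issubset(text.split()))
# Single letters compared against characters, as in the second question sample.
print(set(["f", "a", "s"]).issubset(text))
```

The first check fails because no element of as_chars can equal a multi-character word, while the letter-based check passes only because each letter happens to occur somewhere in the string.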

Answer 2

In the first code sample you are comparing frame, a set of 3 elements, with the string x in each row. issubset treats that string as a collection of single characters, so it returns False for every row: a multi-character element such as "foo" can never equal a single character.

If you split x before calling apply, you test whether frame is a subset of a list of words instead:

df['Concat'].str.split().apply(lambda x: set(frame).issubset(x))
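The expression above returns a boolean Series, so it still has to be used as a mask to filter the frame. A minimal sketch of the full script, assuming the same data as in the question (the names wanted and mask are illustrative additions):

```python
import pandas as pd

frame = ["foo", "abc", "story"]
df = pd.DataFrame({
    "col1": ["foo abc", "foobar abc", "bar32", "abc 45"],
    "col2": ["story", "epic", "story", "baz"],
})
df["Concat"] = df["col1"] + " " + df["col2"]

# Build the set once instead of once per row.
wanted = set(frame)

# str.split() turns each string into a list of words;
# the lambda then checks that every wanted word is in that list.
mask = df["Concat"].str.split().apply(lambda words: wanted.issubset(words))

df2 = df[mask]
print(df2)
```

Note that this matches whole whitespace-separated words only, so "foobar abc epic" does not count as containing "foo".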
