Find exact phrases from a list in a string column when the phrases contain spaces

Question

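I want to create a new column containing 1 if any entry in the list matches the DataFrame's string column exactly, and 0 otherwise.

Entries in the list can contain multiple spaces between words, so I cannot use str.split() for exact matching.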

list_provided=["mul the","a b c"]
#how my dataframe looks
id  text
a    simultaneous there the
b    simultaneous there
c    mul why the
d    mul the
e    simul a b c
f    a c b

Expected output

id  text                      found
a    simultaneous there the    0
b    simultaneous there        0
c    mul why the               0
d    mul the                   1
e    simul a b c               1 
f    a c b                     0

The order of the words within each list element matters too!

Code I have tried so far

import numpy as np
import pandas as pd

data = pd.DataFrame({"id": ("a", "b", "c", "d", "e", "f"), "text": ("simultaneous there the", "simultaneous there", "mul why the", "mul the", "simul a b c", "a c b")})
list_of_word = ["mul the", "a b c"]
pattern = '|'.join(list_of_word)
# Compares single tokens against the list, so a multi-word entry can never match
data['found'] = data['text'].apply(lambda x: sum(i in list_of_word for i in x.split()))
data['found'] = np.where(data['found'] > 0, 1, 0)
data
###Output generated###
id  text                   found
a   simultaneous there the  0
b   simultaneous there      0
c   mul why the             0
d   mul the                 0
e   simul a b c             0
f   a c b                   0

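How can I get the expected output, given that I have to search the DataFrame string column for exact matches of list entries that contain multiple spaces between words?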

Answer 1

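You are nearly there: you have done all the groundwork, and all that remains is to call the right function, in this case str.contains. It treats pattern as a regular expression by default, so the |-joined pattern matches a row whenever any list entry occurs in the text.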

data['found'] = data.text.str.contains(pattern).astype(int)
data

  id                    text  found
0  a  simultaneous there the      0
1  b      simultaneous there      0
2  c             mul why the      0
3  d                 mul the      1
4  e             simul a b c      1
5  f                   a c b      0

If the list entries themselves contain regex metacharacters such as the OR pipe (|), escape them first:

import re
pattern = '|'.join([re.escape(i) for i in list_of_word])
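To see why escaping matters, here is a minimal sketch. The names texts, phrases, found and loose, and the sample strings, are made up for illustration and are not from the question. An unescaped "." matches any character, and an unescaped "c++ code" would not even compile as a regex.

```python
import re
import pandas as pd

# Hypothetical sample data, chosen only to show the effect of escaping
texts = pd.Series(["some c++ code", "axb", "a.b here"])
phrases = ["c++ code", "a.b"]

# Escape each phrase so it is matched literally, then join with the OR pipe
escaped = "|".join(re.escape(p) for p in phrases)
found = texts.str.contains(escaped, regex=True).astype(int)

# For comparison: the unescaped "a.b" lets "." match any character
loose = texts.str.contains("a.b", regex=True).astype(int)

print(found.tolist())  # [1, 0, 1]
print(loose.tolist())  # [0, 1, 1]
```

With escaping, "axb" is no longer reported as a match for "a.b".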

Answer 2

You can do this with str.contains, which also accepts a regular expression.

data['found'] = np.where(data['text'].str.contains(pattern),1,0)
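A plain substring pattern would also match an entry that sits inside a longer word (for example "xmul the"), and it requires exactly the spacing written in the list. If that is not wanted, one possible variant, which is not part of either answer, anchors each entry with \b word boundaries and allows any run of whitespace between words. The helper phrase_to_regex and the extra series are hypothetical, and \b assumes every entry starts and ends with a word character.

```python
import re
import pandas as pd

data = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e", "f"],
    "text": ["simultaneous there the", "simultaneous there", "mul why the",
             "mul the", "simul a b c", "a c b"],
})
list_of_word = ["mul the", "a b c"]

def phrase_to_regex(phrase):
    """Hypothetical helper: whole-word match with flexible whitespace."""
    words = [re.escape(w) for w in phrase.split()]
    return r"\b" + r"\s+".join(words) + r"\b"

pattern = "|".join(phrase_to_regex(p) for p in list_of_word)
data["found"] = data["text"].str.contains(pattern, regex=True).astype(int)
print(data["found"].tolist())  # [0, 0, 0, 1, 1, 0]

# Extra made-up rows: a partial word is rejected, extra spaces are accepted
extra = pd.Series(["xmul   the", "mul   the"])
extra_found = extra.str.contains(pattern, regex=True).astype(int)
print(extra_found.tolist())  # [0, 1]
```

On the question's data this gives the same result as the plain pattern; the difference only shows on rows like the extra ones.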

Answer 1: score 1, accepted, 2018-04-11 11:26:22

Answer 2: score 0, 2018-04-11 11:27:10
