Finding exact word in a column of strings from a list of words having spaces in between
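I want to create a new column containing 1 or 0, depending on whether any phrase from a list matches exactly within a DataFrame string column.
The phrases in the list can have multiple spaces between their words, so I can't use str.split() for exact matching.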
list_provided=["mul the","a b c"]
#how my dataframe looks
id text
a simultaneous there the
b simultaneous there
c mul why the
d mul the
e simul a b c
f a c b
Expected output
id text found
a simultaneous there the 0
b simultaneous there 0
c mul why the 0
d mul the 1
e simul a b c 1
f a c b 0
The order of the words within each list element matters too!
Code I have tried so far
import numpy as np
import pandas as pd
data=pd.DataFrame({"id":("a","b","c","d","e","f"), "text":("simultaneous there the","simultaneous there","mul why the","mul the","simul a b c","a c b")})
list_of_word=["mul the","a b c"]
pattern = '|'.join(list_of_word)
# each single token from x.split() is compared with the whole phrases, so nothing ever matches
data['found'] = data['text'].apply(lambda x: sum(i in list_of_word for i in x.split()))
data['found']=np.where(data['found']>0,1,0)
data
###Output generated###
id text found
a simultaneous there the 0
b simultaneous there 0
c mul why the 0
d mul the 0
e simul a b c 0
f a c b 0
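How can I get the expected output, where I have to search the DataFrame string column for an exact match of list phrases that have multiple spaces between their words?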
You're almost there: you've done all the groundwork, and all that's left is to call the right function, which in this case is str.contains.
data['found'] = data.text.str.contains(pattern).astype(int)
data
id text found
0 a simultaneous there the 0
1 b simultaneous there 0
2 c mul why the 0
3 d mul the 1
4 e simul a b c 1
5 f a c b 0
If the phrases in your pattern themselves contain regex OR pipes (|), try escaping them first:
import re
pattern = '|'.join([re.escape(i) for i in list_of_word])
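To see why escaping matters, here is a small sketch that uses only the standard `re` module. The phrases and test strings are made up for illustration and do not come from the question; the search semantics shown are those of an ordinary regex search.

```python
import re

# Hypothetical phrases containing regex metacharacters (not from the question).
phrases = ["a|b", "c.d"]

raw = "|".join(phrases)                            # 'a|b|c.d'
escaped = "|".join(re.escape(p) for p in phrases)  # metacharacters become literals

# Unescaped: '|' splits "a|b" into two alternatives, so a lone 'b' matches.
raw_hit = bool(re.search(raw, "only b here"))          # True
# Escaped: the whole phrase must appear literally.
escaped_hit = bool(re.search(escaped, "only b here"))  # False
literal_hit = bool(re.search(escaped, "x a|b y"))      # True

print(raw_hit, escaped_hit, literal_hit)
```

Without escaping, the joined pattern silently matches more than intended, which is easy to miss on small test data.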
You can achieve this with str.contains, which also accepts a regular expression:
data['found'] = np.where(data['text'].str.contains(pattern),1,0)
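Note that str.contains with a plain joined pattern performs a substring search and expects exactly the spacing written in each phrase. The following is a minimal sketch, not part of the original answers, of one way to tolerate runs of whitespace and to require whole-word matches. The helper name `phrase_to_regex` is hypothetical, and the extra spaces in the sample text are an assumption added to exercise the whitespace handling.

```python
import re
import pandas as pd

# Same rows as in the question, with extra spaces added (assumption) in rows d and e.
data = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e", "f"],
    "text": ["simultaneous there the", "simultaneous there", "mul why the",
             "mul  the", "simul a   b c", "a c b"],
})
list_of_word = ["mul the", "a b c"]

def phrase_to_regex(phrase):
    """Hypothetical helper: escape each word, allow any whitespace run
    between words, and anchor the phrase on word boundaries."""
    words = [re.escape(w) for w in phrase.split()]
    return r"\b" + r"\s+".join(words) + r"\b"

pattern = "|".join(phrase_to_regex(p) for p in list_of_word)

# str.contains treats the pattern as a regex by default; regex=True makes that explicit.
data["found"] = data["text"].str.contains(pattern, regex=True).astype(int)
found = data["found"].tolist()
print(found)  # -> [0, 0, 0, 1, 1, 0]
```

The `\b` anchors stop "mul the" from matching inside a longer word such as "simul the", while word order is still enforced because the words are joined in their original sequence.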