Checking if any word in a string appears in a list using Python

I have a pandas DataFrame that contains a column of several thousand comments. I would like to iterate through every row in the column, check whether the comment contains any word from a list of words I've created, and, if it does, label it as such in a separate column. This is what I have so far in my code:

retirement_words_list = ['match','matching','401k','retirement','retire','rsu','rrsp']

def word_checker(row):
    for sentence in df['comments']: 
        if any(word in re.findall(r'\w+', sentence.lower()) for word in retirement_words_list):
            return '401k/Retirement'
        else:
            return 'Other'

df['topic'] = df.apply(word_checker,axis=1)    

The code is labeling every single comment in my DataFrame as 'Other', even though I have double-checked that many comments contain one or several of the words from my list. Any ideas for how I may correct my code? I'd greatly appreciate your help.

It is probably more convenient to have a set version of retirement_words_list (for efficient inclusion testing) and then loop over the words in the sentence, checking inclusion in this set, rather than the other way round:

retirement_words_list = ['match','matching','401k','retirement','retire','rsu','rrsp']

retirement_words_set = set(retirement_words_list)

and then

    if any(word in retirement_words_set for word in sentence.lower().split()):
        # .... etc ....

Your code is just checking whether any word in retirement_words_list is a substring of the sentence, but in fact you must be looking for whole-word matches, or it wouldn't make sense to include 'matching' and 'retirement' on the list given that 'match' and 'retire' are already included. Hence the use of split, which is also the reason why we can reverse the logic.
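To make the difference concrete, here is a small sketch comparing a substring test, a whole-word test on split() tokens, and a whole-word test on regex tokens. The sample sentences and the helper names (substring_match, split_match, token_match) are invented for illustration and are not part of the original question.

```python
import re

retirement_words_set = {'match', 'matching', '401k', 'retirement', 'retire', 'rsu', 'rrsp'}

def substring_match(sentence):
    # True if any list word occurs anywhere in the text, even inside another word
    lowered = sentence.lower()
    return any(word in lowered for word in retirement_words_set)

def split_match(sentence):
    # Whole-word test on whitespace-separated tokens; punctuation stays attached
    return any(word in retirement_words_set for word in sentence.lower().split())

def token_match(sentence):
    # Whole-word test on \w+ tokens, so surrounding punctuation is dropped
    tokens = re.findall(r'\w+', sentence.lower())
    return any(word in retirement_words_set for word in tokens)

for text in ["I lost my matchbox", "Please improve the 401k.", "No RSU refresh this year"]:
    print(text, substring_match(text), split_match(text), token_match(text))
```

Note that split() alone leaves punctuation attached to a word, so a comment ending in "401k." is only caught by the regex-token variant. Which behavior is wanted depends on the data.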

NOTE: You may need some further changes, because your function word_checker has a parameter called row which it does not use. Instead, it loops over the whole comments column and returns during the first iteration, so every row receives the label of the first comment. Possibly what you meant to do was something like:

def word_checker(sentence):
    if any(word in retirement_words_set for word in sentence.lower().split()):
        return '401k/Retirement'
    else:
        return 'Other'

and:

df['topic'] = df['comments'].apply(word_checker)

where sentence is the contents of each row from the comments column. Note that apply on a single column (a Series) takes no axis argument; the function is simply called once per value.
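The effect of the unused row parameter can be reproduced without pandas. In this sketch a plain list stands in for df['comments']; the sample comments and the function names buggy_checker and fixed_checker are invented for illustration.

```python
import re

retirement_words_set = {'match', 'matching', '401k', 'retirement', 'retire', 'rsu', 'rrsp'}

# Stand-in for df['comments'] (hypothetical sample data)
comments = ["The cafeteria is great", "Please raise the 401k match", "More RSU grants"]

def buggy_checker(row):
    # Mirrors the question: ignores `row`, loops over every comment,
    # and returns during the first iteration
    for sentence in comments:
        if any(w in retirement_words_set for w in re.findall(r'\w+', sentence.lower())):
            return '401k/Retirement'
        else:
            return 'Other'

def fixed_checker(sentence):
    # Looks only at the single comment it was given
    words = re.findall(r'\w+', sentence.lower())
    return '401k/Retirement' if any(w in retirement_words_set for w in words) else 'Other'

print([buggy_checker(c) for c in comments])  # every label reflects the first comment only
print([fixed_checker(c) for c in comments])  # each label reflects its own comment
```

Because the first sample comment contains none of the listed words, the buggy version labels all three comments 'Other', which matches the symptom described in the question. With pandas, the fixed function would be applied as df['comments'].apply(fixed_checker).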

Wouldn't this simplified version (without a regular expression) work?

if any(word in sentence.lower() for word in retirement_words_list):
