Looping through a list and rows for keyword matching in a pandas DataFrame

Question

I have a dataframe that looks like this. It has one column, labeled 'utterances'. df.utterances contains rows whose values are strings of n words.

                                          utterances
0                                     okay go ahead.
1                                  Um, let me think.
2  nan that's not very encouraging. If they had a...
3  they wouldn't make you want to do it. nan nan ...
4  Yeah. The problem is though, it just, if we pu...

I also have a list of specific words. It is called specific_words. It looks like this:

specific_words = ['happy', 'good', 'encouraging', 'joyful']

I want to check whether any of the words in specific_words appear in any of the utterances. Essentially, I want to loop through every row in df.utterances and, for each row, loop through specific_words to look for matches. If there is a match, I want a boolean column next to df.utterances that shows this.

def query_text_by_keyword(df, word_list):
    for word in word_list:
        for utt in df.utterance:
            if word in utt:
                match = True
            else:
                match = False
    return match

df['query_match'] = df.apply(query_text_by_keyword, axis=1, args=(specific_words,))

It doesn't break, but it returns False for every row when it shouldn't. For example, the first few rows should look like this:

                                          utterances  query_match
0                                     okay go ahead.        False
1                                  Um, let me think.        False
2  nan that's not very encouraging. If they had a...         True
3  they wouldn't make you want to do it. nan nan ...        False
4  Yeah. The problem is though, it just, if we pu...        False

Edit

@furas made a great suggestion that solves the initial question. However, I would also like to add another column that contains the specific word(s) from the query that produced the match. Example:

                                         utterances  query_match           word
0                                     okay go ahead        False            NaN
1                                  Um, let me think        False            NaN
2  nan that's not very encouraging. If they had a..         True  'encouraging'
3  they wouldn't make you want to do it. nan nan ..        False            NaN
4  Yeah. The problem is though, it just, if we pu..        False            NaN

Answer 1

You can use a regex with str.contains(regex):

df['utterances'].str.contains("happy|good|encouraging|joyful")

You can build this regex with:

query = '|'.join(specific_words)

You can also use str.lower(), because the strings may contain uppercase characters.

import pandas as pd

df = pd.DataFrame({
    'utterances':[
        'okay go ahead',
        'Um, let me think.',
        'nan that\'s not very encouraging. If they had a...',
        'they wouldn\'t make you want to do it. nan nan ...',
        'Yeah. The problem is though, it just, if we pu...',
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']

query = '|'.join(specific_words)

df['query_match'] = df['utterances'].str.lower().str.contains(query)

print(df)

Result

                                          utterances  query_match
0                                      okay go ahead        False
1                                  Um, let me think.        False
2  nan that's not very encouraging. If they had a...         True
3  they wouldn't make you want to do it. nan nan ...        False
4  Yeah. The problem is though, it just, if we pu...        False

EDIT: as @HenryYik mentioned in a comment, you can use case=False instead of str.lower():

df['query_match'] = df['utterances'].str.contains(query, case=False)

More in the docs: pandas.Series.str.contains
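One caveat with a plain '|'.join(...) pattern is that it matches substrings (for example "good" inside "Goodness") and treats any regex metacharacters in the keywords literally only if they are escaped. The following is a minimal sketch, not part of the original answer, that escapes each keyword with re.escape, adds \b word boundaries, and passes na=False so missing values become False. The 'Goodness, no.' and None rows are hypothetical, added only to show the difference.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'utterances': [
        'okay go ahead',
        "nan that's not very encouraging. If they had a...",
        'Goodness, no.',   # hypothetical row: "good" appears only as a substring
        None,              # hypothetical row: a missing value
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']

# Escape each keyword and require a word boundary on both sides.
# The group is non-capturing, so str.contains does not warn about match groups.
query = r'\b(?:' + '|'.join(re.escape(w) for w in specific_words) + r')\b'

# na=False turns missing utterances into False instead of NaN.
df['query_match'] = df['utterances'].str.contains(query, case=False, na=False)

print(df)
```

Whether whole-word matching is wanted depends on the data; without the \b anchors the 'Goodness, no.' row would match.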

EDIT: to get the matching word, you can use str.extract() with the regex wrapped in a capture group (...):

df['word'] = df['utterances'].str.extract( "(happy|good|encouraging|joyful)" )

Working example:

import pandas as pd

df = pd.DataFrame({
    'utterances':[
        'okay go ahead',
        'Um, let me think.',
        'nan that\'s not very encouraging. If they had a...',
        'they wouldn\'t make you want to do it. nan nan ...',
        'Yeah. The problem is though, it just, if we pu...',
        'Yeah. happy good',
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']

query = '|'.join(specific_words)

df['query_match'] = df['utterances'].str.contains(query, case=False)
df['word'] = df['utterances'].str.extract( '({})'.format(query) )

print(df)

In this example I added 'Yeah. happy good' to test which word is returned, happy or good. It returns the first matching word.

Result:

                                          utterances  query_match         word
0                                      okay go ahead        False          NaN
1                                  Um, let me think.        False          NaN
2  nan that's not very encouraging. If they had a...         True  encouraging
3  they wouldn't make you want to do it. nan nan ...        False          NaN
4  Yeah. The problem is though, it just, if we pu...        False          NaN
5                                   Yeah. happy good         True        happy
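Since the question asks for the matching word(s), and str.extract() keeps only the first match, one possible extension is str.findall(), which returns every match per row as a list. This is a sketch under assumptions rather than part of the original answer: the 'words' column name and the mixed-case 'Yeah. Happy good' row are hypothetical, and flags=re.IGNORECASE is used so the extraction is case-insensitive like the case=False check.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'utterances': [
        'okay go ahead',
        "nan that's not very encouraging. If they had a...",
        'Yeah. Happy good',   # hypothetical row: mixed case, two keywords
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']
query = '|'.join(specific_words)

# findall returns a list of all keyword occurrences for each row.
found = df['utterances'].str.findall(query, flags=re.IGNORECASE)

# Normalise the matches to lower case so 'Happy' is reported as 'happy'.
df['words'] = found.apply(lambda ws: [w.lower() for w in ws])

# A row matches when its list of found words is not empty.
df['query_match'] = df['words'].map(len) > 0

print(df)
```

Rows without a match get an empty list here rather than NaN, which keeps the column type uniform.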

BTW: now you can even do

df['query_match'] = ~df['word'].isna()

or

df['query_match'] = df['word'].notna()
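For comparison, the looping approach from the question can also be made to work. A likely reason the original returned False everywhere is that match is overwritten on every iteration, so only the last comparison survives. The sketch below is one possible fix, not the accepted method: it applies the function to the 'utterances' column value by value and uses any() to stop at the first keyword found. The isinstance check for missing values is an added assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'utterances': [
        'okay go ahead',
        "nan that's not very encouraging. If they had a...",
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']

def query_text_by_keyword(utterance, word_list):
    """Return True if any keyword occurs as a substring of the utterance."""
    if not isinstance(utterance, str):   # assumption: treat NaN/None as no match
        return False
    text = utterance.lower()
    return any(word in text for word in word_list)

# Series.apply calls the function once per value; args passes the keyword list.
df['query_match'] = df['utterances'].apply(query_text_by_keyword, args=(specific_words,))

print(df)
```

The vectorised str.contains() version is usually shorter and faster on large frames; the loop is mainly useful when the matching logic is too custom for a regex.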

Answer 1
1 Accepted 2020-02-11 02:52:26
