Looping through list and row for keyword match in pandas dataframe
I have a dataframe that looks like this. It has one column labeled 'utterances'. df.utterances contains rows whose values are strings of n words.
   utterances
0  okay go ahead.
1  Um, let me think.
2  nan that's not very encouraging. If they had a...
3  they wouldn't make you want to do it. nan nan ...
4  Yeah. The problem is though, it just, if we pu...
I also have a list of specific words. It is called specific_words. It looks like this:
specific_words = ['happy', 'good', 'encouraging', 'joyful']
I want to check whether any of the words from specific_words are found in any of the utterances. Essentially, I want to loop through every row in df.utterances and, for each row, loop through specific_words to look for matches. If there is a match, I want a boolean column next to df.utterances that shows this.
def query_text_by_keyword(df, word_list):
    for word in word_list:
        for utt in df.utterance:
            if word in utt:
                match = True
            else:
                match = False
    return match

df['query_match'] = df.apply(query_text_by_keyword, axis=1, args=(specific_words,))
It doesn't break, but it returns False for every row when it shouldn't. For example, the first few rows should look like this:
   utterances                                          query_match
0  okay go ahead.                                      False
1  Um, let me think.                                   False
2  nan that's not very encouraging. If they had a...   True
3  they wouldn't make you want to do it. nan nan ...   False
4  Yeah. The problem is though, it just, if we pu...   False
@furas made a great suggestion to solve the initial question. However, I would also like to add another column that contains the specific word(s) from the query that produced the match. Example:
   utterances                                         query_match  word
0  okay go ahead                                      False        NaN
1  Um, let me think                                   False        NaN
2  nan that's not very encouraging. If they had a..   True         'encouraging'
3  they wouldn't make you want to do it. nan nan ..   False        NaN
4  Yeah. The problem is though, it just, if we pu..   False        NaN
You can use regex with str.contains(regex)
df['utterances'].str.contains("happy|good|encouraging|joyful")
You can create this regex with
query = '|'.join(specific_words)
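A plain '|'.join() treats every keyword as a regex fragment and also matches inside longer words. The sketch below is one possible hardening, not part of the original answer: it assumes you want whole-word matches only, and the name pattern and the sample Series s are made up for illustration.

```python
import re
import pandas as pd

specific_words = ['happy', 'good', 'encouraging', 'joyful']

# re.escape() protects characters such as "." or "+" inside a keyword;
# \b ... \b keeps "good" from matching inside "goodbye".
# (?: ... ) is a non-capturing group, so str.contains() sees no capture groups.
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in specific_words) + r')\b'

# Hypothetical sample data, only for this illustration
s = pd.Series(['that is good', 'goodbye then', 'not very encouraging.'])
matches = s.str.contains(pattern)
print(matches.tolist())
```

Whether substring hits such as "goodbye" should count is a choice that depends on your data; drop the \b anchors if they should.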
You can also use str.lower() because strings may have uppercase chars.
import pandas as pd
df = pd.DataFrame({
'utterances':[
'okay go ahead',
'Um, let me think.',
'nan that\'s not very encouraging. If they had a...',
'they wouldn\'t make you want to do it. nan nan ...',
'Yeah. The problem is though, it just, if we pu...',
]
})
specific_words = ['happy', 'good', 'encouraging', 'joyful']
query = '|'.join(specific_words)
df['query_match'] = df['utterances'].str.lower().str.contains(query)
print(df)
Result
utterances query_match
0 okay go ahead False
1 Um, let me think. False
2 nan that's not very encouraging. If they had a... True
3 they wouldn't make you want to do it. nan nan ... False
4 Yeah. The problem is though, it just, if we pu... False
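For comparison, the loop-based idea from the question can also be made to work by applying a function to the column instead of the whole DataFrame. This is only a sketch under assumptions: the function receives a single utterance, non-string values are treated as "no match", and matching is case-insensitive substring search.

```python
import pandas as pd

def query_text_by_keyword(utterance, word_list):
    # Receives one cell value (a single utterance), not the DataFrame.
    # Assumption: anything that is not a string (e.g. a missing value) counts as no match.
    if not isinstance(utterance, str):
        return False
    text = utterance.lower()
    # any() stops at the first keyword found in the text
    return any(word in text for word in word_list)

df = pd.DataFrame({
    'utterances': [
        'okay go ahead',
        "nan that's not very encouraging. If they had a...",
    ]
})
specific_words = ['happy', 'good', 'encouraging', 'joyful']

# Series.apply() calls the function once per row value
df['query_match'] = df['utterances'].apply(query_text_by_keyword, args=(specific_words,))
print(df)
```

The vectorized str.contains() version is usually shorter and faster; the function form is mainly useful when the per-row logic grows beyond a simple pattern match.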
EDIT: as @HenryYik mentioned in a comment, you can use case=False instead of str.lower()
df['query_match'] = df['utterances'].str.contains(query, case=False)
More in doc: pandas.Series.str.contains
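If the column can hold real missing values (an assumption; the sample data only shows the text "nan" inside strings), the na parameter of str.contains() lets you state explicitly what a missing row should produce. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one genuinely missing utterance
s = pd.Series(['not very encouraging', np.nan, 'okay go ahead'])
query = 'happy|good|encouraging|joyful'

# na=False makes a missing utterance count as "no match"
# instead of leaving the result for that row to the pandas default.
no_na = s.str.contains(query, case=False, na=False)
print(no_na.tolist())
```

This keeps the result a clean boolean column, which matters if you later use it as a filter mask.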
EDIT: to get the matching word you can use str.extract() with the regex in (...)
df['word'] = df['utterances'].str.extract( "(happy|good|encouraging|joyful)" )
Working example:
import pandas as pd
df = pd.DataFrame({
'utterances':[
'okay go ahead',
'Um, let me think.',
'nan that\'s not very encouraging. If they had a...',
'they wouldn\'t make you want to do it. nan nan ...',
'Yeah. The problem is though, it just, if we pu...',
'Yeah. happy good',
]
})
specific_words = ['happy', 'good', 'encouraging', 'joyful']
query = '|'.join(specific_words)
df['query_match'] = df['utterances'].str.contains(query, case=False)
df['word'] = df['utterances'].str.extract( '({})'.format(query) )
print(df)
In the example I added 'Yeah. happy good' to test which word will be returned, happy or good. It returns the first matching word.
Result:
utterances query_match word
0 okay go ahead False NaN
1 Um, let me think. False NaN
2 nan that's not very encouraging. If they had a... True encouraging
3 they wouldn't make you want to do it. nan nan ... False NaN
4 Yeah. The problem is though, it just, if we pu... False NaN
5 Yeah. happy good True happy
BTW: now you can even do
df['query_match'] = ~df['word'].isna()
or
df['query_match'] = df['word'].notna()
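Since str.extract() returns only the first match, a row such as 'Yeah. happy good' loses its second keyword. If you need every matching word, one option is str.findall(), sketched below. The column name words and the sample rows are assumptions for illustration, and re.IGNORECASE is used so the matching agrees with case=False.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'utterances': [
        'okay go ahead',
        'Yeah. happy good',
        'Not very Encouraging.',
    ]
})
specific_words = ['happy', 'good', 'encouraging', 'joyful']
query = '|'.join(specific_words)

# findall() returns a list with every match per row (an empty list when there is none)
df['words'] = df['utterances'].str.findall(query, flags=re.IGNORECASE)

# A non-empty list means at least one keyword was found
df['query_match'] = df['words'].map(len) > 0
print(df)
```

Note that the matched words keep the casing they have in the text, so 'Encouraging' is returned as written rather than lowercased.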