
Looping through list and row for keyword match in pandas dataframe

I have a dataframe with a single column labeled 'utterances'. Each row of df.utterances is a string containing some number of words.

                                          utterances
0                                     okay go ahead.
1                                  Um, let me think.
2  nan that's not very encouraging. If they had a...
3  they wouldn't make you want to do it. nan nan ...
4  Yeah. The problem is though, it just, if we pu...

I also have a list of specific words, called specific_words. It looks like this:

 specific_words = ['happy', 'good', 'encouraging', 'joyful']

I want to check whether any of the words from specific_words are found in any of the utterances. Essentially, I want to loop through every row in df.utterances and, for each row, loop through specific_words to look for matches. If there is a match, I want a boolean column next to df.utterances that shows this.

def query_text_by_keyword(df, word_list):
    for word in word_list:
        for utt in df.utterance:
            if word in utt:
                match = True
            else:
                match = False
    return match

df['query_match'] = df.apply(query_text_by_keyword, axis=1, args=(specific_words,))

It doesn't break, but it just returns False for every row, when it shouldn't. For example, the first few rows should look like this:

                                          utterances  query_match
0                                     okay go ahead.        False
1                                  Um, let me think.        False
2  nan that's not very encouraging. If they had a...         True
3  they wouldn't make you want to do it. nan nan ...        False
4  Yeah. The problem is though, it just, if we pu...        False

Edit

@furas made a great suggestion to solve the initial question. However, I would also like to add another column that contains the specific word(s) from the query that indicates a match. Example:

                                         utterances  query_match           word
0                                     okay go ahead        False            NaN
1                                  Um, let me think        False            NaN
2  nan that's not very encouraging. If they had a..         True  'encouraging'
3  they wouldn't make you want to do it. nan nan ..        False            NaN
4  Yeah. The problem is though, it just, if we pu..        False            NaN

You can use regex with str.contains(regex)

df['utterances'].str.contains("happy|good|encouraging|joyful")

You can create this regex with

query = '|'.join(specific_words)
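
One caveat with a plain join: the pattern matches substrings, so 'good' would also match 'goodbye', and a word containing regex metacharacters would be read as a pattern. Below is a minimal sketch, assuming whole-word matching is what you want. The helper name `build_query` and the sample strings are made up for illustration and are not part of pandas or the question.

```python
import re

import pandas as pd


def build_query(words):
    """Join words into one alternation that matches whole words only."""
    # re.escape protects characters such as '.' or '+' inside a word
    escaped = [re.escape(w) for w in words]
    # \b anchors the alternation at word boundaries;
    # (?:...) is a non-capturing group, so str.contains does not warn
    return r'\b(?:' + '|'.join(escaped) + r')\b'


specific_words = ['happy', 'good', 'encouraging', 'joyful']
query = build_query(specific_words)

# Hypothetical sample data, only to show the difference
s = pd.Series(['That is good.', 'Goodbye then.', 'nothing here'])
matches = s.str.contains(query, case=False)
print(matches.tolist())
```

Whether substring or whole-word matching is right depends on your data, so treat this as an option rather than a required change.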

You can also use str.lower() because strings may have uppercase chars.

import pandas as pd

df = pd.DataFrame({
    'utterances':[
        'okay go ahead',
        'Um, let me think.',
        'nan that\'s not very encouraging. If they had a...',
        'they wouldn\'t make you want to do it. nan nan ...',
        'Yeah. The problem is though, it just, if we pu...',
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']

query = '|'.join(specific_words)

df['query_match'] = df['utterances'].str.lower().str.contains(query)

print(df)

Result

                                          utterances  query_match
0                                      okay go ahead        False
1                                  Um, let me think.        False
2  nan that's not very encouraging. If they had a...         True
3  they wouldn't make you want to do it. nan nan ...        False
4  Yeah. The problem is though, it just, if we pu...        False

EDIT: as @HenryYik mentioned in a comment, you can use case=False instead of str.lower()

df['query_match'] = df['utterances'].str.contains(query, case=False)

More in doc: pandas.Series.str.contains


EDIT: to get the matching word you can use str.extract() with the regex in a capture group (...)

df['word'] = df['utterances'].str.extract( "(happy|good|encouraging|joyful)" )

Working example:

import pandas as pd

df = pd.DataFrame({
    'utterances':[
        'okay go ahead',
        'Um, let me think.',
        'nan that\'s not very encouraging. If they had a...',
        'they wouldn\'t make you want to do it. nan nan ...',
        'Yeah. The problem is though, it just, if we pu...',
        'Yeah. happy good',
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']

query = '|'.join(specific_words)

df['query_match'] = df['utterances'].str.contains(query, case=False)
df['word'] = df['utterances'].str.extract( '({})'.format(query) )

print(df)

In the example I added 'Yeah. happy good' to test which word would be returned, happy or good. It returns the first matching word.

Result:

                                          utterances  query_match         word
0                                      okay go ahead        False          NaN
1                                  Um, let me think.        False          NaN
2  nan that's not very encouraging. If they had a...         True  encouraging
3  they wouldn't make you want to do it. nan nan ...        False          NaN
4  Yeah. The problem is though, it just, if we pu...        False          NaN
5                                   Yeah. happy good         True        happy

BTW: now you can even do

df['query_match'] = ~df['word'].isna()

or

df['query_match'] = df['word'].notna()
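
If you need every matching word rather than only the first one, str.findall() may be a better fit than str.extract(). The following is a rough sketch under a few assumptions: the column name 'words' is my own choice, the sample rows are shortened, and flags=re.IGNORECASE is used so that the word lookup is case-insensitive like the case=False check above.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'utterances': [
        'okay go ahead',
        "nan that's not very encouraging. If they had a...",
        'Yeah. Happy good',
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']
query = '|'.join(specific_words)

# findall returns a list with every match per row (empty list if none)
df['words'] = df['utterances'].str.findall(query, flags=re.IGNORECASE)

# a row counts as a match when its list is non-empty
df['query_match'] = df['words'].map(len) > 0

print(df)
```

Note that rows without a match get an empty list here instead of NaN, and matched words keep the casing they have in the utterance.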
