
Pandas: find all words in each DataFrame row that match a list

I have a dict of emotions (anger, fear, anticipation, trust, etc.) with the words associated with each emotion.

anticipationlist:

{'anticipation': ['abundance',
          'opera',
          'star',
          'start',
          'achievement',
          'acquiring',...]

I also have a DataFrame whose rows are sentences. I want to find the words in each sentence that are associated with the emotion.

| text                          |
|---------------------------    |
| operation start yesterday     |
| operation start now           |
| operation achievement         |

Expected output

| text                          | result        |
|---------------------------    |-------------  |
| operation start yesterday     | start         |
| operation start now           | start         |
| operation achievement         | achievement   |

I tried

df['result']=df["text"].str.findall(r"\b"+"|".join(anticipationlist) +r"\b").apply(", ".join)

My result is

| text                          | result                |
|---------------------------    |--------------------   |
| operation start yesterday     | opera, star           |
| operation start now           | opera, star           |
| operation achievement         | opera, achievement    |

How can I improve my code to get the desired outcome?

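In your pattern, the leading \b applies only to the first alternative and the trailing \b only to the last one, so the words in between (such as opera and star) also match inside longer words. You can add word boundaries to each value separately: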

pat = '|'.join(r"\b{}\b".format(x) for x in anticipationlist)
df['result']=df["text"].str.findall(pat).apply(", ".join)

print(df)
                        text       result
0  operation start yesterday        start
1        operation start now        start
2      operation achievement  achievement
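
To see the difference between the two patterns outside pandas, here is a minimal sketch using only the standard `re` module. It assumes the word list sits under the `'anticipation'` key of the dict, as the question shows, and it uses only the words visible there. The names `words`, `loose` and `strict` are illustrative, and wrapping the alternatives in one non-capturing group with `re.escape` is a variation on the answer above, not the answerer's exact code.

```python
import re

# Sample of the dict from the question (only the words visible there).
anticipationlist = {'anticipation': ['abundance', 'opera', 'star', 'start',
                                     'achievement', 'acquiring']}
# Take the list explicitly; iterating the dict itself would yield its keys.
words = anticipationlist['anticipation']

# Pattern built as in the question: \b binds only to the first and last alternatives.
loose = r"\b" + "|".join(words) + r"\b"
# One non-capturing group, so a single pair of \b applies to every alternative.
# re.escape guards against words that contain regex metacharacters.
strict = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"

text = 'operation start yesterday'
print(re.findall(loose, text))   # ['opera', 'star']
print(re.findall(strict, text))  # ['start']
```

The `strict` pattern is an ordinary regex string, so it can be passed to `df["text"].str.findall(...)` in the same way as `pat`.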

Here's an approach that doesn't use regex. I also changed your anticipationlist from a dict to a list.

import pandas as pd

anticipationlist= ['abundance',
                    'opera',
                    'star',
                    'start',
                    'achievement',
                    'acquiring',
                    ]

values = [
    'operation start yesterday',
    'operation start now',
    'operation achievement',
    ]
df = pd.DataFrame(data=values, columns=['text'])

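def find_values(x):
    # Collect every word of the sentence that exactly equals a listed value.
    results = []
    for value in anticipationlist:
        for word in x.split():
            if word == value:
                results.append(word)
    return ' '.join(results)

# Apply the function to every row of the 'text' column.
df['result'] = df['text'].apply(find_values)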

print(df.head())
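
Note that the nested loop returns matches in the order of `anticipationlist`, not in the order they occur in the sentence. One possible variation is a set lookup, which keeps sentence order and avoids the inner loop. This is only a sketch: `find_in_order` and `vocabulary` are hypothetical names that are not part of the original answer.

```python
# Word list from the answer, stored as a set for fast membership tests.
vocabulary = {'abundance', 'opera', 'star', 'start', 'achievement', 'acquiring'}

def find_in_order(text):
    """Return the listed words found in `text`, in sentence order."""
    return ' '.join(word for word in text.split() if word in vocabulary)

print(find_in_order('operation start yesterday'))       # start
print(find_in_order('achievement after a slow start'))  # achievement start
```

It should be usable on the DataFrame in the same way, for example `df['result'] = df['text'].apply(find_in_order)`.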


 