
Python Pandas Summing Up Score if Description Contains Phrase in a List

I have a long list (200,000+) of phrases:

phrase_list = ['some word', 'another example', ...]

And a two-column pandas dataframe with a description in the first column and a score in the second:

Description                                    Score
this sentence contains some word in it         6
some word is on my mind                        3
repeat another example of me                   2
this sentence has no matches                   100
another example with some word                 10

There are 300,000+ rows. For each phrase in phrase_list, I want the aggregate score over all rows whose description contains that phrase. So, for 'some word', the score would be 6 + 3 + 10 = 19. For 'another example', the score would be 2 + 10 = 12.

The code that I have thus far works but is very slow:

phrase_score = []

for phrase in phrase_list:
    phrase_score.append([phrase, df['Score'][df['Description'].str.contains(phrase)].sum()])

I would like to return a pandas dataframe with the phrase in one column and the score in the second (this part is trivial once I have the list of lists). What I am looking for is a faster way to build the list of lists.

You can use a dictionary comprehension to generate the score for each phrase in your phrase list.

For each phrase, it creates a mask of the rows in the dataframe whose description contains that phrase: df.Description.str.contains(phrase). The mask is then applied to the scores, which are summed: df.Score[mask].sum().

import pandas as pd

df = pd.DataFrame({'Description': ['this sentence contains some word in it', 
                                   'some word on my mind', 
                                   'repeat another word on my mind', 
                                   'this sentence has no matches', 
                                   'another example with some word'], 
                   'Score': [6, 3, 2, 100, 10]})

phrase_list = ['some word', 'another example']
scores = {phrase: df.Score[df.Description.str.contains(phrase)].sum() 
          for phrase in phrase_list}

>>> scores
{'another example': 10, 'some word': 19}
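The question asks for the result as a two-column dataframe, so here is a minimal sketch of that last step, using the same sample data as above. The column names 'Phrase' and 'Score' in the result are my own choice, and regex=False is an assumption that the phrases are plain text rather than regular expressions; drop it if your phrases are meant to be patterns.

```python
import pandas as pd

df = pd.DataFrame({
    'Description': ['this sentence contains some word in it',
                    'some word on my mind',
                    'repeat another word on my mind',
                    'this sentence has no matches',
                    'another example with some word'],
    'Score': [6, 3, 2, 100, 10],
})
phrase_list = ['some word', 'another example']

# regex=False matches each phrase as a literal substring
# (assumption: the phrases contain no regex syntax that should be interpreted).
scores = {phrase: df.Score[df.Description.str.contains(phrase, regex=False)].sum()
          for phrase in phrase_list}

# One row per phrase; the column names here are illustrative.
result = pd.DataFrame(list(scores.items()), columns=['Phrase', 'Score'])
print(result)
```

Literal matching also avoids surprises if a phrase happens to contain characters such as '(' or '+', which str.contains would otherwise interpret as regex syntax.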

After re-reading your post more closely, I see that this is essentially the same as your approach. I expected a dictionary comprehension to be faster than a for loop, but in my testing the two performed about the same. I'm not aware of a more efficient solution short of resorting to multiprocessing.
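If you want to check the loop versus comprehension comparison on your own data, something along these lines should work. This is only a sketch: the function names with_loop and with_comprehension are made up for illustration, the sample frame is tiny, and timings on five rows say little about 300,000 rows, so substitute your real df and phrase_list.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'Description': ['this sentence contains some word in it',
                    'some word on my mind',
                    'repeat another word on my mind',
                    'this sentence has no matches',
                    'another example with some word'],
    'Score': [6, 3, 2, 100, 10],
})
phrase_list = ['some word', 'another example']


def with_loop():
    # The for-loop version from the question: a list of [phrase, score] pairs.
    phrase_score = []
    for phrase in phrase_list:
        mask = df['Description'].str.contains(phrase)
        phrase_score.append([phrase, df['Score'][mask].sum()])
    return phrase_score


def with_comprehension():
    # The dictionary-comprehension version from this answer.
    return {phrase: df['Score'][df['Description'].str.contains(phrase)].sum()
            for phrase in phrase_list}


# Run each version 100 times and report the total elapsed time.
loop_time = timeit.timeit(with_loop, number=100)
comp_time = timeit.timeit(with_comprehension, number=100)
print(f"for loop:      {loop_time:.4f} s")
print(f"comprehension: {comp_time:.4f} s")
```

Both versions call str.contains once per phrase, so the per-phrase scan over the whole column is the same in each.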
