Python Pandas Summing Up Score if Description Contains Phrase in a List

I have a long list (200,000+) of phrases:

phrase_list = ['some word', 'another example', ...]

And a two-column pandas dataframe with a description in the first column and a score in the second:

Description                                    Score
this sentence contains some word in it         6
some word is on my mind                        3
repeat another example of me                   2
this sentence has no matches                   100
another example with some word                 10

There are 300,000+ rows. For each phrase in phrase_list, I want the aggregate score over all rows whose description contains that phrase. So, for 'some word', the score would be 6 + 3 + 10 = 19. For 'another example', the score would be 2 + 10 = 12.

The code that I have so far works but is very slow:

phrase_score = []

for phrase in phrase_list:
    # Column names follow the table above ('Description', 'Score').
    phrase_score.append([phrase, df['Score'][df['Description'].str.contains(phrase)].sum()])

I would like to return a pandas dataframe with the phrase in one column and the score in the second (this part is trivial once I have the list of lists). However, I want a faster way to build the list of lists.

You can use a dictionary comprehension to generate the score for each phrase in your phrase list.

For each phrase, it creates a mask of the rows in the dataframe that contain that phrase. The mask is df.Description.str.contains(phrase). This mask is then applied to the scores, which are in turn summed, effectively df.Score[mask].sum().
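To make the mask concrete for a single phrase, here is a minimal sketch on a three-row toy frame. The rows are taken from the question's table; passing regex=False is my own assumption that the phrases are literal strings rather than regular expressions.

```python
import pandas as pd

# Toy frame for illustration; column names follow the question's table.
df = pd.DataFrame({
    "Description": ["some word is on my mind",
                    "this sentence has no matches",
                    "another example with some word"],
    "Score": [3, 100, 10],
})

# Boolean Series: True where the description contains the phrase.
# regex=False treats the phrase as a literal string, not a pattern (assumption).
mask = df["Description"].str.contains("some word", regex=False)

# Keep only the scores of the matching rows, then add them up.
total = int(df.loc[mask, "Score"].sum())

print(mask.tolist())  # [True, False, True]
print(total)          # 13
```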

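import pandas as pd

# Same sample rows as the table in the question.
df = pd.DataFrame({'Description': ['this sentence contains some word in it', 
                                   'some word is on my mind', 
                                   'repeat another example of me', 
                                   'this sentence has no matches', 
                                   'another example with some word'], 
                   'Score': [6, 3, 2, 100, 10]})

phrase_list = ['some word', 'another example']
# For each phrase, sum the scores of the rows whose description contains it.
# int() converts the NumPy integer returned by sum() to a plain Python int.
scores = {phrase: int(df.Score[df.Description.str.contains(phrase)].sum())
          for phrase in phrase_list}

>>> scores
{'some word': 19, 'another example': 12}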
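Since the question asks for a two-column dataframe, a dictionary of this shape can be converted directly. This is a small sketch: the totals are the ones worked out in the question, and the column names "Phrase" and "Score" are illustrative choices, not something the question specifies.

```python
import pandas as pd

# Example totals in the shape produced by the dictionary comprehension
# (values taken from the arithmetic in the question).
scores = {"some word": 19, "another example": 12}

# One row per phrase; the column names are illustrative.
result = pd.DataFrame(list(scores.items()), columns=["Phrase", "Score"])
print(result)
```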

After re-reading your post more closely, I see that this is very similar to your approach. I expected a dictionary comprehension to be faster than a for loop, but in my testing the results appear similar. I'm not aware of a more efficient solution without resorting to multiprocessing.
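One direction that might be worth trying, offered as an unbenchmarked sketch rather than a known faster solution: instead of scanning the whole column once per phrase, walk each description once, build its word n-grams, and look them up in a set of phrases. The helper name score_phrases is hypothetical, and the sketch assumes phrases match on whole whitespace-separated words, which is not the same as the substring matching done by str.contains.

```python
from collections import defaultdict

import pandas as pd


def score_phrases(df, phrase_list):
    """Sum Score over the rows whose Description contains each phrase.

    Assumption: a phrase matches only on whole, whitespace-separated words,
    unlike str.contains, which also matches inside longer words.
    """
    # Map whitespace-normalized phrases back to their original spelling.
    normalized = {" ".join(p.split()): p for p in phrase_list}
    # Only n-gram sizes that actually occur among the phrases are generated.
    lengths = sorted({len(p.split()) for p in normalized})
    totals = defaultdict(int)

    for description, score in zip(df["Description"], df["Score"]):
        words = description.split()
        found = set()  # count each phrase at most once per row
        for n in lengths:
            for i in range(len(words) - n + 1):
                candidate = " ".join(words[i:i + n])
                if candidate in normalized:
                    found.add(candidate)
        for candidate in found:
            totals[normalized[candidate]] += score

    # Phrases that never matched get a total of 0.
    return pd.DataFrame(
        [(p, totals[p]) for p in phrase_list], columns=["Phrase", "Score"]
    )


# Usage with the sample rows from the question.
df = pd.DataFrame({
    "Description": ["this sentence contains some word in it",
                    "some word is on my mind",
                    "repeat another example of me",
                    "this sentence has no matches",
                    "another example with some word"],
    "Score": [6, 3, 2, 100, 10],
})
result = score_phrases(df, ["some word", "another example"])
print(result)
```

The work here grows with the number of rows and the number of distinct phrase lengths rather than with the number of phrases times the number of rows, so it may help when the phrase list is large. Whether it is actually faster on the real data would need to be measured.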
