Count all occurrences of each word from a list across thousands of records in Python

Question

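I have a list of reviews and a list of keywords, and I am trying to count how many times each keyword appears in each review. The keyword list has about 30 entries and may grow or change. There are currently about 5,000 reviews, ranging from 3 to several hundred words each, and the number of reviews will definitely grow. For now the keyword list is static and the number of reviews will not grow by much, so any solution that gets the keyword counts for each review will work. Ideally, though, the solution should not have major performance problems if the number of reviews increases dramatically, or if the keywords change and all reviews have to be re-analyzed.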

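I have been reading through different approaches on Stack Overflow, but have not been able to get any of them to work. I know you can use scikit-learn to get the count of each word, but I have not figured out whether there is a way to count phrases. I have also tried various regular expressions. If the keyword list were all single words, I know I could easily use scikit-learn, a loop, or a regex, but I run into problems when a keyword contains multiple words. Two links I have tried: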

Python - Check If Word Is In A String

Phrase matching using regex and Python

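The solution here is close, but it does not count all occurrences of the same word: How to return the count of words from a list of words that appear in a list of lists?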

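Both the keyword list and the review list are pulled from a MySQL database. All keywords are lowercase. All text has been set to lowercase, and all non-alphanumeric characters except spaces have been stripped from the reviews. My original thought was to use scikit-learn's CountVectorizer to count the words, but since I did not know how to handle counting phrases, I switched. I am currently attempting this with loops and regex, but I am open to any solution.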

# Example of what I am currently attempting with regex
import re

keywords = ['test','blue sky','grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing','this pharse contains test and blue sky and look another test','the grass is green test']

for review in reviews:
    for word in keywords:
        results = re.findall(r'\bword\b',review)  #this returns no results, the variable word is not getting picked up
        #--also tried variations of this to no avail
        #--tried creating the pattern first and passing it
        # pattern = "r'\\b" + word + "\\b'"
        # results = re.findall(pattern,review)  #this errors with the msg: sre_constants.error: multiple repeat at position 9


#The results would be
review1: test=2; 'blue sky'=0;'grass is green'=0
review2: test=2; 'blue sky'=1;'grass is green'=0
review3: test=1; 'blue sky'=0;'grass is green'=1

Answer 1

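None of the options you tried search for the value of word:

results = re.findall(r'\bword\b', review) checks for the literal word "word" in the string.
When you try pattern = "r'\\b" + word + "\\b'", you are checking for the string r'\b[value of word]\b' itself, with the r and the single quotes included as literal characters.

You can use the first option, but the pattern should be r'\b%s\b' % word. This will search for the value of word.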
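The fix above can be sketched as a per-review count using the question's sample data. This is a minimal sketch rather than a definitive implementation: the re.escape call and the names patterns and counts_per_review are additions that do not appear in the answer. re.escape is included on the assumption that a keyword might someday contain a regex metacharacter.

```python
import re

keywords = ['test', 'blue sky', 'grass is green']
reviews = [
    'this is a test. test should come back twice and not 3 times for testing',
    'this pharse contains test and blue sky and look another test',
    'the grass is green test',
]

# One compiled pattern per keyword. The keyword's value is substituted into
# the pattern, and re.escape (an addition) neutralises regex metacharacters.
patterns = {kw: re.compile(r'\b%s\b' % re.escape(kw)) for kw in keywords}

# counts_per_review[i][kw] is the number of whole-word matches of kw in reviews[i].
counts_per_review = [
    {kw: len(pattern.findall(review)) for kw, pattern in patterns.items()}
    for review in reviews
]

for number, counts in enumerate(counts_per_review, start=1):
    print('review%d: %s' % (number, counts))
```

Because of the \b word boundaries, "testing" in the first review is not counted as "test", which matches the results the question asks for.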

Answer 2

I would do it by brute force first rather than overcomplicating it, and only then try to optimize.

keywords = ['test','blue sky','grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing','this pharse contains test and blue sky and look another test','the grass is green test']

# Running total per keyword, summed across all reviews.
results = dict()
for i in keywords:
    for j in reviews:
        # str.count counts substring matches, so 'testing' also counts toward 'test'.
        results[i] = results.get(i, 0) + j.count(i)


print(results)
>{'test': 6, 'blue sky': 1, 'grass is green': 1}

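The important part is that we query the dictionary with .get, so that if a key has not been set yet we do not have to handle a KeyError exception.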
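As an alternative to .get, collections.Counter from the standard library treats a missing key as 0, so the same accumulation can be written with +=. The sketch below is a variation on the answer's loop, not the answer's own code, and totals is a name chosen here for illustration.

```python
from collections import Counter

keywords = ['test', 'blue sky', 'grass is green']
reviews = [
    'this is a test. test should come back twice and not 3 times for testing',
    'this pharse contains test and blue sky and look another test',
    'the grass is green test',
]

# A Counter returns 0 for keys it has not seen, so no .get and no KeyError.
totals = Counter()
for keyword in keywords:
    for review in reviews:
        # Same substring counting as the loop above, summed over all reviews.
        totals[keyword] += review.count(keyword)

print(dict(totals))
```

Looking up a keyword that was never counted, such as totals['missing'], returns 0 instead of raising.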

If you want to go the more complex route, you can build your own trie and counter structure to search through large text files.

Parsing one terabyte of text and efficiently counting the number of occurrences of each word
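One way the trie idea could look for this problem is a word-level trie, where each node maps the next word of a keyword to a child node. This is a hedged sketch under stated assumptions, not the structure from the linked question: the names END, tokenize, build_trie and count_keywords are hypothetical, and tokenizing with \w+ is an assumption made so that "test." matches "test" while "testing" does not.

```python
import re
from collections import Counter

END = None  # sentinel key: a keyword ends at this node (tokens are never None)


def tokenize(text):
    """Split text into word tokens, dropping punctuation such as the '.' in 'test.'."""
    return re.findall(r'\w+', text)


def build_trie(keywords):
    """Build a word-level trie; each node is a dict mapping the next token to a child node."""
    root = {}
    for keyword in keywords:
        node = root
        for token in tokenize(keyword):
            node = node.setdefault(token, {})
        node[END] = keyword  # remember which keyword ends here
    return root


def count_keywords(review, trie):
    """Count whole-word keyword matches in one review by walking the trie from each token."""
    tokens = tokenize(review)
    counts = Counter()
    for start in range(len(tokens)):
        node = trie
        pos = start
        # Follow the trie for as long as the upcoming tokens continue some keyword.
        while pos < len(tokens) and tokens[pos] in node:
            node = node[tokens[pos]]
            if END in node:
                counts[node[END]] += 1
            pos += 1
    return counts


keywords = ['test', 'blue sky', 'grass is green']
reviews = [
    'this is a test. test should come back twice and not 3 times for testing',
    'this pharse contains test and blue sky and look another test',
    'the grass is green test',
]

trie = build_trie(keywords)
for number, review in enumerate(reviews, start=1):
    counts = count_keywords(review, trie)
    print('review%d: %s' % (number, {kw: counts[kw] for kw in keywords}))
```

With this layout each review is scanned once for all keywords together. The work per starting token is bounded by the longest keyword, so adding keywords does not add another full pass over the text.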

Solution 1
0 2017-11-25 00:36:21

Solution 2
0 accepted 2017-11-25 00:47:36
