Counting phrase frequency in Python 3.3.2
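I have been going through various sources on the web and have tried several approaches, but I can only find how to count the frequency of unique words, not unique phrases. My code so far is as follows: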
import collections
import re

wanted = set(['inflation', 'gold', 'bank'])
cnt = collections.Counter()
words = re.findall(r'\w+', open('02.2003.BenBernanke.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
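If possible, I would also like to count how many times the phrases "central bank" and "high inflation" are used in this text. Thanks for any suggestions or guidance.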
First of all, this is how I would generate the cnt that you build (to reduce memory overhead):
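import collections
import re

def findWords(filepath):
    # Yield lowercase words line by line, so the whole file is never held in memory
    with open(filepath) as infile:
        for line in infile:
            words = re.findall(r'\w+', line.lower())
            yield from words

cnt = collections.Counter(findWords('02.2003.BenBernanke.txt'))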
Now, on to your question about phrases:
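from itertools import tee

phrases = {'central bank', 'high inflation'}
# Two independent iterators over the same word stream
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2, None)  # advance the second iterator by one word so zip yields adjacent pairs
for w1, w2 in zip(fw1, fw2):
    phrase = ' '.join([w1, w2])
    if phrase in phrases:
        cnt[phrase] += 1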
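The pairwise loop above only handles two-word phrases. If you also need longer ones, here is a rough sketch of one way to generalise it with a sliding window. The names iter_words and count_phrases and the sample sentences are made up for illustration, and it assumes the phrases are lowercase and space-separated:

```python
import collections
import re


def iter_words(lines):
    """Yield lowercase word tokens from an iterable of text lines."""
    for line in lines:
        yield from re.findall(r'\w+', line.lower())


def count_phrases(words, phrases):
    """Count occurrences of each wanted phrase in a stream of words."""
    # Store each phrase as a tuple of words; drop empty phrases.
    wanted = {tuple(p.split()) for p in phrases} - {()}
    lengths = sorted({len(p) for p in wanted})
    counts = collections.Counter()
    if not lengths:
        return counts
    # Keep only as many recent words as the longest phrase needs.
    window = collections.deque(maxlen=lengths[-1])
    for word in words:
        window.append(word)
        recent = tuple(window)
        for n in lengths:
            # Check the n-word sequence that ends at the current word.
            if n <= len(recent) and recent[-n:] in wanted:
                counts[' '.join(recent[-n:])] += 1
    return counts


sample = [
    "The central bank targets low inflation.",
    "High inflation worries the Central Bank of the country.",
    "A central bank with high inflation rate concerns.",
]
phrase_counts = count_phrases(
    iter_words(sample),
    {'central bank', 'high inflation', 'high inflation rate', 'gold'},
)
print(phrase_counts)
```

To run it on a file, pass the open file object to iter_words instead of the sample list. Note that, like the loop above, this treats the text as one continuous word stream, so a phrase may match across a line or sentence boundary.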
Hope this helps
Assuming the file is not huge, this is the easiest way:
# `wanted` must also contain the two-word phrases, e.g. 'central bank'
for w1, w2 in zip(words, words[1:]):
    phrase = w1 + " " + w2
    if phrase in wanted:
        cnt[phrase] += 1
print(cnt)
To count literal occurrences of a few phrases in a small file:
with open("input_text.txt") as file:
    text = file.read()
n = text.count("high inflation rate")  # case-sensitive, non-overlapping matches
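One caveat with str.count: it is case-sensitive and misses a phrase that is split across a line break. If that matters, a regular expression is one possible alternative. This is only a sketch; count_phrase and the sample text are hypothetical, and it assumes the phrase is non-empty and made of ordinary word characters:

```python
import re


def count_phrase(text, phrase):
    """Count whole-word, case-insensitive occurrences of `phrase` in `text`."""
    # Escape each word, then allow any run of whitespace between the words.
    words = (re.escape(w) for w in phrase.split())
    pattern = r'\b' + r'\s+'.join(words) + r'\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


sample_text = ("High inflation hurts. Very high\ninflation hurts more, "
               "and high inflation rates persist.")
n_plain = sample_text.count("high inflation")
n_regex = count_phrase(sample_text, "high inflation")
print(n_plain, n_regex)
```

The word boundaries also stop "central bank" from matching inside "central bankers", which str.count would count.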
The nltk.collocations module provides tools to identify words that often appear consecutively:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

# run nltk.download() if there are files missing
words = [word.casefold() for sentence in sent_tokenize(text)
         for word in word_tokenize(sentence)]
word_fd = nltk.FreqDist(words)
bigram_fd = nltk.FreqDist(nltk.bigrams(words))
finder = BigramCollocationFinder(word_fd, bigram_fd)
bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 5))
print(finder.score_ngrams(bigram_measures.raw_freq))

# finder can be constructed from words directly
finder = TrigramCollocationFinder.from_words(words)
# filter words
finder.apply_word_filter(lambda w: w not in wanted)
# top n results
trigram_measures = nltk.collocations.TrigramAssocMeasures()
print(sorted(finder.nbest(trigram_measures.raw_freq, 2)))
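If you only need raw n-gram frequencies and would rather not install NLTK, something along these lines may be enough. It is a plain-Python sketch, not an equivalent of the collocation finders (no PMI or other association measures); ngram_counts and the sample text are made up for illustration, and tokenization is a simple \w+ split:

```python
import collections
import re


def ngram_counts(words, n):
    """Return a Counter of all n-word sequences (as tuples) in a list of words."""
    # zip over n shifted views of the list: words[0:], words[1:], ...
    return collections.Counter(zip(*(words[i:] for i in range(n))))


sample_text = ("The central bank raised rates. The central bank cited "
               "high inflation. High inflation persisted.")
sample_words = re.findall(r'\w+', sample_text.lower())
bigram_counts = ngram_counts(sample_words, 2)
trigram_counts = ngram_counts(sample_words, 3)
print(bigram_counts.most_common(3))
```

Looking up a specific phrase is then just bigram_counts[('central', 'bank')].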