Python code to count frequent word pairs in nltk

Question

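I'm confused about how to find frequent word pairs in a file. I first obtained the bigrams, but how do I proceed from here? I also tried stripping punctuation with a regexp before applying nltk.bigrams.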

raw=open("proj.txt","r").read()
tokens=nltk.word_tokenize(raw)
pairs=nltk.bigrams(tokens)
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = BigramCollocationFinder.from_words(pairs)
finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 10)

Answer 1

You seem to be calling BigramCollocationFinder without importing it. The correct path is nltk.collocations.BigramCollocationFinder. So you can try the following (make sure your text file actually contains text!):

>>> import nltk
>>> raw = open('test2.txt').read()
>>> tokens = nltk.word_tokenize(raw)
# or, to exclude punctuation, use something like the following instead of the above line:
# >>> tokens = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(raw)
>>> pairs = nltk.bigrams(tokens)
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> trigram_measures = nltk.collocations.TrigramAssocMeasures()
>>> finder = nltk.collocations.BigramCollocationFinder.from_words(pairs)  # note the difference here!
>>> finder.apply_freq_filter(3)
>>> finder.nbest(bigram_measures.pmi, 10)  # from the Old English text of Beowulf
[(('m\xe6g', 'Higelaces'), ('Higelaces', ',')), (('bearn', 'Ecg\xfeeowes'), ('Ecg\xfeeowes', ':')), (("''", 'Beowulf'), ('Beowulf', 'ma\xfeelode')), (('helm', 'Scyldinga'), ('Scyldinga', ':')), (('ne', 'cu\xfeon'), ('cu\xfeon', ',')), ((',', '\xe6r'), ('\xe6r', 'he')), ((',', 'helm'), ('helm', 'Scyldinga')), ((',', 'bearn'), ('bearn', 'Ecg\xfeeowes')), (('Ne', 'w\xe6s'), ('w\xe6s', '\xfe\xe6t')), (('Beowulf', 'ma\xfeelode'), ('ma\xfeelode', ','))]
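One thing worth noting about the output above: each entry is a pair of bigrams rather than a pair of words, because `from_words` builds its own bigrams and was handed `pairs`, which already contains bigrams. A minimal sketch of the variant that ranks plain word pairs is shown here, passing the token list directly. The sample text and the frequency threshold of 2 are assumptions chosen so the example runs without a file; `RegexpTokenizer` is used so that punctuation is excluded and no tokenizer data needs to be downloaded.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import RegexpTokenizer

# Hypothetical sample text standing in for the contents of the file.
raw = (
    "The quick brown fox saw the lazy dog. "
    "The quick brown fox chased the lazy dog. "
    "Then the quick brown fox slept."
)

# \w+ keeps runs of word characters, so punctuation never becomes a token.
tokens = RegexpTokenizer(r"\w+").tokenize(raw.lower())

# Pass the tokens themselves; the finder forms the bigrams internally.
finder = BigramCollocationFinder.from_words(tokens)

# Drop pairs seen fewer than 2 times (the answer uses 3 on a larger text).
finder.apply_freq_filter(2)

bigram_measures = BigramAssocMeasures()
top_pairs = finder.nbest(bigram_measures.pmi, 5)
print(top_pairs)
```

With this input, only the pairs that occur at least twice survive the filter, and each result is a plain `(word, word)` tuple.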

Answer 2

It sounds like you just want the list of word pairs. If so, I think you mean to use finder.score_ngrams, like this:

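import nltk
from nltk.collocations import BigramCollocationFinder

# tokens is the word list from nltk.word_tokenize(raw) in the question
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
scores = finder.score_ngrams(bigram_measures.raw_freq)
print(scores)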

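Other scoring metrics can be used as well. It sounds like you just want the frequency, but the other scoring metrics available for general n-grams are listed here - http://nltk.googlecode.com/svn-/trunk/doc/api/nltk.metrics.association.NgramAssocMeasures-class.html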
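As a rough illustration of swapping metrics, the sketch below scores the same finder with `raw_freq`, `pmi`, and `likelihood_ratio`, and also reads the raw pair counts from the finder's `ngram_fd` frequency distribution. The token list is a made-up stand-in for real tokenizer output.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical token list; in practice this would come from the tokenizer.
tokens = "a b a b a b c a c".split()

finder = BigramCollocationFinder.from_words(tokens)
measures = BigramAssocMeasures()

# Raw occurrence counts per pair, most frequent first.
counts = finder.ngram_fd.most_common(2)
print(counts)

# The same finder can be scored with different association measures.
# Each call returns (pair, score) tuples sorted from highest score down.
by_freq = finder.score_ngrams(measures.raw_freq)
by_pmi = finder.score_ngrams(measures.pmi)
by_llr = finder.score_ngrams(measures.likelihood_ratio)

print(by_freq[0])
print(by_pmi[0])
print(by_llr[0])
```

If only the counts matter, `ngram_fd` alone is probably enough; the association measures are more useful when frequent function-word pairs would otherwise dominate the ranking.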

Answer 1
1 2014-01-22 13:48:09

Answer 2
0 2014-01-22 13:50:24
