How to find common phrases in a text document

Question

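I have a text file with a lot of comments/sentences, and I want to somehow find the most common phrases that are repeated within the document itself. I tried fiddling with NLTK a bit and found this thread: How to extract common/significant phrases from a series of text entries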

However, after trying it, I get odd results like this:

>>> finder.apply_freq_filter(3)
>>> finder.nbest(bigram_measures.pmi, 10)
[('m', 'e'), ('t', 's')]

On another file, where the phrase "this is funny" is very common, I get an empty list [].

How should I go about this?

Here is my full code:

import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words('MkXVM6ad9nI.txt')

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 10 n-grams with the highest PMI
print finder.nbest(bigram_measures.pmi, 10)

Answer 1

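I haven't used nltk, but I suspect the problem is that from_words expects the text itself, as a string or a tokens (?) object, not a filename. A plain Python string is iterated one character at a time, which would be consistent with single-letter pairs like ('m', 'e').

Something like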

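import nltk
from nltk.collocations import BigramCollocationFinder

with open('MkXVM6ad9nI.txt') as wordfile:
    text = wordfile.read()

# split the text into word and punctuation tokens before building the finder
tokens = nltk.wordpunct_tokenize(text)
finder = BigramCollocationFinder.from_words(tokens)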

might work, although there may also be a dedicated file API.
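To make the string-versus-tokens difference concrete, here is a minimal standard-library sketch. It is not NLTK's implementation: `bigram_counts` is a hypothetical helper, the sample text is made up, and pairs are ranked by raw count rather than PMI. The regular expression is meant to mimic the kind of word/punctuation split that a tokenizer produces.

```python
import re
from collections import Counter

def bigram_counts(items):
    """Count adjacent pairs in any iterable (hypothetical helper, not an NLTK API)."""
    items = list(items)
    return Counter(zip(items, items[1:]))

# Made-up sample text for illustration only.
text = "this is funny. this is funny, and this is funny again"

# Passing the raw string: iteration yields single characters,
# so every "bigram" is a pair of characters.
char_pairs = bigram_counts(text)

# Passing tokens: iteration yields words and punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]+", text)
word_pairs = bigram_counts(tokens)

# Keep only pairs seen 3+ times, most frequent first
# (similar in spirit to a frequency filter).
frequent = [(pair, n) for pair, n in word_pairs.most_common() if n >= 3]
print(frequent)
```

With the raw string, every counted pair is made of single characters; with tokens, the repeated phrase shows up as word pairs. The same idea extends to longer phrases by zipping three or more shifted copies of the token list.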

Solution 1
4 (accepted) 2014-04-22 20:17:56
