
How to count words in a corpus document

I am wondering about the best way to count words in a document. I have my own corpus set up as "corp.txt", and I want to know how frequently "students, trust, ayre" occur in the "corp.txt" file. What can I use?

Would it be one of the following:

....
>>> full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>
# HOW WOULD I CALCULATE HOW FREQUENTLY THE WORDS
# "students, trust, ayre" occur in full?

Thanks, Ray

I suggest looking into collections.Counter. Especially for large amounts of text, it does the trick and is limited only by the available memory. On a computer with 12 GB of RAM it counted 30 billion tokens in a day and a half. Pseudocode (the variable words is really some kind of reference to a file or the like):

from collections import Counter

my_counter = Counter()
for word in words:         # words: an iterable of word tokens
    my_counter[word] += 1  # note: my_counter.update(word) would count characters, not words

When this has finished, the words are stored in the dictionary my_counter, which you can then write to disk or store elsewhere (for example, in sqlite).
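
As a rough illustration of that last step, here is a minimal sketch of writing the counts to sqlite; the table name word_counts and its columns are hypothetical and not part of the original answer:

import sqlite3
from collections import Counter

def save_counts(counter, db_path='counts.db'):
    # Hypothetical schema: one row per distinct word and its count.
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS word_counts (word TEXT PRIMARY KEY, freq INTEGER)')
    conn.executemany('INSERT OR REPLACE INTO word_counts VALUES (?, ?)', counter.items())
    conn.commit()
    conn.close()

# Example usage with a small in-memory Counter.
save_counts(Counter({'students': 3, 'trust': 1, 'ayre': 2}))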

Most people would just use a defaultdict (with a default value of 0). Every time you see a word, just add 1 to its count:

from collections import defaultdict

total = 0
count = defaultdict(int)  # missing keys default to 0
for word in words:        # words: an iterable of word tokens
    total += 1
    count[word] += 1

# Now you can just determine the frequency by dividing each count by total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * float(ct) / float(total)))
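
For the specific words in the original question, a lookup over such a count dictionary might look like the sketch below; reading and tokenizing 'FullReport.txt' with NLTK is an assumption added here, not part of this answer:

from collections import defaultdict
from nltk import word_tokenize

# Assumed setup: tokenize the corpus file mentioned in the question.
with open('FullReport.txt') as fin:
    words = word_tokenize(fin.read().lower())

total = 0
count = defaultdict(int)
for word in words:
    total += 1
    count[word] += 1

for target in ('students', 'trust', 'ayre'):
    print('%s: %d occurrences, %.4f%% of all tokens'
          % (target, count[target], 100.0 * count[target] / total))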

You're almost there! You can index the FreqDist with the words you are interested in. Try the following:

print fdist['students']
print fdist['ayre']
print fdist['full']

This gives you the count, or number of occurrences, of each word. You said "frequency"; frequency is not the same as the number of occurrences, and you can get it like this:

print fdist.freq('students')
print fdist.freq('ayre')
print fdist.freq('full')
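
To make the distinction concrete: FreqDist.freq() is simply the count divided by the total number of samples, fdist.N(). A minimal sketch with a made-up token list (not the asker's corpus):

from nltk.probability import FreqDist

# Tiny made-up token list for illustration.
tokens = ['students', 'trust', 'students', 'ayre']
fdist = FreqDist(tokens)

print(fdist['students'])       # 2   -> count of occurrences
print(fdist.N())               # 4   -> total number of tokens
print(fdist.freq('students'))  # 0.5 -> count / N, the relative frequency
assert fdist.freq('students') == fdist['students'] / float(fdist.N())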

You can read the file, tokenize it, and put the individual tokens into an NLTK FreqDist object; see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html

from nltk.probability import FreqDist
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into FreqDist object.
fdist = FreqDist()
with open('test.txt', 'r') as fin:
    for word in word_tokenize(fin.read()):
        fdist.inc(word)  # in NLTK 3.x, use fdist[word] += 1 instead

print "'blah' occurred", fdist['blah'], "times"

[out]:

'blah' occurred 3 times

Alternatively, you can use the native Counter object from collections and get the same counts; see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case sensitive, so you may also want to lowercase your text when tokenizing:

from collections import Counter
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into a Counter object.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print "'blah' occurred", fdist['blah'], "times"

