
How to count words in a corpus document

I am wondering about the best way to count words in a document. I have my own corpus set up as "corp.txt", and I want to know how frequently "students, trust, ayre" occur in the "corp.txt" file. What can I use?

Would it be one of the following:

....
>>> full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>
# HOW WOULD I CALCULATE HOW FREQUENTLY THE WORDS
# "students, trust, ayre" occur in full?

Thanks, Ray

I suggest looking into collections.Counter. Especially for large amounts of text, it does the trick and is limited only by the available memory. On a computer with 12 GB of RAM it counted 30 billion tokens in a day and a half. Pseudocode (the variable words is really some kind of reference to a file or the like):

from collections import Counter

my_counter = Counter()
for word in words:         # words: an iterable of word tokens
    my_counter[word] += 1  # note: my_counter.update(word) would count characters, not words

When this has finished, the words are stored in the dictionary my_counter, which you can then write to disk or store elsewhere (for example, in sqlite).
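
As a rough illustration of that last step, here is a minimal sketch of writing the counts to sqlite; the table name word_counts and its columns are hypothetical and not part of the original answer:

import sqlite3
from collections import Counter

def save_counts(counter, db_path='counts.db'):
    # Hypothetical schema: one row per distinct word and its count.
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS word_counts (word TEXT PRIMARY KEY, freq INTEGER)')
    conn.executemany('INSERT OR REPLACE INTO word_counts VALUES (?, ?)', counter.items())
    conn.commit()
    conn.close()

# Example usage with a small in-memory Counter.
save_counts(Counter({'students': 3, 'trust': 1, 'ayre': 2}))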

Most people would just use a defaultdict (with a default value of 0). Every time you see a word, just add 1 to its count:

from collections import defaultdict

total = 0
count = defaultdict(int)  # missing keys default to 0
for word in words:        # words: an iterable of word tokens
    total += 1
    count[word] += 1

# Now you can just determine the frequency by dividing each count by total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * float(ct) / float(total)))
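
For the specific words in the original question, a lookup over such a count dictionary might look like the sketch below; reading and tokenizing 'FullReport.txt' with NLTK is an assumption added here, not part of this answer:

from collections import defaultdict
from nltk import word_tokenize

# Assumed setup: tokenize the corpus file mentioned in the question.
with open('FullReport.txt') as fin:
    words = word_tokenize(fin.read().lower())

total = 0
count = defaultdict(int)
for word in words:
    total += 1
    count[word] += 1

for target in ('students', 'trust', 'ayre'):
    print('%s: %d occurrences, %.4f%% of all tokens'
          % (target, count[target], 100.0 * count[target] / total))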

You're almost there! You can index the FreqDist with the words you are interested in. Try the following:

print fdist['students']
print fdist['ayre']
print fdist['full']

This gives you the count, or number of occurrences, of each word. You said "frequency"; frequency is not the same as the number of occurrences, and you can get it like this:

print fdist.freq('students')
print fdist.freq('ayre')
print fdist.freq('full')
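
To make the distinction concrete: FreqDist.freq() is simply the count divided by the total number of samples, fdist.N(). A minimal sketch with a made-up token list (not the asker's corpus):

from nltk.probability import FreqDist

# Tiny made-up token list for illustration.
tokens = ['students', 'trust', 'students', 'ayre']
fdist = FreqDist(tokens)

print(fdist['students'])       # 2   -> count of occurrences
print(fdist.N())               # 4   -> total number of tokens
print(fdist.freq('students'))  # 0.5 -> count / N, the relative frequency
assert fdist.freq('students') == fdist['students'] / float(fdist.N())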

You can read the file, tokenize it, and put the individual tokens into an NLTK FreqDist object; see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html

from nltk.probability import FreqDist
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into FreqDist object.
fdist = FreqDist()
with open('test.txt', 'r') as fin:
    for word in word_tokenize(fin.read()):
        fdist.inc(word)  # in NLTK 3.x, use fdist[word] += 1 instead

print "'blah' occurred", fdist['blah'], "times"

[out]:

'blah' occurred 3 times

Alternatively, you can use the native Counter object from collections and get the same counts; see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case sensitive, so you may also want to lowercase your text when tokenizing:

from collections import Counter
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into a Counter object.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print "'blah' occurred", fdist['blah'], "times"

