How to count words in a corpus document
I want to know the best way to count words in a document. If I have my own corpus set up as "corp.txt" and I want to know how frequently "students, trust, ayre" occur in the "corp.txt" file, what could I use?
Would it be one of the following:
full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>
# How would I calculate how frequently the words
# "students, trust, ayre" occur in full?
Thanks, Ray
I suggest looking into collections.Counter. Especially for large amounts of text, it does the trick and is limited only by the available memory. It counted 30 billion tokens in a day and a half on a machine with 12 GB of RAM. Pseudocode (the variable Words is really some reference to a file or similar):
from collections import Counter

my_counter = Counter()
for word in Words:
    my_counter.update([word])  # note the list: update(word) would count the characters of the string
When it is done, the words are kept in the dictionary my_counter, which can then be written to disk or stored elsewhere (sqlite, for example).
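A minimal runnable sketch of the approach above; the token list is invented here and stands in for whatever tokenization of the asker's 'corp.txt' is used in practice:

```python
from collections import Counter

# Illustrative token stream; in practice this would come from
# reading and tokenizing a file such as 'corp.txt'.
words = ["students", "trust", "ayre", "trust", "students", "trust"]

my_counter = Counter()
for word in words:
    my_counter[word] += 1  # count one occurrence of this word

print(my_counter["trust"])     # 3
print(my_counter["students"])  # 2
```

Indexing a Counter with a word that was never seen returns 0 rather than raising a KeyError, which is convenient for lookups like these.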
Most people would just use a defaultdict (with a default value of 0). Every time you see a word, just increment its value by one:
from collections import defaultdict

total = 0
count = defaultdict(int)  # missing words default to 0
for word in words:
    total += 1
    count[word] += 1

# Now you can determine the frequency by dividing each count by the total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * float(ct) / float(total)))
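Applied to the words from the question, the same loop might look like this; the token list here is made up purely for illustration:

```python
from collections import defaultdict

# Made-up token list standing in for the real corpus.
words = ["students", "trust", "ayre", "trust", "and", "trust"]

total = 0
count = defaultdict(int)  # missing words default to 0
for word in words:
    total += 1
    count[word] += 1

# "trust" appears 3 times out of 6 tokens.
print('Frequency of %s: %.2f%%' % ('trust', 100.0 * count['trust'] / total))
# Frequency of trust: 50.00%
```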
You're almost there! You can index the FreqDist using the word you are interested in. Try the following:
print(fdist['students'])
print(fdist['ayre'])
print(fdist['full'])
This gives you the count, or number of occurrences, of each word. You said "frequency"; frequency is different from the number of occurrences, and it can be obtained like this:
print(fdist.freq('students'))
print(fdist.freq('ayre'))
print(fdist.freq('full'))
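The distinction drawn above (occurrences versus frequency) is just a division by the total token count; a plain-Python sketch of the same idea, with an invented token list:

```python
from collections import Counter

# Invented token list: 6 tokens, "the" occurs twice.
tokens = "the cat sat on the mat".split()
counts = Counter(tokens)
total = len(tokens)

print(counts['the'])          # occurrences: 2
print(counts['the'] / total)  # frequency: 2/6 = 0.333...
```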
You can read the file, tokenize it, and put the individual tokens into an NLTK FreqDist object; see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html
from nltk.probability import FreqDist
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads the file into a FreqDist object.
fdist = FreqDist()
with open('test.txt', 'r') as fin:
    for word in word_tokenize(fin.read()):
        fdist[word] += 1  # FreqDist.inc() was removed in NLTK 3

print("'blah' occurred", fdist['blah'], "times")
[out]:
'blah' occurred 3 times
Alternatively, you can use the native Counter object from collections and get the same counts; see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case-sensitive, so you may also want to lowercase your tokens:
from collections import Counter
from nltk import word_tokenize
# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
fout.write(doc)
# Reads the file into a Counter object.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print("'blah' occurred", fdist['blah'], "times")
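Once the counts are in a Counter, most_common() is a convenient way to inspect the top words. A sketch reusing the same test sentence; a crude split-and-strip replaces nltk.word_tokenize here so the example has no external dependencies:

```python
from collections import Counter
from string import punctuation

doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
# Whitespace split with surrounding punctuation stripped, standing in
# for nltk.word_tokenize; lowercasing merges "Blah" and "blah".
tokens = [w.strip(punctuation) for w in doc.lower().split()]

fdist = Counter(tokens)
print(fdist.most_common(2))  # [('blah', 4), ...]
```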