
How to count words in a corpus document

I want to know the best way to count words in a document. I have my own "corp.txt" corpus set up, and I want to know how frequently "students", "trust", and "ayre" occur in the file "corp.txt". What could I use?

Would it be one of the following:

....
>>> full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>
# How would I calculate how frequently the words
# "students", "trust", and "ayre" occur in full?

Thanks, Ray

I would suggest looking into collections.Counter. Especially for large amounts of text, this does the trick and is limited only by the available memory. It counted 30 billion tokens in a day and a half on a computer with 12 GB of RAM. Pseudocode (the variable words will in practice be some reference to a file or similar):

from collections import Counter

my_counter = Counter()
for word in words:
    # Count each token individually; passing the bare string to
    # update() would count its characters instead.
    my_counter[word] += 1

When finished, the words are in a dictionary my_counter, which can then be written to disk or stored elsewhere (in sqlite, for example).
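As a minimal sketch of that last step, assuming the my_counter built above, the counts could be written out with Python's standard sqlite3 module (the database file and table names here are made up for illustration):

import sqlite3

# Hypothetical persistence step: dump my_counter's items into a SQLite table.
conn = sqlite3.connect('word_counts.db')
conn.execute('CREATE TABLE IF NOT EXISTS counts (word TEXT PRIMARY KEY, n INTEGER)')
conn.executemany('INSERT OR REPLACE INTO counts VALUES (?, ?)', my_counter.items())
conn.commit()
conn.close()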

Most people would just use a defaultdict from collections (with a default value of 0). Every time you see a word, just increment its value by one:

from collections import defaultdict

total = 0
count = defaultdict(int)  # missing keys default to 0
for word in words:
    total += 1
    count[word] += 1

# Now you can determine the frequency by dividing each count by the total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * float(ct) / float(total)))
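Both of the approaches above assume a words iterable of tokens. A minimal sketch of producing it from the asker's file, assuming a plain whitespace split is good enough (NLTK's word_tokenize would be a drop-in alternative):

# Assumed way to get the `words` iterable from the file in question.
with open('corp.txt') as fin:
    words = fin.read().lower().split()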

You are almost there! You can index the FreqDist using the word you are interested in. Try the following:

print(fdist['students'])
print(fdist['ayre'])
print(fdist['full'])

This gives you the count, or number of occurrences, of each word. You said "how frequently": frequency is different from the number of occurrences, and you can get that like this:

print(fdist.freq('students'))
print(fdist.freq('ayre'))
print(fdist.freq('full'))
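For what it's worth, freq() is just the count divided by the total number of tokens, which FreqDist exposes as N(); a quick sanity check (this assumes NLTK 3's FreqDist):

# freq(w) should equal count(w) / total number of tokens.
assert fdist.freq('students') == fdist['students'] / fdist.N()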

You can read a file, tokenize it, and put the individual tokens into a FreqDist object in NLTK; see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html

from nltk.probability import FreqDist
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into a FreqDist object.
fdist = FreqDist()
with open('test.txt', 'r') as fin:
    for word in word_tokenize(fin.read()):
        fdist[word] += 1  # fdist.inc(word) was removed in NLTK 3

print("'blah' occurred", fdist['blah'], "times")

[out]:

'blah' occurred 3 times

Alternatively, you can use a native Counter object from collections and get the same counts; see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case sensitive, so you might also want to lowercase your tokens before counting (doing so merges 'Blah' into 'blah', so its count below becomes 4 rather than 3):

from collections import Counter
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into a Counter object, lowercasing the tokens first.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print "'blah' occurred", fdist['blah'], "times"
