Finding the number of times each word in a HashSet occurs in a text document
I'm implementing a Naive Bayes text classification algorithm in Java.
So far, I have declared a HashSet called Vocabulary that stores all the unique words from a given text file (the test file).
One of the steps in the algorithm is to concatenate all the members of the test files into a single text file. This turns out to be a fairly large file containing the words from every file.
Now I have to count how many times each word in the Vocabulary occurs in the concatenated text file. My first guess is to keep some kind of array structure that holds the frequency of each word, but then I would have far too many entries.
Could anyone give me a better suggestion?
Use a dictionary (HashMap) where the words are the keys and the values are the number of occurrences. If the HashSet fits into memory, the HashMap should as well.
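A minimal sketch of the HashMap approach follows. The class name VocabularyCounts, the method countWords, and the tokenization rule (lowercase the text, split on non-word characters) are assumptions made for illustration; they do not come from the question. The sketch also assumes the vocabulary words are already stored in lowercase.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class VocabularyCounts {
    // Counts how often each vocabulary word occurs in the text.
    // Assumption: vocabulary entries are lowercase, and words are
    // separated by non-word characters (spaces, punctuation).
    static Map<String, Integer> countWords(Set<String> vocabulary, String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Start every vocabulary word at zero so absent words are still reported.
        for (String word : vocabulary) {
            counts.put(word, 0);
        }
        for (String token : text.toLowerCase().split("\\W+")) {
            // Tokens outside the vocabulary are ignored.
            if (vocabulary.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> vocabulary = new HashSet<>(Arrays.asList("the", "cat", "dog"));
        Map<String, Integer> counts = countWords(vocabulary, "The cat saw the other cat.");
        System.out.println("cat: " + counts.get("cat"));
        System.out.println("dog: " + counts.get("dog"));
    }
}
```

For a large concatenated file, the same loop could be run line by line over a BufferedReader instead of a single String, so that only the map (one entry per vocabulary word) stays in memory.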
You could try using a trie, with the leaf nodes storing the word frequencies.
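A rough sketch of the trie idea is shown here. The class WordTrie and its methods add and count are hypothetical names chosen for this example. One deliberate difference from the suggestion as worded: the count is kept on the node where a word ends, which is not always a leaf (for example, "car" ends on an inner node if "cart" is also stored).

```java
import java.util.HashMap;
import java.util.Map;

class WordTrie {
    private static final class Node {
        // One child per next character.
        final Map<Character, Node> children = new HashMap<>();
        // Number of times the word ending at this node was added.
        int count;
    }

    private final Node root = new Node();

    // Records one occurrence of the word, creating nodes as needed.
    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.count++;
    }

    // Returns how many times the word was added, or 0 if it never was.
    int count(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }
}
```

Words that share prefixes share nodes, which may save memory on a large vocabulary, although a per-node HashMap has its own overhead, so whether this beats a plain HashMap would need measuring.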