
Finding the number of times each word in a hashset occurs in a text document

I'm implementing a Naive Bayes text classification algorithm in Java.

So far, I have declared a HashSet called Vocabulary that stores all the unique words from a given text file (test file).

One of the steps in the algorithm is to concatenate all the test files into a single text file. This turns out to be a fairly big file containing the words from every file.

Now I have to count the number of occurrences of each word in Vocabulary within the concatenated text file. My first guess is to keep some sort of array structure that holds the frequency of each word. But then again, I would have far too many entries.

Could anyone please give me better suggestions?

Use a dictionary (HashMap) where the words are the keys and the values are the number of occurrences. If the HashSet fits into memory, the HashMap should as well.
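A minimal sketch of that idea, assuming the concatenated text has already been split into whitespace-separated tokens. The class name `VocabularyCounter`, the method `countOccurrences`, and the sample words are made up for illustration and do not come from the question.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; the name is an assumption for this sketch.
class VocabularyCounter {

    // Counts how often each vocabulary word appears among the tokens.
    // Tokens that are not in the vocabulary are ignored.
    static Map<String, Integer> countOccurrences(Set<String> vocabulary,
                                                 Iterable<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        // Start every vocabulary word at zero so absent words still get an entry.
        for (String word : vocabulary) {
            counts.put(word, 0);
        }
        for (String token : tokens) {
            if (vocabulary.contains(token)) {
                // merge adds 1 to the existing count for this word.
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the Vocabulary set and the big file.
        Set<String> vocabulary = new HashSet<>(Arrays.asList("spam", "ham", "eggs"));
        List<String> tokens = Arrays.asList("spam ham spam bacon spam".split("\\s+"));

        Map<String, Integer> counts = countOccurrences(vocabulary, tokens);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

For a large file, the tokens would probably be read line by line (for example with a `BufferedReader`) rather than held in a list. The map itself only ever has one entry per vocabulary word, so it does not grow with the size of the file.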

You could try using a trie, where the leaf nodes store the frequency of each word.
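A rough sketch of the trie approach, with hypothetical names (`WordTrie`, `add`, `count`) that are not from the original answer. One caveat: a word can end at an interior node when it is a prefix of another word (for example "he" and "hello"), so this sketch keeps the counter on whichever node the word ends at, not only on leaves.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical trie class; names are assumptions for this sketch.
class WordTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count; // how many times a word ending at this node was added
    }

    private final Node root = new Node();

    // Walks (and creates, if needed) the path for the word, then bumps its counter.
    public void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.count++;
    }

    // Returns how many times the exact word was added, or 0 if it never was.
    public int count(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    public static void main(String[] args) {
        WordTrie trie = new WordTrie();
        // Made-up sample text standing in for the concatenated file.
        for (String token : "he said hello and he left".split("\\s+")) {
            trie.add(token);
        }
        System.out.println("he: " + trie.count("he"));
        System.out.println("hello: " + trie.count("hello"));
        System.out.println("hell: " + trie.count("hell"));
    }
}
```

Words with common prefixes share nodes, which can save memory on large vocabularies, although a plain HashMap is usually simpler when the vocabulary already fits in memory.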

