
Count the number of repeated words in a text file using Java

How do you open a text file in Java from a path and count the number of repeated words in the file using tokenizers?

E.g.: I want to open a file using the path name and be able to read and count the repeated words in the file.

badpanda is half-right: there is lots of info already out there on how to read words from a file. Don't take his suggestion of using ArrayLists, though. All you need is one of the Map implementations (HashMap or TreeMap). Each key is a word in the file; each value is the current count of that word.
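The Map-based approach might look like this (a minimal sketch; the class and method names are mine):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    // Count how many times each word occurs in the given text.
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue;                       // skip empty strings from leading delimiters
            }
            counts.merge(word, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("the cat saw the dog");
        System.out.println(counts.get("the")); // 2
    }
}
```

`Map.merge` handles both the "first time we see this word" and the "seen it before" cases in one call, which is why no explicit `containsKey` check is needed.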

Since this is homework, here are a few hints:

  1. The Scanner class can be used as a tokenizer
  2. A Multiset (or Bag) can be used to count words

And a little detail about the approach that can be taken.

Scanner as a tokenizer

The Scanner class takes a source, such as an InputStream or File, and can read data a piece at a time, using one of the many next methods which are available.

If we want to use the Scanner as a tokenizer, we can tell it how it should split up the text in order to form the tokens.

There is a Scanner.useDelimiter(String) or Scanner.useDelimiter(Pattern) method which can tell the Scanner to split tokens in a certain way by using regular expressions.

Once the Scanner is properly configured, one can obtain tokens by calling the next method until we run out of text in the text file. (The terminating condition of this loop can be determined by Scanner.hasNext.)
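The configure-then-loop pattern described above can be sketched as follows (reading from a String for brevity; constructing the Scanner from a File works the same way):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerTokens {
    // Collect tokens from the given text, splitting on non-word characters.
    public static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Scanner scanner = new Scanner(text);
        scanner.useDelimiter("\\W+");   // treat any run of non-word characters as a separator
        while (scanner.hasNext()) {     // Scanner.hasNext is the loop's terminating condition
            result.add(scanner.next());
        }
        scanner.close();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("one, two; three!")); // [one, two, three]
    }
}
```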

Using a Multiset (or Bag) to count words

A data structure called a multiset (or bag) can be used to keep track of the words (or tokens) which have occurred.

A multiset is like a set, but can hold multiple occurrences of each element. In implementations I've seen, an element's multiplicity is available by calling some method.

For example, using the Multiset implementation available in Google's Guava library, the Multiset.count(Object) method will return the multiplicity of the given object.

So what does this all mean?

We could use a Multiset to keep track of the count of tokens that appear in the text file read by the Scanner.

By placing the tokens from the Scanner into the Multiset, we could end up with a count of the number of times each token has been encountered in the text file.

From there, we could iterate over the tokens and find the tokens with a count of 2 or more, which are the tokens that were repeated in the text file.
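Putting the two hints together might look something like this. This sketch uses a plain HashMap standing in for Guava's Multiset (in case Guava isn't on the classpath); the class and method names are mine:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;
import java.util.TreeSet;

public class RepeatedTokens {
    // Feed Scanner tokens into a map of counts (a stand-in for Guava's Multiset),
    // then keep the tokens whose count is 2 or more.
    public static Set<String> repeated(String text) {
        Map<String, Integer> counts = new HashMap<>();
        Scanner scanner = new Scanner(text);
        scanner.useDelimiter("\\W+");
        while (scanner.hasNext()) {
            counts.merge(scanner.next().toLowerCase(), 1, Integer::sum);
        }
        scanner.close();

        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= 2) {   // multiplicity of 2 or more means the token repeated
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(repeated("To be, or not to be")); // [be, to]
    }
}
```

With Guava, the filtering step would use `Multiset.count(Object)` instead of iterating the map's entries; the shape of the loop is otherwise the same.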

An alternate approach?

Here's an alternative, from an alternative interpretation of the question:

... and count the number of words repeated in the file ...

If all we need is strictly a "count of repeated words", then there is an alternative approach.

A Set could be used to keep track of tokens which have already been encountered in the file.

On each new token, before we attempt to add the token into the Set, we could check whether the token already exists by using the Set.contains(Object) method.

If the word already exists, then we can increment a counter which keeps track of repeated tokens.
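That interpretation can be sketched like this (names are mine; the counter goes up once per repeated occurrence, not once per repeated word):

```java
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public class RepeatCounter {
    // Count how many tokens have been seen before, using a Set of seen tokens.
    public static int countRepeats(String text) {
        Set<String> seen = new HashSet<>();
        int repeats = 0;
        Scanner scanner = new Scanner(text);
        scanner.useDelimiter("\\W+");
        while (scanner.hasNext()) {
            String token = scanner.next().toLowerCase();
            if (seen.contains(token)) {   // already encountered: this occurrence is a repeat
                repeats++;
            } else {
                seen.add(token);          // first encounter: remember it
            }
        }
        scanner.close();
        return repeats;
    }

    public static void main(String[] args) {
        System.out.println(countRepeats("a b a c b a")); // 3
    }
}
```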

If this was not the intention of the question, then it should be mentioned that using precise wording to communicate intent is important, as people who read the question can interpret it in many different ways! ;)

Find out how to stream a file from a path by googling it (below is the first link I found; if it is not good, there are lots more...).

http://www.homeandlearn.co.uk/java/read_a_textfile_in_java.html

Then, create an ArrayList of ArrayLists. For each new word, add one entry (i.e. a new ArrayList with index 0 set to the word) to the outer ArrayList, and for each repeated word, add an entry to the corresponding inner ArrayList. Once this is complete for the entire text document, iterate through the ArrayList as needed.
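A rough sketch of that nested-list idea (names are mine; note the linear scan of the outer list on every word, which is why the Map-based answers above are usually preferable):

```java
import java.util.ArrayList;

public class NestedLists {
    // Group word occurrences: each inner list holds every occurrence of one word.
    public static ArrayList<ArrayList<String>> group(String[] words) {
        ArrayList<ArrayList<String>> groups = new ArrayList<>();
        for (String word : words) {
            ArrayList<String> match = null;
            for (ArrayList<String> g : groups) {
                if (g.get(0).equals(word)) {   // index 0 identifies the word
                    match = g;
                    break;
                }
            }
            if (match == null) {               // new word: start a new inner list
                match = new ArrayList<>();
                groups.add(match);
            }
            match.add(word);                   // every occurrence grows the word's list
        }
        return groups;
    }

    public static void main(String[] args) {
        ArrayList<ArrayList<String>> g = group(new String[] {"a", "b", "a"});
        System.out.println(g.get(0).size()); // 2, since "a" occurred twice
    }
}
```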

Forget tokenizers

Just use the String.split method. It splits a string into a String array and removes the need for the tokenizer class.

Use a Scanner to read in individual lines from the file.

Use a hash table to count the individual words; this assumes that extra punctuation on words doesn't matter.

When the Scanner is done reading the file, display each key/value pair where the value is greater than 1.
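The three steps above can be sketched as follows (reading from a String for brevity; the names are mine):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class SplitCount {
    // Count words line by line with String.split, then keep counts greater than 1.
    public static Map<String, Integer> repeatedWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        Scanner scanner = new Scanner(text);
        while (scanner.hasNextLine()) {
            for (String word : scanner.nextLine().split("\\s+")) {  // split on whitespace
                if (!word.isEmpty()) {
                    counts.merge(word.toLowerCase(), 1, Integer::sum);
                }
            }
        }
        scanner.close();
        counts.values().removeIf(v -> v <= 1);  // keep only words seen more than once
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(repeatedWords("the dog\nsaw the cat")); // {the=2}
    }
}
```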
