Count the number of repeated words in a text file using Java

How do you open a text file in Java from a path and count the number of repeated words in the file using tokenizers?

E.g., I want to open a file using its path name and be able to read it and count the repeated words in the file.

badpanda is half-right: there is lots of info already out there on how to read words from a file. Don't take his suggestion of using ArrayLists though - all you need is one of the Map implementations (HashMap or TreeMap). Each key is a word in the file, each value is the current count of that word.
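A minimal sketch of the Map idea, assuming the words have already been read into some collection. The class name WordCountMap and the method countWords are made up for illustration; Map.merge is the standard library call doing the counting.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

class WordCountMap {
    // Each key is a word, each value is the number of times it has been seen.
    static Map<String, Integer> countWords(Iterable<String> words) {
        Map<String, Integer> counts = new TreeMap<>(); // TreeMap keeps the keys sorted
        for (String word : words) {
            // Inserts 1 for a new word, otherwise adds 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("the", "cat", "saw", "the", "other", "cat")));
        // prints {cat=2, other=1, saw=1, the=2}
    }
}
```

Swapping TreeMap for HashMap would work the same way, only without the sorted key order.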

Since this is homework, here are a few hints:

  1. The Scanner class can be used as a tokenizer
  2. A Multiset (or Bag) can be used to count words

And a little bit of detail about the approach which can be taken.

Scanner as a tokenizer

The Scanner class takes a source, such as an InputStream or File, and can read data a piece at a time, using one of the many next methods which are available.

If we want to use the Scanner as a tokenizer, we could tell it the way it should split up the text in order to make the tokens.

There is a Scanner.useDelimiter(String) or Scanner.useDelimiter(Pattern) method which can tell the Scanner to split tokens up in a certain way by using regular expressions.

Once the Scanner is properly configured, one can obtain tokens by calling the next method until we run out of text in the text file. (The terminating condition of this loop could be determined by Scanner.hasNext.)
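The loop described above might look roughly like this. It is only a sketch: the class name ScannerTokenizer, the delimiter pattern (any run of characters that are not letters, digits or apostrophes) and the lower-casing of tokens are my own assumptions, not something the question specifies.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerTokenizer {
    // Reads the file at the given path and returns its tokens in order.
    static List<String> tokenize(Path path) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(path, "UTF-8")) {
            // Assumed delimiter: one or more characters that cannot be part of a word.
            scanner.useDelimiter("[^\\p{L}\\p{N}']+");
            while (scanner.hasNext()) {
                // Assumption: "Hello" and "hello" should count as the same word.
                tokens.add(scanner.next().toLowerCase());
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ScannerTokenizer <file>");
            return;
        }
        System.out.println(tokenize(Paths.get(args[0])));
    }
}
```

Because the delimiter pattern ends in +, consecutive punctuation and whitespace are treated as a single separator, so no empty tokens come back.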

Using a Multiset (or Bag) to count words

A data structure called a multiset (or bag) can be used to keep track of words (or tokens) which may have occurred.

A multiset is like a set, but it can hold multiple occurrences of each element. In the implementations I've seen, the multiplicity of an element is available by calling some method.

For example, using the Multiset implementation available in Google's Guava library, the Multiset.count(Object) method will return the multiplicity of the given object.

So what does this all mean?

We could use a Multiset to keep track of the count of tokens that appear in the text file read by the Scanner.

By placing the tokens from the Scanner into the Multiset, we could end up with a count of the number of times each token has been encountered in the text file.

From there, we could iterate over the tokens and find those with a count of 2 or more, which are the tokens that were repeated in the text file.
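Since Guava is not part of the standard library, here is a sketch that uses a tiny hand-rolled stand-in instead. The Bag class and the RepeatedTokens helper are hypothetical names made up for this example; with Guava, HashMultiset.create() together with add, count and elementSet would play the same role.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical minimal multiset: each element is mapped to its multiplicity.
class Bag<E> {
    private final Map<E, Integer> counts = new HashMap<>();

    void add(E element) {
        counts.merge(element, 1, Integer::sum);
    }

    // Multiplicity of the element, or 0 if it was never added.
    int count(Object element) {
        return counts.getOrDefault(element, 0);
    }

    // The distinct elements, each listed once.
    Set<E> elementSet() {
        return counts.keySet();
    }
}

class RepeatedTokens {
    // Returns, in sorted order, the tokens that occur at least twice.
    static Set<String> repeated(Iterable<String> tokens) {
        Bag<String> bag = new Bag<>();
        for (String token : tokens) {
            bag.add(token);
        }
        Set<String> result = new TreeSet<>();
        for (String token : bag.elementSet()) {
            if (bag.count(token) >= 2) {
                result.add(token);
            }
        }
        return result;
    }
}
```

Feeding it the tokens produced by the Scanner loop gives the set of repeated words; bag.count(token) gives how often each one appeared.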

An alternate approach?

Here's an alternative, from an alternative interpretation of the question:

... and count the number of words repeated in the file ...

If all we need is strictly a "count of repeated words", then there is an alternative approach.

A Set could be used to keep track of tokens which have already been encountered in the file.

On each new token, before we attempt to add the token into the Set, we could check if the token already exists by using the Set.contains(Object) method.

If the word already exists, then we can increment a counter which keeps track of repeated tokens.
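As a rough sketch of that idea (RepeatCounter and countRepeats are names invented for the example), assuming the tokens are already available as a sequence:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class RepeatCounter {
    // Counts the tokens that had already been seen earlier in the sequence.
    static int countRepeats(Iterable<String> tokens) {
        Set<String> seen = new HashSet<>();
        int repeats = 0;
        for (String token : tokens) {
            if (seen.contains(token)) {
                repeats++;          // token was encountered before
            } else {
                seen.add(token);    // first occurrence, remember it
            }
        }
        return repeats;
    }

    public static void main(String[] args) {
        System.out.println(countRepeats(Arrays.asList("a", "b", "a", "a")));
        // prints 2
    }
}
```

Note that this counts every occurrence after the first, so a word that appears three times contributes 2. If each repeated word should only count once, a second Set of already-counted words would be needed.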

If this was not the intention of the question, then it should be mentioned that using precise wording to communicate intent is important, as people who read the question can interpret the question in many different ways! ;)

Find out how to stream a file from a path by googling it (below is the first link I found; if it is not good there are lots more...).

http://www.homeandlearn.co.uk/java/read_a_textfile_in_java.html

Then, create an ArrayList of ArrayLists. Add one entry (i.e. a new ArrayList with index 0 set to the word) to the outer ArrayList for each new word, and add an entry to the corresponding inner ArrayList for each repeated word. Once this is complete for the entire text document, iterate through the outer ArrayList as needed.

Forget tokenizers

Just use the String.split method. It splits a string into a String array, and negates the need for the tokenizer class.

Use a Scanner to read in individual lines from the file.

Use a hash table to count the individual words; this assumes that extra punctuation attached to words doesn't matter.

When the Scanner is done reading the file, display each key/value pair where the value is greater than 1.
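Put together, those steps could look something like the following. Treat it as a sketch: the class name RepeatedWordReport is made up, the file is assumed to be UTF-8, and splitting on whitespace (the regex \\s+) is just one reasonable choice.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

class RepeatedWordReport {
    // Reads the file line by line and counts each whitespace-separated word.
    static Map<String, Integer> countWords(Path path) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (Scanner scanner = new Scanner(path, "UTF-8")) {
            while (scanner.hasNextLine()) {
                for (String word : scanner.nextLine().split("\\s+")) {
                    // Blank lines and leading whitespace produce an empty string; skip it.
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RepeatedWordReport <file>");
            return;
        }
        // Display only the words that occurred more than once.
        for (Map.Entry<String, Integer> entry : countWords(Paths.get(args[0])).entrySet()) {
            if (entry.getValue() > 1) {
                System.out.println(entry.getKey() + ": " + entry.getValue());
            }
        }
    }
}
```

Punctuation stays attached to the words here, so "cat" and "cat," are counted separately, in line with the assumption stated above.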
