
Count the number of repeated words in a text file using Java

Original 2010-12-31 05:43:16 6 4 java

How do you open a text file in Java from a path and count the number of words repeated in the file using tokenizers?

E.g., I want to open a file using its path name, and be able to read it and count the repeated words in the file.

4 answers

badpanda is half-right: there is lots of info already out there on how to read words from a file. Don't take his suggestion of using ArrayLists though - all you need is one of the Map implementations (HashMap or TreeMap). Each key is a word in the file, each value is the current count of that word.
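As a rough sketch of the Map idea (the class and method names here are made up for illustration), the counting step on its own could look like this, assuming the words have already been read from the file:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the name is only for illustration.
final class WordCountMap {
    // Each key is a word; each value is how many times it has been seen so far.
    static Map<String, Integer> countWords(Iterable<String> words) {
        // TreeMap keeps the keys sorted; a HashMap would work just as well.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            // Store 1 for a new word, otherwise add 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {cat=1, dog=1, saw=1, the=2}
        System.out.println(countWords(List.of("the", "cat", "saw", "the", "dog")));
    }
}
```

Any word whose value ends up greater than 1 is a repeated word.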

Since this is homework, here are a few hints:

The Scanner class can be used as a tokenizer
A Multiset (or Bag) can be used to count words

And a little bit of detail about the approach which can be taken.

Scanner as a tokenizer

The Scanner class takes a source, such as an InputStream or a File, and can read data a piece at a time, using one of the many next methods which are available.

If we want to use the Scanner as a tokenizer, we could tell it the way it should split up the text in order to make the tokens.

There is a Scanner.useDelimiter(String) or Scanner.useDelimiter(Pattern) method which can tell the Scanner to split tokens up in a certain way by using regular expressions.

Once the Scanner is properly configured, we can obtain tokens by calling the next method until we run out of text in the text file. (The terminating condition of this loop could be determined by Scanner.hasNext.)
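A minimal sketch of that loop might look like the following. The delimiter regex `\W+` (any run of non-word characters) and the lower-casing are assumptions on my part, not something the question specifies; note that this regex also splits "don't" into "don" and "t".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class; the name is only for illustration.
final class ScannerTokenizer {
    // Reads every token from the scanner, treating any run of non-word
    // characters (spaces, punctuation, line breaks) as a delimiter.
    static List<String> tokens(Scanner scanner) {
        scanner.useDelimiter("\\W+"); // assumed definition of "word boundary"
        List<String> result = new ArrayList<>();
        while (scanner.hasNext()) {                   // stop when the input runs out
            result.add(scanner.next().toLowerCase()); // so "The" and "the" match
        }
        return result;
    }

    public static void main(String[] args) {
        // For a real file, build the Scanner from a path instead, e.g.
        // new Scanner(java.nio.file.Paths.get("words.txt")), which can throw IOException.
        try (Scanner scanner = new Scanner("The cat saw the dog. The dog ran!")) {
            System.out.println(tokens(scanner));
        }
    }
}
```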

Using a Multiset (or Bag) to count words

A data structure called a multiset (or bag) can be used to keep track of words (or tokens) which may have occurred.

A multiset is like a set, but it can hold multiple occurrences of the same element. In the implementations I've seen, the multiplicity of an element is available by calling some method.

For example, using the Multiset implementation available in Google's Guava library, the Multiset.count(Object) method will return the multiplicity of the given object.

So what does this all mean?

We could use a Multiset to keep track of the count of tokens that appear in the text file read by the Scanner.

By placing the tokens from the Scanner into the Multiset, we could end up with a count of the number of times each token has been encountered in the text file.

From there, we could iterate over the tokens and find those which have a count of 2 or more, which are the tokens that were repeated in the text file.
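To make the idea concrete without pulling in a library, here is a hand-rolled bag. To be clear, SimpleBag is a hypothetical stand-in written for this sketch, not Guava's Multiset; with Guava you would presumably create a HashMultiset and call its add and count methods instead.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A hypothetical, minimal bag: same idea as a multiset, but not Guava's class.
final class SimpleBag<E> {
    private final Map<E, Integer> counts = new HashMap<>();

    // Adds one occurrence of the element.
    void add(E element) {
        counts.merge(element, 1, Integer::sum);
    }

    // Multiplicity of the element; 0 if it was never added.
    int count(E element) {
        return counts.getOrDefault(element, 0);
    }

    // Elements that were added at least twice, i.e. the repeated ones.
    Set<E> repeated() {
        Set<E> result = new HashSet<>();
        for (Map.Entry<E, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= 2) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Feeding every token from the Scanner into add and then calling repeated() gives the repeated words; count tells you how often each one occurred.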

An alternate approach?

Here's an alternative, from an alternative interpretation of the question:

... and count the number of words repeated in the file ...

If all we need is strictly a "count of repeated words", then there is an alternative approach.

A Set could be used to keep track of tokens which have already been encountered in the file.

On each new token, before we attempt to add the token into the Set, we could check if the token already exists by using the Set.contains(Object) method.

If the word already exists, then we can increment a counter which keeps track of repeated tokens.
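A small sketch of this Set-based variant, with made-up class and method names, assuming the tokens are already available as strings:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; the name is only for illustration.
final class RepeatCounter {
    // Counts how many tokens are repeats of a token seen earlier.
    static int countRepeats(Iterable<String> tokens) {
        Set<String> seen = new HashSet<>();
        int repeats = 0;
        for (String token : tokens) {
            if (seen.contains(token)) {
                repeats++;        // already encountered: this is a repeat
            } else {
                seen.add(token);  // first occurrence: remember it
            }
        }
        return repeats;
    }
}
```

Note that this counts repeat occurrences, so a word that appears three times contributes 2. If you want the number of distinct words that repeat, one option would be to collect the repeats in a second Set and return its size.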

If this was not the intention of the question, then it should be mentioned that using precise wording to communicate intent is important, as people who read the question can interpret the question in many different ways! ;)

Find out how to stream a file from a path by googling it (below is the first link I found; if it is not good there are lots more...).

http://www.homeandlearn.co.uk/java/read_a_textfile_in_java.html

Then, create an ArrayList of ArrayLists. Add one entry (i.e. a new ArrayList with index 0 set to the word) to the outer ArrayList for each new word, and add an entry to the corresponding inner ArrayList for each repeated word. Once this is complete for the entire text document, iterate through the outer ArrayList as needed.
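For what it's worth, my reading of that list-of-lists description is roughly the following sketch (names are invented here). Each inner list holds all occurrences of one word, so its size is that word's count:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is only for illustration.
final class ListOfLists {
    static List<List<String>> group(Iterable<String> words) {
        List<List<String>> groups = new ArrayList<>();
        for (String word : words) {
            // Look for the inner list whose first element is this word.
            List<String> match = null;
            for (List<String> group : groups) {
                if (group.get(0).equals(word)) {
                    match = group;
                    break;
                }
            }
            if (match == null) {          // new word: start a new inner list
                match = new ArrayList<>();
                groups.add(match);
            }
            match.add(word);              // index 0 for a new word, otherwise a repeat
        }
        return groups;
    }
}
```

Every word triggers a linear search over the outer list, which is one reason a Map is usually the simpler and faster choice.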

Forget tokenizers

Just use the String.split method. It splits a string into a String array, and negates the need for the tokenizer class.
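For example, splitting one line on whitespace could be sketched like this. The trim and the empty check are my additions: without them, leading whitespace or a blank line produces an empty string as the first array element.

```java
// Hypothetical helper class; the name is only for illustration.
final class SplitWords {
    // Splits a line into words on runs of whitespace.
    static String[] words(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new String[0];      // a blank line has no words
        }
        return trimmed.split("\\s+");  // one or more whitespace characters
    }
}
```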

Use a Scanner to read in individual lines from the file.

Use a hash table to count the individual words. This assumes that extra punctuation on words doesn't matter.

When the Scanner is done reading the file, display each key/value pair where the value is greater than 1.
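Put together, those steps might look something like the sketch below. The class name is invented, the file path is taken from the first command-line argument as an assumption, and punctuation is left attached to words, as the answer allows.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

// Hypothetical program; the name is only for illustration.
final class RepeatedWordsInFile {
    // Reads the file line by line and counts each whitespace-separated word.
    static Map<String, Integer> countWords(Path file) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (Scanner scanner = new Scanner(file)) {
            while (scanner.hasNextLine()) {
                for (String word : scanner.nextLine().trim().split("\\s+")) {
                    if (!word.isEmpty()) {   // a blank line yields one empty string
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);      // assumed: path passed as first argument
        for (Map.Entry<String, Integer> entry : countWords(file).entrySet()) {
            if (entry.getValue() > 1) {      // only show the repeated words
                System.out.println(entry.getKey() + " " + entry.getValue());
            }
        }
    }
}
```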

how do i count words in a text file using java

How to find total count of Words, total count of Vowels, total count of Special Character in a text file using java 8

Count specific words from text file - Java

Count all the words in a file using java Streams

How to count words in a text file, java 8-style

Java program to count lines, words, and chars from a text given file

Java program to count characters, words, and lines from a text file

Java program to count lines, char, and words from a text file

searching in text file specific words using java

Find and replace words in a text file using Java


The technical post webpages of this site follow the CC BY-SA 4.0 license. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.
