
Compare two text files and see how many times the words in the 2nd file occur in the 1st file

I am trying to find out how many times the words in text2.txt appear in text1.txt. When I run my code, it only prints that the words in text2.txt appear in text1.txt 0 times.

Some of text1.txt looks like this:

2 A well-made but emotionally scattered film whose hero gives his heart only to the dog .     
2 Those who love Cinema Paradiso will find the new scenes interesting , but few will find the movie improved .

Some of text2.txt looks like this:

will
dog
the
movie
find

Here is my code:

try {    
File file1 = new File("text1.txt");
File file2 = new File("text2.txt");
Scanner scan1 = new Scanner(file1);
Scanner scan2 = new Scanner(file2);
String text1;
String text2;
int wordCount = 0;
while(scan1.hasNext() && scan2.hasNext()) {
    text1 = scan1.nextLine();
    text2 = scan2.nextLine();
    if(text1.contains(text2)) {
        wordCount++;

    }
    System.out.println(file2 + " appears in " + file1 + " " + wordCount +" times");

}
} catch(Exception e) {
        System.out.println("Error! \n" + e + "\n");
    }
}

nextLine() returns a string up to the newline character.

In your sample, text1.txt has two lines and text2.txt has more. Right now you only check whether the first line of text1.txt contains the first word of text2.txt, then whether the second line of text1.txt contains the second word of text2.txt, and so on. The loop also stops as soon as either file runs out of lines.

You should iterate through every word of the first file and compare it with every word of the second file, one by one. You can achieve that by reading the words of each file into two arrays and then using two nested for loops.

Also, contains() only returns true if the string appears at least once, so if a word appears twice in the same line you would not know that.
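The nested-loop idea above could be sketched roughly as follows. This is only an illustration, not the asker's final code: the class name NestedLoopWordCount and the helpers readWords and countMatches are made up for this example, lists are used instead of arrays for brevity, and tokens are compared with exact, case-sensitive equality.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class, named for this example only.
class NestedLoopWordCount {

    // Reads every whitespace-separated token of a file into a list.
    static List<String> readWords(File file) throws FileNotFoundException {
        List<String> words = new ArrayList<>();
        try (Scanner scan = new Scanner(file)) {
            while (scan.hasNext()) {
                words.add(scan.next()); // next() returns one token, not a whole line
            }
        }
        return words;
    }

    // Counts how many tokens of `text` are equal to some word in `wanted`.
    static int countMatches(List<String> text, List<String> wanted) {
        int count = 0;
        for (String token : text) {
            for (String word : wanted) {
                if (token.equals(word)) {
                    count++;
                    break; // count each token once, even if `wanted` has duplicates
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws FileNotFoundException {
        List<String> text = readWords(new File("text1.txt"));
        List<String> wanted = readWords(new File("text2.txt"));
        System.out.println("Words from text2.txt appear in text1.txt "
                + countMatches(text, wanted) + " times");
    }
}
```

Because every token of text1.txt is compared separately, a word that occurs twice on the same line is counted twice, which contains() on a whole line cannot tell you.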

If text2.txt is a dictionary, it needs to be read into a set of words first.

Then, while reading the contents of text1.txt, split each line into words, check whether each word is in the dictionary, and if so, count its occurrence. The result should be a frequency map.

Using the Stream API, the implementation may look as follows:

Set<String> dictionary = Files
    .lines(Paths.get("text2.txt")) // Stream<String>
    .collect(Collectors.toSet()); // assuming each line is a separate word

Map<String, Long> freqMap = Files
    .lines(Paths.get("text1.txt")) // Stream<String> multi-word lines
    .flatMap(s -> Arrays.stream(s.split("\\s+"))) // Stream<String> words separated with one or more whitespaces
    .filter(dictionary::contains) // Stream<String> - keep only dictionary words
    .collect(Collectors.groupingBy(
        w -> w, // or Function.identity()
        Collectors.counting() // count frequency as long
    ));

// output the map sorted by descending frequency and words
freqMap.entrySet()
    .stream()
    .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
        .thenComparing(Map.Entry.comparingByKey())
    ) // Stream<Map.Entry<String, Long>>
    .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));

Depending on the case-sensitivity requirements, you may need to convert the words in the dictionary to a single case (lower or upper) and apply the same conversion to the words being checked:

Set<String> dictionary = Files
    .lines(Paths.get("text2.txt")) // Stream<String>
    .map(String::toLowerCase)
    .collect(Collectors.toSet()); // assuming each line is a separate word

// in freqMap..
// ...
    .filter(word -> dictionary.contains(word.toLowerCase())) // Stream<String>
// ...
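For reference, the case-insensitive variant could be assembled into one runnable sketch like the one below. The class name CaseInsensitiveFrequency and the method countWords are invented for this example. It also makes two choices the snippets above do not: words are lower-cased before grouping, so that "The" and "the" land in the same map entry, and the streams returned by Files.lines are closed with try-with-resources.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical class, named for this example only.
class CaseInsensitiveFrequency {

    // Builds a frequency map of dictionary words found in the lines, ignoring case.
    // Assumes the dictionary already contains lower-case words only.
    static Map<String, Long> countWords(Stream<String> textLines, Set<String> lowerCaseDictionary) {
        return textLines
            .flatMap(line -> Arrays.stream(line.trim().split("\\s+"))) // lines -> words
            .map(String::toLowerCase)                                  // normalize before lookup and grouping
            .filter(lowerCaseDictionary::contains)                     // keep only dictionary words
            .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) throws IOException {
        Set<String> dictionary;
        // assuming each line of text2.txt is a separate word
        try (Stream<String> words = Files.lines(Paths.get("text2.txt"))) {
            dictionary = words
                .map(String::trim)
                .filter(w -> !w.isEmpty()) // skip blank lines
                .map(String::toLowerCase)
                .collect(Collectors.toSet());
        }
        try (Stream<String> lines = Files.lines(Paths.get("text1.txt"))) {
            countWords(lines, dictionary)
                .forEach((word, n) -> System.out.println(word + ": " + n));
        }
    }
}
```

A TreeMap is used here only so that the output is ordered by word; sorting by descending frequency works the same way as in the earlier snippet.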
