How to find the total count of words, vowels, and special characters in a text file using Java 8

I have a text file and I want to find:
- the total word count in the file
- the total vowel count in the file
- the total special character count in the file

using Java 8 Streams.

If possible, I want the output as a Map produced in a single iteration, i.e.

{"totalWordCount":10,"totalVowelCount":10,"totalSpecialCharacter":10}

I tried the code below:

    Long wordCount=Files.lines(child).parallel().flatMap(line -> Arrays.stream(line.trim().split(" ")))
                            .map(word -> word.replaceAll("[^a-zA-Z]", "").toLowerCase().trim())
                            .filter(word -> !word.isEmpty())
                            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).values().stream().reduce(0L, Long::sum);

but it gives me only the total word count. Is it possible to return a single map that contains all the counts, as in the output above?

If we only had to count special characters and vowels, we could use something like this:

    Map<String,Long> result;
    try(Stream<String> lines = Files.lines(path)) {
        result = lines
            .flatMap(Pattern.compile("\\s+")::splitAsStream)
            .flatMapToInt(String::chars)
            .filter(c -> !Character.isAlphabetic(c) || "aeiou".indexOf(c) >= 0)
            .mapToObj(c -> "aeiou".indexOf(c)>=0? "totalVowelCount": "totalSpecialCharacter")
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

First we flatten the stream of lines to a stream of words, then to a stream of characters, so that we can group the characters by their type. This works smoothly because “special character” and “vowel” are mutually exclusive. In principle, the flattening to words could be omitted if we extended the filter to skip white-space characters, but here it helps us get to a solution that also counts words.
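The white-space-filtering variant mentioned above can be sketched as follows. This is only an illustration under some assumptions: the class name VowelAndSpecialCounter, the count method, and the in-memory sample lines are hypothetical stand-ins for Files.lines(path), and Character.isWhitespace is used as the white-space test, which is close to but not identical to the \s character class. Like the snippet above, it matches lowercase vowels only.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class VowelAndSpecialCounter {
    /** Counts vowels and special characters without splitting the lines into words first. */
    static Map<String, Long> count(Stream<String> lines) {
        return lines
            .flatMapToInt(String::chars)
            // skip white-space instead of splitting on it
            .filter(c -> !Character.isWhitespace(c))
            // keep vowels and non-alphabetic characters, drop all other letters
            .filter(c -> !Character.isAlphabetic(c) || "aeiou".indexOf(c) >= 0)
            .mapToObj(c -> "aeiou".indexOf(c) >= 0 ? "totalVowelCount" : "totalSpecialCharacter")
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // hypothetical sample input standing in for Files.lines(path)
        System.out.println(count(Stream.of("Hello, world!", "it's 42")));
    }
}
```

Note that digits count as special characters here, because they are neither white-space nor alphabetic.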

Since words are a different kind of entity than characters, counting them in the same operation is not that straightforward. One solution is to inject a pseudo character for each word and count it at the end just like the other characters. Since all actual character values are non-negative, we can use -1 for that:

    Map<String,Long> result;
    try(Stream<String> lines = Files.lines(path)) {
        result = lines.flatMap(Pattern.compile("\\s+")::splitAsStream)
            // leading white-space (or a blank line) can yield an empty token; do not count it as a word
            .filter(w -> !w.isEmpty())
            .flatMapToInt(w -> IntStream.concat(IntStream.of(-1), w.chars()))
            .mapToObj(c -> c==-1? "totalWordCount": "aeiou".indexOf(c)>=0? "totalVowelCount":
                    Character.isAlphabetic(c)? "totalAlphabetic": "totalSpecialCharacter")
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

This adds a "totalAlphabetic" category to the result map in addition to the others. If you do not want that, you can insert a .filter(cat -> !cat.equals("totalAlphabetic")) step between the mapToObj and collect steps, or use a filter like the one in the first solution before the mapToObj step.
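To make the pseudo-character idea easier to try out, here is a minimal self-contained sketch that includes the optional filter step. The class TextStats, the helper categorize, and the sample input are hypothetical names introduced for illustration. The sketch also skips empty tokens, on the assumption that a leading run of white-space should not count as a word.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class TextStats {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Maps a character (or the -1 word marker) to the name of its category. */
    static String categorize(int c) {
        if (c == -1) return "totalWordCount";                 // pseudo character, one per word
        if ("aeiou".indexOf(c) >= 0) return "totalVowelCount";
        return Character.isAlphabetic(c) ? "totalAlphabetic" : "totalSpecialCharacter";
    }

    static Map<String, Long> count(Stream<String> lines) {
        return lines
            .flatMap(WHITESPACE::splitAsStream)
            .filter(w -> !w.isEmpty())                        // ignore empty tokens
            .flatMapToInt(w -> IntStream.concat(IntStream.of(-1), w.chars()))
            .mapToObj(TextStats::categorize)
            .filter(cat -> !cat.equals("totalAlphabetic"))    // drop the extra category
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // hypothetical sample input standing in for Files.lines(path)
        System.out.println(count(Stream.of("  Hello, big world!", "", "a e i")));
    }
}
```

Pulling the ternary chain into a named method is a readability choice only; the categories and their order of precedence are the same as in the snippet above.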

As an additional note, this solution does more work than necessary: it splits the input into lines first, even though line breaks can be treated just like any other white-space, i.e. as word boundaries. Starting with Java 9, we can use Scanner for the job:

    Map<String,Long> result;
    try(Scanner scanner = new Scanner(path)) {
        result = scanner.findAll("\\S+")
            .flatMapToInt(w -> IntStream.concat(IntStream.of(-1), w.group().chars()))
            .mapToObj(c -> c==-1? "totalWordCount": "aeiou".indexOf(c)>=0? "totalVowelCount":
                    Character.isAlphabetic(c)? "totalAlphabetic": "totalSpecialCharacter")
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

This splits the input into words right away, without treating line breaks specially. A linked answer contains a Java 8 compatible implementation of Scanner.findAll.
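If Java 9 is not available, one possible substitute (not the implementation from the linked answer) relies on the fact that Scanner implements Iterator<String> and, with its default delimiter, iterates over white-space separated tokens. The sketch below wraps that iterator in a stream. The names ScannerWordStats, words, and count are hypothetical, and the scanner reads an in-memory string here instead of a file. Scanner's default delimiter uses Java's own notion of white-space, which may differ from \s for unusual characters.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class ScannerWordStats {
    /** Streams the scanner's white-space separated tokens; works on Java 8. */
    static Stream<String> words(Scanner scanner) {
        return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(scanner, Spliterator.ORDERED | Spliterator.NONNULL),
            false);
    }

    static Map<String, Long> count(Scanner scanner) {
        return words(scanner)
            // -1 is the pseudo character that marks the start of a word
            .flatMapToInt(w -> IntStream.concat(IntStream.of(-1), w.chars()))
            .mapToObj(c -> c == -1 ? "totalWordCount" : "aeiou".indexOf(c) >= 0 ? "totalVowelCount"
                : Character.isAlphabetic(c) ? "totalAlphabetic" : "totalSpecialCharacter")
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // hypothetical in-memory input; new Scanner(path) would read a file instead
        try (Scanner scanner = new Scanner("one two\n\n  three, four!\n")) {
            System.out.println(count(scanner));
        }
    }
}
```

Because the scanner never returns empty tokens, blank lines and leading white-space do not inflate the word count.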

The solutions above treat every character that is neither white-space nor alphabetic as a “special character”. If your definition of “special character” is different, it should not be too hard to adapt the solutions.
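One way such an adaptation might look is to pass the definition in as a predicate. The sketch below is an assumption-laden illustration, not part of the solutions above: CustomSpecialCounter, its count method, and the notLetterOrDigit predicate are hypothetical, and the example definition (anything that is neither a letter nor a digit) is just one possibility.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.function.IntPredicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CustomSpecialCounter {
    /** Counts vowels and whatever the caller-supplied predicate considers special. */
    static Map<String, Long> count(Stream<String> lines, IntPredicate isSpecial) {
        return lines
            .flatMap(Pattern.compile("\\s+")::splitAsStream)
            .flatMapToInt(String::chars)
            // keep only vowels and characters matching the custom definition
            .filter(c -> "aeiou".indexOf(c) >= 0 || isSpecial.test(c))
            .mapToObj(c -> "aeiou".indexOf(c) >= 0 ? "totalVowelCount" : "totalSpecialCharacter")
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // hypothetical definition: digits no longer count as special characters
        IntPredicate notLetterOrDigit = c -> !Character.isLetterOrDigit(c);
        System.out.println(count(Stream.of("it's 42, ok?"), notLetterOrDigit));
    }
}
```

The vowel test runs first, so a predicate that happens to match a vowel still leaves that character in the vowel category.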
