
Splitting Strings in a stream in Java?

I have an assignment where we're reading text files and counting the occurrences of each word (ignoring punctuation). We don't have to use streams, but I want to practice using them.

So far I am able to read a text file and put each line in a string, and all the strings in a list using this:

try (Stream<String> p = Files.lines(FOLDER_OF_TEXT_FILES)) {
    list = p.map(line -> line.replaceAll("[^A-Za-z0-9 ]", ""))
            .collect(Collectors.toList());
}

However, so far, it simply makes all the lines a single String, so each element of the list is not a word, but a line. Is there a way using streams that I can have each element be a single word, using something like String's split method with regex? Or will I have to handle this outside the stream itself?

I may have misunderstood your question, but if you just want comma-separated words, you can try the code below. Replace line.replaceAll("[^A-Za-z0-9 ]", "") with Arrays.asList(line.replaceAll("[^A-Za-z0-9 ]", "").split(" ")).stream().collect(Collectors.joining(","))

Then use the joining collector again on the list to get a single comma-separated String of words.

String commaSeperated = list.stream().collect(Collectors.joining(",")) ;

You can perform further operations on the final string as per your requirement.
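As a rough illustration of what the replacement expression does to one line, here is a small self-contained sketch. The class name JoinDemo and the helper toCsv are made up for this example. Note that the result is still one String per line (with commas between the words), not one list element per word, so for counting words the flatMap-based answers are probably closer to what was asked.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class JoinDemo {
    // Hypothetical helper: strips punctuation, splits on single spaces,
    // and re-joins the words of one line with commas.
    static String toCsv(String line) {
        return Arrays.stream(line.replaceAll("[^A-Za-z0-9 ]", "").split(" "))
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(toCsv("Hello, big world!")); // prints Hello,big,world
    }
}
```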

Try this:

String fileName = "file.txt";
// try-with-resources closes the file handle opened by Files.lines
try (Stream<String> lines = Files.lines(Path.of(fileName))) {
    Map<String, Long> wordCount = lines
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            .filter(w -> w.matches("[a-zA-Z]+"))
            .sorted(Comparator.comparing(String::length)
                    .thenComparing(String.CASE_INSENSITIVE_ORDER))
            .collect(Collectors.groupingBy(w -> w,
                    LinkedHashMap::new, Collectors.counting()));
    wordCount.entrySet().forEach(System.out::println);
} catch (Exception e) {
    e.printStackTrace();
}

This is relatively simple. It just splits on whitespace and counts the words by putting them in a map where the key is the word and the value is a Long containing the count.

I included a filter to keep only words made up entirely of letters. Note that a token with punctuation attached, such as "down.", fails this filter and is not counted at all. The way this works is that the lines are put into a stream. Each line is then split into words using String.split. Since this produces an array per line, Arrays.stream turns each array into a stream, and flatMap merges these individual streams of words into a single stream for further processing. The workhorse is Collectors.groupingBy, which groups the values by key. In this case, I specified Collectors.counting() as the downstream collector, so the count increases each time the key (i.e. the word) appears.

As an option, I sorted the words first on length and then alphabetically, ignoring case.
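To try the pipeline without a file, here is a minimal sketch that runs the same steps on an in-memory list of lines. The class name GroupingDemo, the method countWords and the sample sentences are made up for illustration; the stream operations are the ones from the snippet above.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class GroupingDemo {
    // Counts letter-only words; the LinkedHashMap keeps the sorted encounter
    // order (shorter words first, then alphabetical ignoring case).
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(w -> w.matches("[a-zA-Z]+"))
                .sorted(Comparator.comparing(String::length)
                        .thenComparing(String.CASE_INSENSITIVE_ORDER))
                .collect(Collectors.groupingBy(w -> w,
                        LinkedHashMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the cat sat", "the dog sat down.");
        // "down." is dropped by the filter because of the trailing period
        System.out.println(countWords(lines)); // prints {cat=1, dog=1, sat=2, the=2}
    }
}
```

If words with trailing punctuation should be counted too, the punctuation would have to be stripped before the filter, as some of the other answers do.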

Instead of applying replaceAll on a line, do it on words of the line as follows:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class Main {
    public static void main(String[] args) {
        String str = "Harry is a good cricketer. Tanya is an intelligent student. Bravo!";
        List<String> words = Arrays.stream(str.split("\\s+")).map(s -> s.replaceAll("[^A-Za-z0-9 ]", ""))
                .collect(Collectors.toList());
        System.out.println(words);
    }
}

Output:

[Harry, is, a, good, cricketer, Tanya, is, an, intelligent, student, Bravo]

Note: The regex \\s+ splits a string on one or more whitespace characters.

First, for each line, we remove all non-alphanumeric characters (except spaces), then we split on whitespace, so all elements are single words. Since we're flat-mapping, the stream consists of all words. Then we simply collect using the groupingBy collector, with counting() as the downstream collector. That leaves us with a Map<String, Long> where the key is the word and the value is the number of occurrences.

Map<String, Long> wordCounts = p
    .flatMap(line -> Arrays.stream(line.replaceAll("[^0-9A-Za-z ]+", "").split("\\s+")))
    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
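Here is a self-contained sketch of the same strip-then-split approach on in-memory lines. The class name StripThenSplit, the method countWords and the sample data are invented for the example. One thing the snippet above does not handle: an empty line splits into a single empty token, so this sketch adds a filter for empty strings, which is my own addition.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StripThenSplit {
    static Map<String, Long> countWords(Stream<String> lines) {
        return lines
                // remove everything except letters, digits and spaces, then split
                .flatMap(line -> Arrays.stream(
                        line.replaceAll("[^0-9A-Za-z ]+", "").split("\\s+")))
                // added in this sketch: "".split("\\s+") yields one empty token
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
                countWords(Stream.of("Harry is good.", "", "Tanya is good, too!"));
        System.out.println(counts.get("is"));      // prints 2
        System.out.println(counts.get("good"));    // prints 2
        System.out.println(counts.containsKey("")); // prints false
    }
}
```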

Since line boundaries are irrelevant when you want to process words, the preferred way is not to split the file into lines and then split the lines into words, but to split the file into words in the first place. You can use something like:

Map<String,Long> wordsAndCounts;
try(Scanner s = new Scanner(Paths.get(path))) {
    wordsAndCounts = s.findAll("\\w+")
        .collect(Collectors.groupingBy(MatchResult::group, Collectors.counting()));
}
wordsAndCounts.forEach((w,c) -> System.out.println(w+":\t"+c));

The findAll method of Scanner requires Java 9 or newer. This answer (linked in the original post) contains an implementation of findAll for Java 8. This allows you to use it on Java 8 and easily migrate to newer versions by just switching to the standard method.
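To experiment with findAll without a file, a Scanner can also be built over a String. The following is only a sketch (Java 9 or newer); the class name FindAllDemo, the method countWords and the sample text are made up. Keep in mind that \w also matches digits and underscores.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.stream.Collectors;

class FindAllDemo {
    // Counts every run of word characters in the given text.
    static Map<String, Long> countWords(String text) {
        try (Scanner s = new Scanner(text)) {
            return s.findAll("\\w+")
                    .collect(Collectors.groupingBy(MatchResult::group, Collectors.counting()));
        }
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords("to be, or not to be: that is the question");
        System.out.println(counts.get("to")); // prints 2
        System.out.println(counts.get("be")); // prints 2
    }
}
```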

One could use Pattern.splitAsStream to split a string efficiently and, at the same time, drop the non-word characters around the whitespace before creating a map of occurrence counts:

Pattern splitter = Pattern.compile("(\\W*\\s+\\W*)+");
String fileStr = Files.readString(Path.of(FOLDER_OF_TEXT_FILES));

Map<String, Long> collect = splitter.splitAsStream(fileStr)
        .collect(groupingBy(Function.identity(), counting()));

System.out.println(collect);

For splitting and removing non-word characters we use the pattern (\W*\s+\W*)+, which looks for optional non-word characters, then one or more whitespace characters, and then optional non-word characters again.
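A small sketch to see how this pattern behaves on a plain String (the class name SplitAsStreamDemo, the method countWords and the sample text are invented here). Because the pattern requires at least one whitespace character, punctuation that is not next to whitespace stays attached to its word, for example a period at the very end of the text with no newline after it.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SplitAsStreamDemo {
    private static final Pattern SPLITTER = Pattern.compile("(\\W*\\s+\\W*)+");

    static Map<String, Long> countWords(String text) {
        return SPLITTER.splitAsStream(text)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // No whitespace after the final period, so "again." keeps it
        Map<String, Long> a = countWords("Hello, world! Hello again.");
        System.out.println(a.get("Hello"));          // prints 2
        System.out.println(a.containsKey("again.")); // prints true

        // With a trailing newline, the period is part of the delimiter
        Map<String, Long> b = countWords("Hello, world! Hello again.\n");
        System.out.println(b.containsKey("again"));  // prints true
    }
}
```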

For the entire "read a text file and count each word using streams", I suggest using something like this:

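try (Stream<String> lines = Files.lines(FOLDER_OF_TEXT_FILES)) {
    // keep the result; the map holds each word and its number of occurrences
    Map<String, Long> wordCounts = lines.flatMap(l -> Arrays.stream(l.split(" ")))
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}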

There is no need to first collect everything into a list; this can be done inline.
Also, it's good that you used try-with-resources.
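One detail to watch with split(" "): consecutive spaces produce empty tokens, which would then be counted as a "word". Below is a hedged sketch of a variant that splits on \\s+ and filters out empty strings instead. The class name SplitComparison, the method countWords and the sample lines are made up for this example, and the in-memory Stream.of stands in for Files.lines.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SplitComparison {
    static Map<String, Long> countWords(Stream<String> lines) {
        return lines
                .flatMap(l -> Arrays.stream(l.split("\\s+")))
                // a line with leading whitespace still yields one empty first token
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println("a  b".split(" ").length);     // prints 3 ("a", "", "b")
        System.out.println("a  b".split("\\s+").length);  // prints 2 ("a", "b")

        Map<String, Long> counts = countWords(Stream.of("to be  or", "  not to be"));
        System.out.println(counts.get("to"));       // prints 2
        System.out.println(counts.containsKey("")); // prints false
    }
}
```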

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 