简体   繁体   中英

Count regex matches with streams

I am trying to count the number of matches of a regex pattern with a simple Java 8 lambdas/streams based solution. For example for this pattern/matcher :

final Pattern pattern = Pattern.compile("\\d+");
final Matcher matcher = pattern.matcher("1,2,3,4");

There is the method splitAsStream which splits the text on the given pattern instead of matching the pattern. Although it's elegant and preserves immutability, it's not always correct :

// count is 4, correct
final long count = pattern.splitAsStream("1,2,3,4").count();

// count is 0, wrong
final long count = pattern.splitAsStream("1").count();

I also tried (ab)using an IntStream . The problem is I have to guess how many times I should call matcher.find() instead of until it returns false.

final long count = IntStream
        .iterate(0, i -> matcher.find() ? 1 : 0)
        .limit(100)
        .sum();

I am familiar with the traditional solution while (matcher.find()) count++; where count is mutable. Is there a simple way to do that with Java 8 lambdas/streams ?

To use the Pattern::splitAsStream properly you have to invert your regex. That means instead of having \\\\d+ (which would split on every number) you should use \\\\D+ . This gives you ever number in your String.

final Pattern pattern = Pattern.compile("\\D+");
// count is 4
long count = pattern.splitAsStream("1,2,3,4").count();
// count is 1
count = pattern.splitAsStream("1").count();

The rather contrived language in the javadoc of Pattern.splitAsStream is probably to blame.

The stream returned by this method contains each substring of the input sequence that is terminated by another subsequence that matches this pattern or is terminated by the end of the input sequence.

If you print out all of the matches of 1,2,3,4 you may be surprised to notice that it is actually returning the commas , not the numbers.

    System.out.println("[" + pattern.splitAsStream("1,2,3,4")
            .collect(Collectors.joining("!")) + "]");

prints [!,!,!,] . The odd bit is why it is giving you 4 and not 3 .

Obviously this also explains why "1" gives 0 because there are no strings between numbers in the string.

A quick demo:

private void test(Pattern pattern, String s) {
    System.out.println(s + "-[" + pattern.splitAsStream(s)
            .collect(Collectors.joining("!")) + "]");
}

public void test() {
    final Pattern pattern = Pattern.compile("\\d+");
    test(pattern, "1,2,3,4");
    test(pattern, "a1b2c3d4e");
    test(pattern, "1");
}

prints

1,2,3,4-[!,!,!,]
a1b2c3d4e-[a!b!c!d!e]
1-[]

You can extend AbstractSpliterator to solve this:

static class SpliterMatcher extends AbstractSpliterator<Integer> {
    private final Matcher m;

    public SpliterMatcher(Matcher m) {
        super(Long.MAX_VALUE, NONNULL | IMMUTABLE);
        this.m = m;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Integer> action) {
        boolean found = m.find();
        if (found)
            action.accept(m.groupCount());
        return found;
    }
}

final Pattern pattern = Pattern.compile("\\d+");

Matcher matcher = pattern.matcher("1");
long count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 1

matcher = pattern.matcher("1,2,3,4");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 4


matcher = pattern.matcher("foobar");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 0

Shortly, you have a stream of String and a String pattern : how many of those strings match with this pattern ?

final String myString = "1,2,3,4";
Long count = Arrays.stream(myString.split(","))
      .filter(str -> str.matches("\\d+"))
      .count();

where first line can be another way to stream List<String>().stream() , ...

Am I wrong ?

Java 9

You may use Matcher#results() to get hold of all matches:

Stream<MatchResult> results()
Returns a stream of match results for each subsequence of the input sequence that matches the pattern. The match results occur in the same order as the matching subsequences in the input sequence.

Java 8 and lower

Another simple solution based on using a reverse pattern:

String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1

Here, all non-digits are removed from the start and end of a string, and then the string is split by non-digit sequences without reporting any empty trailing whitespace elements (since 0 is passed as a limit argument to split ).

See this demo :

String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);    // => 1
System.out.println("1,2,3".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);// => 3
System.out.println("hz 1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("1 hz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("xxx 1 223 zzz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);//=>2

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM