
Count regex matches with streams

I am trying to count the number of matches of a regex pattern with a simple solution based on Java 8 lambdas and streams. For example, given this pattern and matcher:

final Pattern pattern = Pattern.compile("\\d+");
final Matcher matcher = pattern.matcher("1,2,3,4");

There is the method splitAsStream, which splits the text on the given pattern instead of matching the pattern. Although it is elegant and preserves immutability, it does not always give the correct count:

// count is 4, correct
final long count = pattern.splitAsStream("1,2,3,4").count();

// count is 0, wrong
final long count = pattern.splitAsStream("1").count();

I also tried (ab)using an IntStream. The problem is that I have to guess how many times I should call matcher.find(), instead of calling it until it returns false.

final long count = IntStream
        .iterate(0, i -> matcher.find() ? 1 : 0)
        .limit(100)
        .sum();

I am familiar with the traditional solution while (matcher.find()) count++; where count is mutable. Is there a simple way to do this with Java 8 lambdas/streams?

To use Pattern::splitAsStream properly you have to invert your regex. That means that instead of \\d+ (which splits on every number) you should use \\D+. This gives you every number in your String.

final Pattern pattern = Pattern.compile("\\D+");
// count is 4
long count = pattern.splitAsStream("1,2,3,4").count();
// count is 1
count = pattern.splitAsStream("1").count();
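One caveat with the inverted pattern: if the input starts with a non-digit, the split leaves an empty leading element that would be counted as well. A minimal sketch of a guard against that, assuming a hypothetical helper named countNumbers (the class and method names are illustrative, not from the original answer):

```java
import java.util.regex.Pattern;

final class InvertedSplitCount {
    // Splitting on runs of non-digits leaves the digit runs as stream elements.
    private static final Pattern NON_DIGITS = Pattern.compile("\\D+");

    // Hypothetical helper: counts digit runs in the input. The filter drops
    // the empty leading element produced when the input starts with a delimiter.
    static long countNumbers(String input) {
        return NON_DIGITS.splitAsStream(input)
                .filter(s -> !s.isEmpty())
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countNumbers("1,2,3,4")); // 4
        System.out.println(countNumbers(",1,2"));    // 2
        System.out.println(countNumbers("1"));       // 1
    }
}
```

The filter costs little and keeps the result independent of whether the text begins with a digit or a separator.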

The rather contrived language in the javadoc of Pattern.splitAsStream is probably to blame.

The stream returned by this method contains each substring of the input sequence that is terminated by another subsequence that matches this pattern or is terminated by the end of the input sequence.

If you print out all of the elements the stream produces for 1,2,3,4 you may be surprised to notice that it is actually returning the commas, not the numbers.

    System.out.println("[" + pattern.splitAsStream("1,2,3,4")
            .collect(Collectors.joining("!")) + "]");

prints [!,!,!,]. The odd bit is why it gives you 4 and not 3: the output shows that the first element is the empty string in front of the leading 1, followed by the three commas.

This also explains why "1" gives 0: there are no strings between numbers in the string.

A quick demo:

private void test(Pattern pattern, String s) {
    System.out.println(s + "-[" + pattern.splitAsStream(s)
            .collect(Collectors.joining("!")) + "]");
}

public void test() {
    final Pattern pattern = Pattern.compile("\\d+");
    test(pattern, "1,2,3,4");
    test(pattern, "a1b2c3d4e");
    test(pattern, "1");
}

prints

1,2,3,4-[!,!,!,]
a1b2c3d4e-[a!b!c!d!e]
1-[]

You can extend AbstractSpliterator to solve this:

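// Requires: java.util.Spliterators.AbstractSpliterator,
// java.util.function.Consumer, java.util.regex.Matcher
static class SpliterMatcher extends AbstractSpliterator<Integer> {
    private final Matcher m;

    public SpliterMatcher(Matcher m) {
        // The number of matches is unknown up front, so report Long.MAX_VALUE.
        super(Long.MAX_VALUE, NONNULL | IMMUTABLE);
        this.m = m;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Integer> action) {
        // Emit one element per successful find(); stop once find() fails.
        boolean found = m.find();
        if (found)
            action.accept(m.groupCount());
        return found;
    }
}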

final Pattern pattern = Pattern.compile("\\d+");

Matcher matcher = pattern.matcher("1");
long count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 1

matcher = pattern.matcher("1,2,3,4");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 4


matcher = pattern.matcher("foobar");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 0
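The same idea can expose the matched text itself, not only a count. The following is a sketch under assumptions: the class MatchStreams and the method matches are hypothetical names, and it emits m.group() instead of m.groupCount(), so the stream carries each matched substring.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class MatchStreams {
    // Hypothetical helper: a sequential stream of every substring matched by the pattern.
    static Stream<String> matches(Pattern pattern, CharSequence input) {
        Matcher m = pattern.matcher(input);
        Spliterator<String> spliterator = new Spliterators.AbstractSpliterator<String>(
                Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
            @Override
            public boolean tryAdvance(Consumer<? super String> action) {
                if (!m.find()) {
                    return false;          // no more matches, end of stream
                }
                action.accept(m.group());  // pass the matched text downstream
                return true;
            }
        };
        // The Matcher is stateful, so the stream must stay sequential.
        return StreamSupport.stream(spliterator, false);
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        List<String> found = matches(digits, "a1b22c333").collect(Collectors.toList());
        System.out.println(found);                            // [1, 22, 333]
        System.out.println(matches(digits, "foobar").count()); // 0
    }
}
```

Counting is then just matches(pattern, input).count(), and the same stream can feed map or collect when the matched values are needed.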

In short, you have a stream of String and a String pattern: how many of those strings match the pattern?

final String myString = "1,2,3,4";
Long count = Arrays.stream(myString.split(","))
      .filter(str -> str.matches("\\d+"))
      .count();

where the first line can be replaced by any other source of a string stream, e.g. List<String>.stream(), ...

Am I wrong?

Java 9

You may use Matcher#results() to get hold of all matches:

Stream<MatchResult> results()
Returns a stream of match results for each subsequence of the input sequence that matches the pattern. The match results occur in the same order as the matching subsequences in the input sequence.
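Counting with this method could look like the sketch below. It requires Java 9 or later; the class ResultsCount and the helper countMatches are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

final class ResultsCount {
    // Hypothetical helper: counts the matches of the pattern in the input (Java 9+).
    static long countMatches(Pattern pattern, CharSequence input) {
        return pattern.matcher(input).results().count();
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        System.out.println(countMatches(digits, "1,2,3,4")); // 4
        System.out.println(countMatches(digits, "1"));       // 1
        System.out.println(countMatches(digits, "foobar"));  // 0
    }
}
```

Since results() yields MatchResult objects, the matched text is also available through map(MatchResult::group) when more than the count is needed.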

Java 8 and lower

Another simple solution based on using a reverse pattern:

String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1

Here, all non-digits are removed from the start and end of the string, and then the string is split on non-digit sequences without reporting any empty trailing elements (since 0 is passed as the limit argument to split).

See this demo:

String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);    // => 1
System.out.println("1,2,3".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);// => 3
System.out.println("hz 1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("1 hz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("xxx 1 223 zzz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);//=>2
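One edge case is worth guarding when wrapping this in a method: if the input contains no digits at all, the trimmed string is empty, and split on an empty string returns a one-element array, so the expression above would report 1. A minimal sketch with that guard, assuming a hypothetical helper named countNumbers:

```java
final class SplitCount {
    private static final String NON_DIGITS = "\\D+";

    // Hypothetical helper: counts digit runs using the trim-then-split approach.
    static int countNumbers(String input) {
        // Strip non-digits from both ends so split yields no empty leading element.
        String trimmed = input.replaceAll("^" + NON_DIGITS + "|" + NON_DIGITS + "$", "");
        // "".split(...) returns [""], so an empty remainder must be handled explicitly.
        return trimmed.isEmpty() ? 0 : trimmed.split(NON_DIGITS, 0).length;
    }

    public static void main(String[] args) {
        System.out.println(countNumbers("xxx 1 223 zzz")); // 2
        System.out.println(countNumbers("1"));             // 1
        System.out.println(countNumbers("foobar"));        // 0
    }
}
```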
