
Regex Pattern and Matcher issue

I'm not understanding why my regex pattern doesn't seem to work. Here is an example:

String token = "23030G40KT";

Pattern p = Pattern
                .compile("(\\d{3}|VRB)|(\\d{2,3})|(G\\d{2,3})?|(KT|MPS|KMH)");
Matcher m = p.matcher(token);

while(m.find()){
    System.out.println(m.group());
}

That prints out:

230
30
G40

(With two following blank lines that aren't showing here)

I'd like to print:

230
30
G40
KT

with no blank lines. What do I need to change?

You could remove the ? quantifier:

Pattern.compile("(\\d{3}|VRB)|(\\d{2,3})|(G\\d{2,3})|(KT|MPS|KMH)")
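As a quick check of that change, here is a minimal sketch that runs the pattern without the ? over the token from the question. The class name GroupFixDemo and the helper findAll are made up for this illustration; they are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupFixDemo {
    // Same alternation as in the question, minus the '?' after the gust group,
    // so none of the alternatives can match the empty string any more.
    static final Pattern FIXED =
            Pattern.compile("(\\d{3}|VRB)|(\\d{2,3})|(G\\d{2,3})|(KT|MPS|KMH)");

    // Hypothetical helper: collects every successive match that find() reports.
    static List<String> findAll(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints 230, 30, G40 and KT, one per line, with no blank lines.
        findAll(FIXED, "23030G40KT").forEach(System.out::println);
    }
}
```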

The reason your original regex doesn't work is described very well in other answers, such as @Reimeus's. However, I want to help you simplify it further. Your regex looks complicated, but it is actually very simple once you break it down.

Let's talk about what your original regex does:

\\d{3} - Three digits

| - Or

VRB - "VRB"

| - Or

\\d{2,3} - 2 or 3 digits

| - Or

(G\\d{2,3})? - "G" followed by 2 or 3 digits, or nothing at all (because of the ?)

| - Or

(KT|MPS|KMH) - "KT" or "MPS" or "KMH"

So basically you just have a bunch of alternatives ORed together. Some of them overlap (such as "3 digits" and "2 or 3 digits"). Combine them and you get fewer cases, with no grouping needed.

You can get the output you want with this simpler regex:

Pattern.compile("G?\\d{2,3}|KT|MPS|KMH|VRB");
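To see that the shorter pattern finds the same pieces, the sketch below compares it with the grouped pattern (without the ?) on a few tokens. Only the first token comes from the question; "VRB05KT" and "18012MPS" are sample inputs assumed for illustration, and the class and helper names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimplifiedPatternDemo {
    // Grouped alternation with the '?' removed.
    static final Pattern GROUPED =
            Pattern.compile("(\\d{3}|VRB)|(\\d{2,3})|(G\\d{2,3})|(KT|MPS|KMH)");
    // Simplified alternation: the optional G is folded into the digit case.
    static final Pattern SIMPLE =
            Pattern.compile("G?\\d{2,3}|KT|MPS|KMH|VRB");

    // Hypothetical helper: collects every successive match that find() reports.
    static List<String> findAll(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample tokens; only the first one appears in the question.
        for (String token : new String[] { "23030G40KT", "VRB05KT", "18012MPS" }) {
            System.out.println(token + " grouped=" + findAll(GROUPED, token)
                    + " simple=" + findAll(SIMPLE, token));
        }
    }
}
```

Note that neither pattern checks the order of the pieces; both simply pick out anything that looks like one of the parts.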

Addendum to @Reimeus' answer, which is the correct one.

If the regex engine were to follow POSIX, it would always look for the leftmost, longest match. Note: longest.

But Java's regex engine isn't POSIX: when you use an alternation as you do here, it tries the alternatives from left to right and stops at the first one that matches.

For instance, if you try and match regex:

cat|catflap

against input:

catflap

Java's regex engine will match cat. A POSIX regex engine would match catflap.

And POSIX regex engines are a rarity.
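The cat|catflap case is easy to confirm in Java. The following is a small sketch; the class name and the firstMatch helper are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrderDemo {
    // Hypothetical helper: returns the first match of the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The first alternative that matches wins, even though a longer match exists.
        System.out.println(firstMatch("cat|catflap", "catflap")); // prints cat
        // Putting the longer alternative first gives the longer match.
        System.out.println(firstMatch("catflap|cat", "catflap")); // prints catflap
    }
}
```

So with this kind of engine, the order in which you write the alternatives matters.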

In your alternation, (G\\d{2,3})? always matches, because it can match the empty string. So once the engine reaches "KT", that alternative succeeds with an empty match and the last alternative, (KT|MPS|KMH), is never even tried.

The blank lines at the end of your output are these empty matches. Note that after an empty match, the regex engine moves forward one character in the input before searching again (otherwise you'd get an infinite loop!).
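One way to make the empty matches visible is to print the position of each match along with its text. This is a sketch using the pattern from the question; the class name and the describeMatches helper are invented, and the "start-end:text" format is just a convention chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchDemo {
    // The pattern from the question, including the '?' that allows an empty match.
    static final Pattern ORIGINAL =
            Pattern.compile("(\\d{3}|VRB)|(\\d{2,3})|(G\\d{2,3})?|(KT|MPS|KMH)");

    // Describes each match as "start-end:text"; an empty match has start == end.
    static List<String> describeMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Entries with nothing after the colon are the empty matches.
        describeMatches("23030G40KT").forEach(System.out::println);
    }
}
```

The first empty match starts at index 8, exactly where "KT" begins, which shows that the optional group wins before the unit alternative gets a chance.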

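I would rather match the whole token at once and read each part from its capture group:

String token = "23030G40KT";
Pattern p = Pattern.compile("(\\d{3}|VRB)(\\d{2,3})(G\\d{2,3})?(KT|MPS|KMH)");
Matcher m = p.matcher(token);

if (m.matches()) {
    for (int i = 1; i <= m.groupCount(); ++i) {
        // The optional G group is null when the token has no such part.
        if (m.group(i) != null) {
            System.out.println(m.group(i));
        }
    }
}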
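A side effect of matching the whole token is that malformed input is rejected instead of being split into whatever pieces happen to match. The sketch below wraps the same pattern in a helper; the names WindTokenDemo and parse are hypothetical, and "VRB05KT" and "230KT" are sample inputs assumed for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WindTokenDemo {
    // Four consecutive groups; only the third (the G part) is optional.
    static final Pattern WIND =
            Pattern.compile("(\\d{3}|VRB)(\\d{2,3})(G\\d{2,3})?(KT|MPS|KMH)");

    // Hypothetical helper: returns the four groups in order, or null if the
    // token does not match as a whole. The third entry is null when the
    // optional G group did not take part in the match.
    static String[] parse(String token) {
        Matcher m = WIND.matcher(token);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        // Sample tokens; only the first one appears in the question.
        for (String token : new String[] { "23030G40KT", "VRB05KT", "230KT" }) {
            System.out.println(token + " -> " + Arrays.toString(parse(token)));
        }
    }
}
```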
