
Java Regex: Optional Matching

I've been using the following Regex to extract a zip code from a bunch of text:

    "\\d{5}\\-?[1-9]?[1-9]?[1-9]?[1-9]?"

I made the last four [1-9] optional (using ?) so that I could extract both 5-digit zip codes and 5-digit zip codes with the +4 suffix, such as 11001-1010.

However, it only matches the first two digits of the last four numbers even though I put 4 digits at the end.

For example, in the zip code 11001-1010 it would match 11001-10.

Anyone know why?

You can use "\\d{5}\\-\\d{0,4}", which allows you to match 0 to 4 digits after the -.

EDIT

From the comment: "But then the - won't be optional."

For that you can use "\\d{5}(\\-\\d{0,4})?" to make the group of the - and the digits after it optional.
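
As a rough illustration of the optional group, here is a minimal sketch. The class name OptionalSuffixGroup, the helper methods, and the sample inputs are made up for this example; only the pattern comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalSuffixGroup {
    // Pattern from the answer above: five digits, then an optional group
    // holding the dash and up to four digits.
    static final Pattern ZIP = Pattern.compile("\\d{5}(\\-\\d{0,4})?");

    // Returns the first match in the input, or null if there is none.
    static String match(String input) {
        Matcher m = ZIP.matcher(input);
        return m.find() ? m.group() : null;
    }

    // Returns the captured suffix (dash included), or null when the
    // optional group did not take part in the match.
    static String suffix(String input) {
        Matcher m = ZIP.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(match("zip 11001-1010"));  // 11001-1010
        System.out.println(suffix("zip 11001-1010")); // -1010
        System.out.println(match("zip 11001"));       // 11001
        System.out.println(suffix("zip 11001"));      // null
    }
}
```

Because the dash sits inside the group, a bare 5-digit code still matches, and group(1) tells you whether a suffix was present.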

It's stopping at the first 0 in the suffix, because [1-9] in "\\d{5}\\-?[1-9]?[1-9]?[1-9]?[1-9]?" cannot match a 0. So in your example, it only matches up to 11001-1. Does "\\d{5}\\-?[0-9]?[0-9]?[0-9]?[0-9]?" work OK? The other answers are probably cleaner, but that is the bug.
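
To see the bug for yourself, something like the following sketch should do. The class name ZipSuffixBug and the firstMatch helper are invented for the demo; the two patterns are the ones quoted above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZipSuffixBug {
    // Returns the first substring of input matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Original pattern: [1-9] rejects 0, so the suffix stops at the first 0.
        String original = "\\d{5}\\-?[1-9]?[1-9]?[1-9]?[1-9]?";
        // Same shape, but [0-9] accepts every digit.
        String fixed = "\\d{5}\\-?[0-9]?[0-9]?[0-9]?[0-9]?";

        System.out.println(firstMatch(original, "11001-1010")); // 11001-1
        System.out.println(firstMatch(fixed, "11001-1010"));    // 11001-1010
    }
}
```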

Looks ok per this

Simple answer to the question: for zip code 11001-1010, your regex would only match 11001-1, because none of the four optional digits after the - can be 0.

For the unstated question of how to fix that, it depends on whether you only want to match an optional +4, or you want to also match +3, +2, +1, and +0, like your expression would.

Matching Zip5 with optional +4, e.g. matching 11001-1010 and 11001:

    "\\d{5}(?:-\\d{4})?"

Matching Zip5 with optional +N, e.g. matching 11001-1010, 11001-101, 11001-10, 11001-1, 11001-, and 11001:

    "\\d{5}(?:-\\d{0,4})?"

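The difference between the +4 and +N variants shows up on a short suffix. Below is a small sketch under the assumption that you are searching with find(); the class name ZipPlusFour, the constant names STRICT and LENIENT, and the helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZipPlusFour {
    // Zip5 with an optional, exactly-four-digit suffix.
    static final String STRICT = "\\d{5}(?:-\\d{4})?";
    // Zip5 with an optional dash followed by zero to four digits.
    static final String LENIENT = "\\d{5}(?:-\\d{0,4})?";

    // Returns the first substring of input matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(STRICT, "11001-1010")); // 11001-1010
        // A two-digit suffix is not a valid +4, so only the Zip5 part matches.
        System.out.println(firstMatch(STRICT, "11001-10"));   // 11001
        System.out.println(firstMatch(LENIENT, "11001-10"));  // 11001-10
        System.out.println(firstMatch(LENIENT, "11001-"));    // 11001-
        // String.matches requires the whole string to match the pattern.
        System.out.println("11001-10".matches(STRICT));       // false
    }
}
```

If you are validating a whole field rather than searching inside text, matches() is probably the better fit, since it rejects 11001-10 outright under the strict pattern.
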
Update

Now, if you want to make sure it doesn't match the 56789-1234 of 123456789-123456789 or abcd56789-1234qwerty, you can add a word-boundary check:

    "\\b\\d{5}(?:-\\d{4})?\\b"

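For a quick check of the word-boundary version, here is a hedged sketch that collects every match in a string. The class name ZipBoundary, the findAll helper, and the sample sentence are assumptions made for the demo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZipBoundary {
    static final String UNBOUNDED = "\\d{5}(?:-\\d{4})?";
    static final String BOUNDED = "\\b\\d{5}(?:-\\d{4})?\\b";

    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Without \b the digits embedded in a longer token are picked up.
        System.out.println(findAll(UNBOUNDED, "abcd56789-1234qwerty")); // [56789-1234]
        // With \b neither embedded case produces a match.
        System.out.println(findAll(BOUNDED, "abcd56789-1234qwerty"));   // []
        System.out.println(findAll(BOUNDED, "123456789-123456789"));    // []
        // Standalone zip codes in ordinary text are still found.
        System.out.println(findAll(BOUNDED, "Ship to 11001-1010, or 90210.")); // [11001-1010, 90210]
    }
}
```
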