
How to build a regex in Java to detect whitespace or the end of a string?

I am trying to build a regex to find and extract the part of a string that contains a post office box. Here are five examples:

  1. str = "some text po box 12456 Floor 105 streetName Street";
  2. str = "po box 1011";
  3. str = "post office Box 12 Floor 105 Tallapoosa Street";
  4. str = "leclair ryan pc po Box 2499 8th floor 951 east byrd street";
  5. str = "box 1 slot 3 building 2 136 harvey road";

Here is my pattern and code:

Pattern p = Pattern.compile("p.*o.*box \\d+(\\z|\\s)");
Matcher m = p.matcher(str);
int count = 0;
while (m.find()) {
    count++;
    System.out.println("Match number " + count);
    System.out.println("start(): " + m.start());
    System.out.println("end(): " + m.end());
}

It works with the second example but not with the first one. If I change my pattern to the following:

Pattern p = Pattern.compile("p.*o.*box \\d+ ");

It works only for the first example. The question is how to combine the regex for end of string, "\\z", with the regex for whitespace, "\\s" or " ", in a single group.

New Pattern: Pattern p = Pattern.compile("(?i)((p.*o. box\\s \\w\\s*\\d*(\\z|\\s*)|(box\\s*\\w\\s*\\d*(\\z|\\s*)) ))");

There are a couple of items in your regex that look like they need work. From what I understand, you want to extract the PO Box number from strings in the format you've provided. Given that, the following regex will accomplish what you want; an explanation follows. See it in action here: https://regex101.com/r/cQ8lH3/2

Pattern p = Pattern.compile("p\\.?o\\.? box [^ \\r\\n\\t]+");

Firstly, a note on escaping: in a Java string literal every regex backslash is written twice, so \\. in the source is the regex \. (a literal dot). Also, you must escape the dots. If you do not escape them, the regex treats . as ANY single character, whereas \\. matches only a dot symbol.

Next, you need to change the * quantifier after each dot to a ? . Why? The * quantifier matches zero or more of the preceding element, while the ? quantifier matches zero or one.
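To make the difference concrete, here is a minimal sketch comparing the unescaped .* form with the escaped, optional-dot form. The class name DotEscapeDemo, the helper found, and the test string "pick one toolbox" are made up for illustration and are not part of the question.

```java
import java.util.regex.Pattern;

class DotEscapeDemo {
    // Unescaped dots with *: each ".*" may swallow arbitrary text.
    static final Pattern LOOSE = Pattern.compile("p.*o.*box");

    // Escaped, optional dots: only "po box", "p.o box", "po. box" or "p.o. box".
    static final Pattern STRICT = Pattern.compile("p\\.?o\\.? box");

    // True when the pattern occurs anywhere in the input.
    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String noise = "pick one toolbox";                  // not a PO box at all
        System.out.println(found(LOOSE, noise));            // true
        System.out.println(found(STRICT, noise));           // false
        System.out.println(found(STRICT, "p.o. box 12"));   // true
        System.out.println(found(STRICT, "po box 12"));     // true
    }
}
```

The loose pattern accepts the unrelated string because each .* is free to consume "ick ", "ne tool", and so on.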

Finally, rethink how you're matching the box number. Instead of matching all characters AND THEN whitespace, just match everything that isn't whitespace. [^ \\r\\n\\t]+ will match one or more characters that are NOT a space ( ), carriage return ( \\r ), newline ( \\n ), or tab ( \\t ). Therefore it will consume the box number and stop as soon as it hits any of those whitespace characters or the end of the input.

Some of these changes may not be necessary to get your code to work for the examples you gave, but they are the proper way to build the regex you want.
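As a quick check of the pattern above, the following sketch runs it against three of the strings from the question. The class name BoxTokenDemo and the helper firstMatch are hypothetical. Note that the pattern as written is case-sensitive, so the example with a capital "Box" is not matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoxTokenDemo {
    // Pattern from the answer: literal "box", then a run of non-whitespace characters.
    static final Pattern P = Pattern.compile("p\\.?o\\.? box [^ \\r\\n\\t]+");

    // Returns the matched text, or null when nothing matches.
    static String firstMatch(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The match stops at the first space after the number.
        System.out.println(firstMatch("some text po box 12456 Floor 105 streetName Street")); // po box 12456
        // The match also ends cleanly at the end of the input.
        System.out.println(firstMatch("po box 1011")); // po box 1011
        // Capital "Box" is not matched without a case-insensitive flag.
        System.out.println(firstMatch("leclair ryan pc po Box 2499 8th floor 951 east byrd street")); // null
    }
}
```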

You can use the following code:

String str = "some text p.o. box 12456 Floor 105 streetName Street";
Pattern p = Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)(?:\\z|\\s)");
Matcher m = p.matcher(str);
int count = 0;
while (m.find()) {
    count++;
    System.out.println("Match: " + m.group(0));
    System.out.println("Digits: " + m.group(1));
    System.out.println("Match number " + count);
    System.out.println("start(): " + m.start());
    System.out.println("end(): " + m.end());
}

To make the pattern case insensitive, just add the Pattern.CASE_INSENSITIVE flag to the Pattern.compile call or prepend the inline (?i) modifier to the pattern.
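The two ways of enabling case-insensitive matching can be compared side by side. In this sketch the class name CaseFlagDemo and the helper digits are hypothetical; the regex itself is the one from the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseFlagDemo {
    static final String REGEX = "\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)(?:\\z|\\s)";

    // Option 1: pass the flag as the second argument to Pattern.compile.
    static final Pattern WITH_FLAG = Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE);

    // Option 2: prepend the inline modifier to the pattern text.
    static final Pattern INLINE = Pattern.compile("(?i)" + REGEX);

    // Returns the digits captured in group 1, or null when there is no match.
    static String digits(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "leclair ryan pc po Box 2499 8th floor 951 east byrd street";
        System.out.println(digits(WITH_FLAG, s)); // 2499
        System.out.println(digits(INLINE, s));    // 2499
    }
}
```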

Also, .* matches any characters other than a newline, zero or more times; I guess you wanted to match a . optionally. For that you need just the ? quantifier, and you must escape the dot so that it matches a literal dot. Note how I used (...) to capture the digits into Group 1 (this is called a capturing group). The group that matches the end of the string or a whitespace character is a non-capturing group ( (?:...) ), which is used for grouping only and does not store its value. Since you effectively want a word boundary there, I suggest replacing (?:\\z|\\s) with a mere \\b :

Pattern p = Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)\\b"); 
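Running this pattern over the five strings from the question shows that it extracts the number from examples 1, 2 and 4, but not from example 3 ("post office Box") or example 5 (bare "box"). The sketch below also tries a broader pattern that makes the prefix optional and accepts "post office"; that broader pattern is only an assumption about what the asker wants, not part of the answer, and the names PoBoxExamples, ANSWER, BROADER and boxNumber are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PoBoxExamples {
    // Pattern from the answer above.
    static final Pattern ANSWER =
        Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)\\b");

    // Hypothetical extension (an assumption, not from the answer): the prefix
    // is optional and may also be spelled "post office".
    static final Pattern BROADER = Pattern.compile(
        "(?i)\\b(?:p\\.?\\s*o\\.?\\s*|post\\s+office\\s+)?box\\s*(\\d+)\\b");

    // Returns the captured box number, or null when the pattern does not match.
    static String boxNumber(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] examples = {
            "some text po box 12456 Floor 105 streetName Street",
            "po box 1011",
            "post office Box 12 Floor 105 Tallapoosa Street",
            "leclair ryan pc po Box 2499 8th floor 951 east byrd street",
            "box 1 slot 3 building 2 136 harvey road"
        };
        // Prints: 12456 | 12456, 1011 | 1011, null | 12, 2499 | 2499, null | 1
        for (String s : examples) {
            System.out.println(boxNumber(ANSWER, s) + " | " + boxNumber(BROADER, s));
        }
    }
}
```

Because the broader pattern accepts any standalone word "box" followed by digits, it may also pick up text that is not a post office box, so it would need tightening for real address data.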

