How to build a regex in Java to detect a whitespace or end of a string?

I am trying to build a regex to find and extract the part of a string that contains a post office box. Here are some examples:

  1. str = "some text po box 12456 Floor 105 streetName Street";
  2. str = "po box 1011";
  3. str = "post office Box 12 Floor 105 Tallapoosa Street";
  4. str = "leclair ryan pc po Box 2499 8th floor 951 east byrd street";
  5. str = "box 1 slot 3 building 2 136 harvey road";

Here is my pattern and code:

Pattern p = Pattern.compile("p.*o.*box \\d+(\\z|\\s)");
Matcher m = p.matcher(str);
int count = 0;
while (m.find()) {
    count++;
    System.out.println("Match number " + count);
    System.out.println("start(): " + m.start());
    System.out.println("end(): " + m.end());
}

It works with the second example but not with the first one! If I change my pattern to the following:

Pattern p = Pattern.compile("p.*o.*box \\d+ ");

It works only for the first example. The question is: how do I group the regex for end of string ("\\z") with the regex for whitespace ("\\s" or " ") so that either one is accepted?

New pattern: Pattern p = Pattern.compile("(?i)((p.*o. box\\s \\w\\s*\\d*(\\z|\\s*)|(box\\s*\\w\\s*\\d*(\\z|\\s*)) ))");

There are a couple of items in your regex that look like they need work. From what I understand, you want to extract the PO box number from strings in the format you've provided. Given that, the following regex will accomplish what you want; an explanation follows. See it in action here: https://regex101.com/r/cQ8lH3/2

Pattern p = Pattern.compile("p\\.?o\\.? box [^ \\r\\n\\t]+");

Firstly, at the regex level an escape sequence needs only ONE backslash; it appears doubled in the code above only because a Java string literal requires the backslash itself to be escaped. Also, you must escape the dots. An unescaped dot (.) matches ANY single character, whereas an escaped dot (written \\. in a Java string) matches a literal dot.

Next, you need to change the * quantifier after each dot to a ? quantifier. Why? The * quantifier matches zero or more occurrences of the preceding element, while the ? quantifier matches exactly one occurrence or none.
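To make the difference concrete, here is a small sketch that compares the two quantifiers. The input string, the class name QuantifierDemo, and the helper firstMatch are made up for illustration and do not come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first substring matched by regex in input, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical input with a stray 'p' before the PO box.
        String input = "please hold: po box 12 ";

        // Unescaped dot with *: the match may start at the 'p' of "please"
        // and swallow everything up to "box".
        System.out.println(firstMatch("p.*o.*box \\d+", input));

        // Escaped, optional dots: 'p' and 'o' must be adjacent
        // (apart from an optional literal dot after each).
        System.out.println(firstMatch("p\\.?o\\.? box \\d+", input));
    }
}
```

With this input, the first pattern returns the whole prefix starting at "please", while the second returns only "po box 12".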

Finally, rethink how you're matching the box number. Instead of matching all characters AND THEN whitespace, just match everything that isn't whitespace. The character class [^ \\r\\n\\t]+ (as written in the Java string) matches one or more characters that are NOT a space, carriage return (\\r), newline (\\n), or tab (\\t). It therefore consumes the box number and stops as soon as it hits any whitespace or the end of the input.
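As a rough check of that behaviour, the sketch below runs the pattern against the first two examples from the question plus one tab-separated string. The class name NegatedClassDemo, the helper firstMatch, and the tab-separated input are assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // The pattern suggested in this answer, unchanged.
    static final Pattern PO_BOX = Pattern.compile("p\\.?o\\.? box [^ \\r\\n\\t]+");

    // Returns the first match in input, or null if the pattern does not match.
    static String firstMatch(String input) {
        Matcher m = PO_BOX.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Box number followed by a space: the match stops before the space.
        System.out.println(firstMatch("some text po box 12456 Floor 105 streetName Street"));
        // Box number at the end of the input: the match runs to the end.
        System.out.println(firstMatch("po box 1011"));
        // Box number followed by a tab (made-up input): the match stops before the tab.
        System.out.println(firstMatch("po box 77\tSuite 3"));
    }
}
```

Note that this pattern is case-sensitive and expects the literal text "box" after "po", so inputs such as "po Box 2499" need a case-insensitive variant.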

Some of these changes may not be necessary to get your code to work for the examples you gave, but they are the proper way to build the regex you want.

You can use the following code:

String str = "some text p.o. box 12456 Floor 105 streetName Street";
Pattern p = Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)(?:\\z|\\s)");
Matcher m = p.matcher(str);
int count = 0;
while (m.find()) {
    count++;
    System.out.println("Match: " + m.group(0));   // whole match
    System.out.println("Digits: " + m.group(1));  // captured box number
    System.out.println("Match number " + count);
    System.out.println("start(): " + m.start());
    System.out.println("end(): " + m.end());
}

To make the pattern case-insensitive, either pass the Pattern.CASE_INSENSITIVE flag as the second argument to Pattern.compile or prepend the inline (?i) modifier to the pattern.
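The two options should behave the same way. The sketch below checks that on the fourth example from the question, which spells "Box" with a capital B. The class and method names are hypothetical, and the trailing group is left out of the regex to keep the comparison short.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Core of the pattern used in this answer, without any case-insensitivity marker.
    static final String REGEX = "\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)";

    // Case-sensitive search.
    static boolean findsCaseSensitive(String input) {
        return Pattern.compile(REGEX).matcher(input).find();
    }

    // Case-insensitive search using the compile-time flag.
    static boolean findsWithFlag(String input) {
        return Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE).matcher(input).find();
    }

    // Case-insensitive search using the inline modifier.
    static boolean findsWithInline(String input) {
        return Pattern.compile("(?i)" + REGEX).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "leclair ryan pc po Box 2499 8th floor 951 east byrd street";
        System.out.println("case-sensitive: " + findsCaseSensitive(input));
        System.out.println("flag:           " + findsWithFlag(input));
        System.out.println("inline (?i):    " + findsWithInline(input));
    }
}
```

The flag keeps the regex string itself free of modifiers, while the inline form is handy when the pattern is supplied as a plain string, for example from configuration.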

Also, .* matches any characters other than a newline, zero or more times. I guess you wanted to match a dot optionally, so you just need the ? quantifier, and you need to escape the dot so that it matches a literal dot. Note how I used (...) to capture the digits into Group 1 (this is called a capturing group). The part that matches the end of the string or a whitespace is inside a non-capturing group, (?:...), which is used for grouping only and does not store its value in the memory buffer. Since what you really want there is a word boundary, I suggest replacing (?:\\z|\\s) with a mere \\b:

Pattern p = Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)\\b"); 
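As a quick sanity check, the sketch below runs this final pattern over the five sample strings from the question. The class name PoBoxExtractor and the helper boxNumber are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PoBoxExtractor {
    // Final pattern from this answer: optional dots and spaces, digits in group 1,
    // and a word boundary instead of (?:\\z|\\s).
    static final Pattern PO_BOX =
            Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)\\b");

    // Returns the captured box number, or null when the pattern does not match.
    static String boxNumber(String input) {
        Matcher m = PO_BOX.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "some text po box 12456 Floor 105 streetName Street",
            "po box 1011",
            "post office Box 12 Floor 105 Tallapoosa Street",
            "leclair ryan pc po Box 2499 8th floor 951 east byrd street",
            "box 1 slot 3 building 2 136 harvey road"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + boxNumber(s));
        }
    }
}
```

The pattern finds the number in examples 1, 2, and 4, both mid-string and at the end of the string. Examples 3 ("post office Box") and 5 (bare "box") yield null because they lack the "po" prefix this pattern expects; covering them would need extra alternatives.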
