Java regular expression for finding bullet lists

I'm trying to match any bullet list in a free-text document. A bullet is defined as any number or lowercase character preceded by a word delimiter. For example:

1.  item a
2.  item b

I use the following code to find the bullets:

Pattern p1 = Pattern.compile("\\s[\\d][\\.\\)]\\s");

This works well as long as the bullet list consists of single-digit items. However, as soon as I try multi-digit bullet lists, it stops working (example: 12. item c 13. item d). I tried altering the pattern to

Pattern p1 = Pattern.compile("\\s[\\d]+[\\.\\)]\\s");   

or

Pattern p1 = Pattern.compile("\\s[\\d]\\+[\\.\\)]\\s");

My interpretation of the regex language is that this would match any case where there are 1 or more digits preceding a ".". But this doesn't work.

Can anyone see what I'm doing wrong?

Pattern p1 = Pattern.compile("\\s[\\d]+[\\.\\)]\\s");

(your second version) should work, but you can simplify it:

Pattern p1 = Pattern.compile("\\s\\d+[.)]\\s");

However, it does expect whitespace before the digit (so it won't match at the start of the string, for example). Perhaps a word boundary is useful here:

Pattern p1 = Pattern.compile("\\b\\d+[.)]\\s");

(FYI: Your third example was trying to match a literal + after a single digit. That's why it failed).
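To make the difference between the variants concrete, here is a minimal sketch that runs each pattern from this thread against the asker's sample text. The class name BulletPatternDemo and the helper findAll are made up for illustration; only java.util.regex from the standard library is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question or answer.
class BulletPatternDemo {
    // Collects every substring of text that the given regex matches.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "12. item c 13. item d";
        // Original pattern: exactly one digit between the whitespace and the '.'
        System.out.println(findAll("\\s[\\d][\\.\\)]\\s", text));
        // One or more digits, but leading whitespace is still required
        System.out.println(findAll("\\s\\d+[.)]\\s", text));
        // Word boundary instead of leading whitespace
        System.out.println(findAll("\\b\\d+[.)]\\s", text));
        // Escaped '+': a single digit followed by a literal plus sign
        System.out.println(findAll("\\s[\\d]\\+[\\.\\)]\\s", text));
    }
}
```

On this sample, only the word-boundary version picks up both "12. " and "13. ". The \\s\\d+ version finds " 13. " but misses "12. " because nothing precedes it, and the other two patterns find nothing.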

A simpler regular expression (untested):

\\s(\\d+)[.)]\\s

I assume the problem is that there isn't always whitespace in front of the digits. So change the expression to (Java string version) "\\s*\\d+[\\.\\)]\\s".

Example:

10. aaa //no whitespace before 10 here, thus the leading whitespace has to be optional
11. bbb //here the whitespace should match the new line which counts as whitespace

As for the lower case character version:

"\\s*(?:\\d+|[a-z]+)[\\.\\)]\\s"

where (?:\\d+|[a-z]+) means "a sequence of either digits or lowercase characters".

Note that this would still find a match in 123a., although only the a. part would be matched. To accept a bullet only at the start of a line, add "(?:^|\\n)" (Java string again) at the beginning of the expression, which means the match must start either at the beginning of the text or right after a line break.
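As a rough illustration of the line-start restriction, the sketch below compares the expression with and without the "(?:^|\\n)" prefix. The class name BulletLineDemo, the helper markers, and the sample text are assumptions made for this example; each match is trimmed so that only the bullet marker itself is reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the sample text below is invented for illustration.
class BulletLineDemo {
    static final String LOOSE = "\\s*(?:\\d+|[a-z]+)[\\.\\)]\\s";
    // Same expression, but a match must begin at the start of the text or after a line break.
    static final String ANCHORED = "(?:^|\\n)" + LOOSE;

    // Returns each match with surrounding whitespace (including newlines) trimmed off.
    static List<String> markers(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "10. aaa\n11. bbb\na) ccc\nsee 123a. here";
        System.out.println(markers(LOOSE, text));    // also reports the "a." inside "123a."
        System.out.println(markers(ANCHORED, text)); // only markers that begin a line
    }
}
```

An alternative, not used in the answer above, would be to compile with Pattern.MULTILINE so that ^ itself matches after every line break.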
