
Java regular expression for finding bullet lists

I'm trying to match any bullet list in a free-text document. A bullet is defined as any number or lowercase character preceded by a word delimiter. For example:

1.  item a
2.  item b

I use the following code to find the bullets:

Pattern p1 = Pattern.compile("\\s[\\d][\\.\\)]\\s");

This works well as long as the bullet list consists of single-digit items. However, as soon as I try multi-digit bullet lists, it doesn't work (for example, 12. item c 13. item d). I tried altering the pattern to

Pattern p1 = Pattern.compile("\\s[\\d]+[\\.\\)]\\s");   

or

Pattern p1 = Pattern.compile("\\s[\\d]\\+[\\.\\)]\\s");

My interpretation of the regex language is that this would match any case where one or more digits precede a ".". But this doesn't work.

Can anyone see what I'm doing wrong?

Pattern p1 = Pattern.compile("\\s[\\d]+[\\.\\)]\\s");

(your second version) should work, but you can simplify it:

Pattern p1 = Pattern.compile("\\s\\d+[.)]\\s");

However, it does expect whitespace before the digits (so it won't match at the start of the string, for example). Perhaps a word boundary is useful here:

Pattern p1 = Pattern.compile("\\b\\d+[.)]\\s");
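To make the difference between the two anchors concrete, here is a minimal sketch that runs both patterns over the same text. The class name `BulletBoundaryDemo`, the helper `findBullets`, and the sample text are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BulletBoundaryDemo {
    // Pattern from the answer that requires whitespace before the digits.
    static final Pattern WITH_SPACE = Pattern.compile("\\s\\d+[.)]\\s");
    // Pattern from the answer that uses a word boundary instead.
    static final Pattern WITH_BOUNDARY = Pattern.compile("\\b\\d+[.)]\\s");

    // Hypothetical helper: collects every match, trimmed of surrounding whitespace.
    static List<String> findBullets(Pattern pattern, String text) {
        List<String> bullets = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            bullets.add(m.group().trim());
        }
        return bullets;
    }

    public static void main(String[] args) {
        String text = "1. item a\n2. item b\n12. item c\n13) item d";
        // The first bullet is missed because nothing precedes it.
        System.out.println(findBullets(WITH_SPACE, text));    // [2., 12., 13)]
        // The word boundary also matches at the very start of the text.
        System.out.println(findBullets(WITH_BOUNDARY, text)); // [1., 2., 12., 13)]
    }
}
```

Note that `\b` only requires a transition between a word character and a non-word character, so it would also accept digits that follow punctuation in the middle of a line.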

(FYI: Your third example was trying to match a literal + after a single digit. That's why it failed.)
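The behaviour of the three patterns from the question can be checked with a small sketch like the one below. The class `QuantifierDemo`, the helper `found`, and the sample strings are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Hypothetical helper: true if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String single = " 1. item a";
        String multi = " 12. item c";
        String literalPlus = " 1+. odd";

        // First pattern: exactly one digit before the "." or ")".
        System.out.println(found("\\s[\\d][\\.\\)]\\s", single));         // true
        System.out.println(found("\\s[\\d][\\.\\)]\\s", multi));          // false

        // Second pattern: "+" is a quantifier, so one or more digits match.
        System.out.println(found("\\s[\\d]+[\\.\\)]\\s", multi));         // true

        // Third pattern: "\\+" is an escaped, literal plus sign.
        System.out.println(found("\\s[\\d]\\+[\\.\\)]\\s", multi));       // false
        System.out.println(found("\\s[\\d]\\+[\\.\\)]\\s", literalPlus)); // true
    }
}
```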

A simpler regular expression (untested):

\\s(\\d+)[.)]\\s

I assume the problem is that there isn't always whitespace in front of the digits. So change the expression to (Java string version) "\\s*\\d+[\\.\\)]\\s".

Example:

10. aaa //no whitespace before 10 here, thus the leading whitespace has to be optional
11. bbb //here the whitespace should match the new line which counts as whitespace
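As a rough check of the example above, the following sketch prints the raw matches with line breaks made visible, so it is clear which whitespace each match consumed. The class `OptionalWhitespaceDemo` and the helper `rawMatches` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalWhitespaceDemo {
    // Pattern from the answer: the leading whitespace is optional.
    static final Pattern BULLET = Pattern.compile("\\s*\\d+[\\.\\)]\\s");

    // Hypothetical helper: returns the untrimmed matches, with each
    // line break rewritten as the two characters "\n" for display.
    static List<String> rawMatches(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = BULLET.matcher(text);
        while (m.find()) {
            matches.add(m.group().replace("\n", "\\n"));
        }
        return matches;
    }

    public static void main(String[] args) {
        // "10." has nothing in front of it; "11." is preceded by a line break.
        System.out.println(rawMatches("10. aaa\n11. bbb")); // prints [10. , \n11. ]
    }
}
```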

As for the lowercase character version:

"\\s*(?:\\d+|[a-z]+)[\\.\\)]\\s"

where (?:\\d+|[a-z]+) means "a sequence of either digits or lowercase characters".

Note that this would still find a match in 123a. although only the a. part would be matched. To allow bullet points only at the start of a line, add "(?:^|\\n)" (Java string again) at the beginning of the expression, which means the match must start either at the beginning of the text or after a line break.
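The effect of the line-start prefix can be sketched as follows. The class `AnchoredBulletDemo`, the helper `find`, and the sample text are assumptions for illustration; the two patterns are the ones described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchoredBulletDemo {
    // Digits or lowercase letters, without any line-start requirement.
    static final Pattern UNANCHORED =
            Pattern.compile("\\s*(?:\\d+|[a-z]+)[\\.\\)]\\s");
    // Same pattern, but the match must begin at the start of the text
    // or directly after a line break.
    static final Pattern ANCHORED =
            Pattern.compile("(?:^|\\n)\\s*(?:\\d+|[a-z]+)[\\.\\)]\\s");

    // Hypothetical helper: collects every match, trimmed of whitespace.
    static List<String> find(Pattern pattern, String text) {
        List<String> bullets = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            bullets.add(m.group().trim());
        }
        return bullets;
    }

    public static void main(String[] args) {
        String text = "a. first\nsee e.g. this\nb) second";
        // The unanchored version also picks up the "g." inside "e.g.".
        System.out.println(find(UNANCHORED, text)); // [a., g., b)]
        // The anchored version only accepts bullets at the start of a line.
        System.out.println(find(ANCHORED, text));   // [a., b)]
        // With the prefix, "123a." no longer yields a partial match.
        System.out.println(find(ANCHORED, "123a. foo")); // []
    }
}
```

An alternative with the same intent is to compile the pattern with Pattern.MULTILINE, which makes ^ match after every line terminator as well as at the start of the input.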
