Java regular expression finding bullet lists
I'm trying to match any bullet list in a free text document. Bullet lists are defined as any number or lowercase character preceded by a word delimiter. So for example
1. item a
2. item b
I use the following code to find the bullets:
Pattern p1 = Pattern.compile("\\s[\\d][\\.\\)]\\s");
This works well as long as the bullet list consists of single-digit items. However, as soon as I try multiple-digit bullet lists, it won't work (example
12. item c 13. item d
). I tried altering the pattern to
Pattern p1 = Pattern.compile("\\s[\\d]+[\\.\\)]\\s");
or
Pattern p1 = Pattern.compile("\\s[\\d]\\+[\\.\\)]\\s");
My interpretation of the regex language is that this would match any case where there are 1 or more digits preceding a ".". But this doesn't work.
Can anyone see what I'm doing wrong?
Pattern p1 = Pattern.compile("\\s[\\d]+[\\.\\)]\\s");
(your second version) should work, but you can simplify it:
Pattern p1 = Pattern.compile("\\s\\d+[.)]\\s");
However, it does expect whitespace before the digits (so it won't match at the start of the string, for example). Perhaps a word boundary is useful here:
Pattern p1 = Pattern.compile("\\b\\d+[.)]\\s");
(FYI: Your third example was trying to match a literal + after a single digit. That's why it failed.)
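To make the difference between the three patterns concrete, here is a minimal sketch that runs each of them over the sample line from the question. The class name BulletPatternComparison and the helper findAll are made up for this illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are not from the original post.
class BulletPatternComparison {

    // Returns every substring of text matched by regex, in order of appearance.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "12. item c 13. item d";

        // Requires whitespace before the digits: finds only " 13. ",
        // because "12." sits at the very start of the string.
        System.out.println(findAll("\\s\\d+[.)]\\s", text));

        // Word boundary instead of whitespace: finds "12. " and "13. ".
        System.out.println(findAll("\\b\\d+[.)]\\s", text));

        // Escaped plus: looks for one digit followed by a literal '+',
        // so nothing in this text matches.
        System.out.println(findAll("\\s\\d\\+[.)]\\s", text));
    }
}
```

The escaped-plus pattern would only match input such as " 1+. ", which is probably not what was intended.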
A simpler regex (untested):
\\s(\\d+)[.)]\\s
I assume the problem is that there's not always whitespace in front of the digits. Thus change the expression to (Java string version):
"\\s*\\d+[\\.\\)]\\s"
Example:
10. aaa //no whitespace before 10 here, thus the leading whitespace has to be optional
11. bbb //here the whitespace should match the new line which counts as whitespace
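As a rough check of the optional-whitespace idea, the sketch below applies the pattern to the two example lines joined by a line break. The capturing group around \\d+ is borrowed from the shorter regex above so that only the number is returned; the class name OptionalWhitespaceDemo and the method bulletNumbers are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class OptionalWhitespaceDemo {

    // Leading whitespace is optional (\s*), so a bullet at the very start
    // of the text matches too. Group 1 captures just the digits.
    private static final Pattern BULLET = Pattern.compile("\\s*(\\d+)[.)]\\s");

    // Returns the numeric part of every bullet found in text.
    static List<String> bulletNumbers(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = BULLET.matcher(text);
        while (m.find()) {
            numbers.add(m.group(1));
        }
        return numbers;
    }

    public static void main(String[] args) {
        // "10." has nothing in front of it; "11." is preceded by the newline,
        // which \s* is free to consume.
        System.out.println(bulletNumbers("10. aaa\n11. bbb"));
    }
}
```

Because the leading whitespace is optional, this pattern will also find a number that directly follows other characters, for example the 12 in "item12. x", so it is worth combining with a boundary or line anchor if that matters.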
As for the lower case character version:
"\\s*(?:\\d+|[a-z]+)[\\.\\)]\\s"
where (?:\\d+|[a-z]+) means "a sequence of either digits or lower case characters".
Note that this would still find a match in 123a. even though only the a. part would be matched. To allow only bullet points that begin a line, add
"(?:^|\\n)"
(Java string again) at the beginning of the expression, which means the match must either start at the beginning of the text or after a line break.
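Here is a small sketch of how the line-start prefix changes the result, assuming the bullet marker is wrapped in a capturing group so it can be extracted. The class name LineAnchoredBullets, the method findMarkers and the sample text are all invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class LineAnchoredBullets {

    // Digits or lower case letters followed by '.' or ')' and whitespace.
    static final String UNANCHORED = "\\s*(\\d+|[a-z]+)[.)]\\s";

    // Same expression, but the match must begin at the start of the text
    // or right after a line break.
    static final String ANCHORED = "(?:^|\\n)" + UNANCHORED;

    // Returns group 1 (the bullet marker) of every match of regex in text.
    static List<String> findMarkers(String regex, String text) {
        List<String> markers = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            markers.add(m.group(1));
        }
        return markers;
    }

    public static void main(String[] args) {
        // Without the prefix, the "a." inside "123a." is picked up.
        System.out.println(findMarkers(UNANCHORED, "see 123a. here"));

        // With the prefix, only markers that begin a line are reported.
        String text = "1. first\na) second\nsee 123a. here\n12. third";
        System.out.println(findMarkers(ANCHORED, text));
    }
}
```

An alternative that may read more cleanly is to compile with Pattern.MULTILINE and use a plain ^, which then matches at the start of every line; that is a different technique from the one described above, so treat it as an option rather than a drop-in equivalent.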