Java regex for matching various types of section headers

I am trying to create a regex (for use in Java) to match potential section headings. The section headings can be either Roman numerals (up to 39) or things like "A.3. 10", "3.4", "4", "34.A", etc. But it seems to match only the Roman numerals or only the other part, even though I'm using alternation in the regex. I'm testing via https://regexr.com/.

This is my regex:

(\b(?:(?:X{0,3}(?:I[V|X]|V?I{0,3}))|(?:(?:[0-9]{1,2}|[A-Z])(?:\s?\.\s?(?:[0-9]{1,2}|[A-Z])){0,2}))\b)

Each part (roman numeral vs. letter+digit) seems to behave correctly by itself:

roman numeral:
(\bX{0,3}(?:I[V|X]|V?I{0,3})\b)

letter+digit:
(\b(?:[0-9]{1,2}|[A-Z])(?:\s?\.\s?(?:[0-9]{1,2}|[A-Z])){0,2}\b)

Here are some test cases that should match:

Section A.3
Section A . 34
Section 3 . A
Section 1.2.5.6
Section 1.2.5
Section 1.2. 5
Section 1 . 2 . 5
Section III
Section  XVI
Section IX
Section 3.B
Section 35.C
Section A.B.34
Section 3
Section 34
Section C
Section 34.35
Section A.3.C
Section 3.A.5

Here are some that should not pass:

A common phrase is this though..
Section AB.34
Section AB.5
Section CD
Section 345

Can someone please tell me what I'm missing?

My requirements ended up a bit different from the details in the original question. Below are the two regexes I ended up using for my two different use cases, with links to the test examples they pass:

1) https://regex101.com/r/D9sQGz/2

(\b(?<!\w)(?:[0-9]{1,2}|[A-Z])(?:\s?\.\s?(?:[0-9]{1,2}|[A-Z])){0,2}\b(?<=\w))

2) https://regex101.com/r/v0NjW6/2

(\b(?<!\w)X{0,3}(?:I[VX]|V?I{0,3})\b(?<=\w)(?:\s?\.\s?(?:[0-9]{1,2}|[A-Z])){0,2}\b(?<=\w))
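A note on why the added lookbehinds seem to matter, as far as I can tell: in the combined regex from the question, the roman-numeral alternative can match the empty string, so at every word boundary the first alternative succeeds with a zero-length match and the letter+digit alternative never gets a chance. Requiring `(?<=\w)` after the closing `\b` rules out zero-length matches. The Java sketch below compares the question's combined pattern with pattern 2 above; the class name `EmptyMatchDemo` and the helper `allMatches` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Combined pattern exactly as posted in the question.
    static final Pattern COMBINED = Pattern.compile(
        "(\\b(?:(?:X{0,3}(?:I[V|X]|V?I{0,3}))"
        + "|(?:(?:[0-9]{1,2}|[A-Z])(?:\\s?\\.\\s?(?:[0-9]{1,2}|[A-Z])){0,2}))\\b)");

    // Pattern 2 from above: roman numeral, optionally followed by sub-parts.
    static final Pattern ROMAN_FINAL = Pattern.compile(
        "(\\b(?<!\\w)X{0,3}(?:I[VX]|V?I{0,3})\\b(?<=\\w)"
        + "(?:\\s?\\.\\s?(?:[0-9]{1,2}|[A-Z])){0,2}\\b(?<=\\w))");

    // Hypothetical helper: collects every match find() reports, including empty ones.
    static List<String> allMatches(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Wrap each match in brackets so zero-length matches are visible.
        for (String input : new String[] {"34", "Section III", "Section IX.3"}) {
            StringBuilder combined = new StringBuilder();
            for (String s : allMatches(COMBINED, input)) {
                combined.append('[').append(s).append(']');
            }
            StringBuilder roman = new StringBuilder();
            for (String s : allMatches(ROMAN_FINAL, input)) {
                roman.append('[').append(s).append(']');
            }
            System.out.println(input + " | combined: " + combined + " | pattern 2: " + roman);
        }
    }
}
```

For the input "34", the combined pattern reports only zero-length matches and never "34" itself, while pattern 2 reports nothing at all, which is the intended result for a roman-numeral-only pattern.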

The pattern for letter+digit also matches two of the bad cases:

Section AB.34 -->
Section AB.<word boundary>34<word boundary>
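The word-boundary behaviour sketched above can be reproduced in Java. This is only a minimal check using the letter+digit pattern from the question; the class name `BoundaryDemo` and the helper `firstMatch` are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Letter+digit pattern from the question (outer capture group dropped).
    static final Pattern LETTER_DIGIT = Pattern.compile(
        "\\b(?:[0-9]{1,2}|[A-Z])(?:\\s?\\.\\s?(?:[0-9]{1,2}|[A-Z])){0,2}\\b");

    // Hypothetical helper: returns the first match found anywhere in the input, or null.
    static String firstMatch(String input) {
        Matcher m = LETTER_DIGIT.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The '.' is a non-word character, so there is a word boundary
        // right before "34" and the pattern can start matching there.
        System.out.println(firstMatch("Section AB.34"));
        System.out.println(firstMatch("Section AB.5"));
        // No '.' here, so there is no boundary between 'C' and 'D'.
        System.out.println(firstMatch("Section CD"));
    }
}
```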

Try to get the individual regexes correct and then test again!

There should be no problem in combining two regexes like this (as done in your code):

(?:(?:regex1)|(?:regex2))

Also consider using two regexes and doing the OR operation in Java code. This is simpler to understand for someone who has to read your code later.
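A rough sketch of that two-regex approach might look like the following. It assumes the text after "Section" has already been extracted into its own string and checks it as a whole with `matches()`, so the `\b` anchors are not needed. The class name `SectionLabel` and the method `isSectionLabel` are made up for illustration, and the roman-numeral pattern uses `[VX]` rather than `[V|X]`.

```java
import java.util.regex.Pattern;

class SectionLabel {
    // Roman numerals up to 39 (can also match "", hence the emptiness check below).
    private static final Pattern ROMAN =
        Pattern.compile("X{0,3}(?:I[VX]|V?I{0,3})");

    // One to three parts, each a 1-2 digit number or a single capital letter,
    // separated by '.' with optional single whitespace around it.
    private static final Pattern LETTER_DIGIT =
        Pattern.compile("(?:[0-9]{1,2}|[A-Z])(?:\\s?\\.\\s?(?:[0-9]{1,2}|[A-Z])){0,2}");

    // Hypothetical helper: true if the whole label is a roman numeral
    // or a letter+digit label. The OR is done in Java, not in the regex.
    static boolean isSectionLabel(String label) {
        String s = label.trim();
        if (s.isEmpty()) {
            return false; // guard: ROMAN alone would accept the empty string
        }
        return ROMAN.matcher(s).matches() || LETTER_DIGIT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] labels = {"A.3", "1 . 2 . 5", "XVI", "IX", "34.35", "AB.34", "CD", "345"};
        for (String label : labels) {
            System.out.println(label + " -> " + isSectionLabel(label));
        }
    }
}
```

Because `matches()` requires the entire label to match, a label such as "1.2.5.6" is rejected by this sketch, whereas a `find()`-based search would still pick up the "1.2.5" prefix. Which behaviour is wanted depends on the use case.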
