
Java regex for matching various types of section headers

I am trying to create a regex (for use in Java) to match potential section headings. A heading can be either a Roman numeral (up to 39) or something like "A.3. 10", "3.4", "4", "34.A", etc. However, the combined regex seems to match only the Roman numerals or only the other kind, even though I'm using alternation. I'm testing via https://regexr.com/ .

This is my regex:

(\b(?:(?:X{0,3}(?:I[V|X]|V?I{0,3}))|(?:(?:[0-9]{1,2}|[A-Z])(?:\s?\.\s?(?:[0-9]{1,2}|[A-Z])){0,2}))\b)

Each part (roman numeral vs. letter+digit) seems to be behaving correctly by itself:

roman numeral:
(\bX{0,3}(?:I[V|X]|V?I{0,3})\b)

letter+digit:
(\b(?:[0-9]{1,2}|[A-Z])(?:\s?\.\s?(?:[0-9]{1,2}|[A-Z])){0,2}\b)

Here are some test cases that should match:

Section A.3
Section A . 34
Section 3 . A
Section 1.2.5.6
Section 1.2.5
Section 1.2. 5
Section 1 . 2 . 5
Section III
Section  XVI
Section IX
Section 3.B
Section 35.C
Section A.B.34
Section 3
Section 34
Section C
Section 34.35
Section A.3.C
Section 3.A.5

Here are some that should not pass:

A common phrase is this though..
Section AB.34
Section AB.5
Section CD
Section 345

Can someone please tell me what I'm missing?

My requirements ended up a bit different than the details in the original question. Below are the 2 regexes I ended up using for my 2 different use cases, as well as the test examples they are passing:

1) https://regex101.com/r/D9sQGz/2

(\b(?<!\w)(?:[0-9]{1,2}|[A-Z])(?:\s?\.\s?(?:[0-9]{1,2}|[A-Z])){0,2}\b(?<=\w))

2) https://regex101.com/r/v0NjW6/2

(\b(?<!\w)X{0,3}(?:I[VX]|V?I{0,3})\b(?<=\w)(?:\s?\.\s?(?:[0-9]{1,2}|[A-Z])){0,2}\b(?<=\w))
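To try these two patterns outside regex101, a small Java harness along the following lines may help. It is only a sketch: the class name `FinalPatternsDemo`, the constant names, and the helper `findAll` are made up for illustration, and the sample inputs are taken from or modeled on the test cases in the question. As far as I can tell, the paired `\b(?<!\w)` and `\b(?<=\w)` assertions in the second pattern stop the Roman numeral part from matching the empty string, and that pattern only matches headings that start with a Roman numeral.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FinalPatternsDemo {
    // Pattern 1 from the post above (letter+digit headings), escaped for a Java string.
    static final Pattern LETTER_DIGIT = Pattern.compile(
        "(\\b(?<!\\w)(?:[0-9]{1,2}|[A-Z])(?:\\s?\\.\\s?(?:[0-9]{1,2}|[A-Z])){0,2}\\b(?<=\\w))");

    // Pattern 2 from the post above (Roman numeral, optionally followed by sub-parts).
    static final Pattern ROMAN_LED = Pattern.compile(
        "(\\b(?<!\\w)X{0,3}(?:I[VX]|V?I{0,3})\\b(?<=\\w)(?:\\s?\\.\\s?(?:[0-9]{1,2}|[A-Z])){0,2}\\b(?<=\\w))");

    // Hypothetical helper: collects every substring that find() reports.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] samples = {"Section A . 34", "Section III", "Section IX.3", "Section A.3"};
        for (String s : samples) {
            System.out.println(s
                + " | letter+digit: " + findAll(LETTER_DIGIT, s)
                + " | roman-led: " + findAll(ROMAN_LED, s));
        }
    }
}
```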

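The letter+digit pattern on its own also finds a match inside two of the bad cases (`Section AB.34` and `Section AB.5`), because there is a word boundary between the `.` and the digits: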

Section AB.34 -->
Section AB.<word boundary>34<word boundary>
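The trace above can be reproduced in Java with a few lines. This is a minimal sketch; the class name `LetterDigitCheck` and the helper `firstMatch` are hypothetical, and the pattern is the letter+digit regex from the question, escaped for a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LetterDigitCheck {
    // Letter+digit pattern from the question, unchanged apart from Java escaping.
    static final Pattern LETTER_DIGIT = Pattern.compile(
        "\\b(?:[0-9]{1,2}|[A-Z])(?:\\s?\\.\\s?(?:[0-9]{1,2}|[A-Z])){0,2}\\b");

    // Returns the first substring found by find(), or null if nothing matches.
    static String firstMatch(String input) {
        Matcher m = LETTER_DIGIT.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] badCases = {"Section AB.34", "Section AB.5", "Section CD", "Section 345"};
        for (String s : badCases) {
            // A non-null result means the pattern found something in a line that should fail.
            System.out.println(s + " -> " + firstMatch(s));
        }
    }
}
```

Since `find()` searches for a match anywhere in the input, the `AB.` prefix does not prevent the trailing digits from matching by themselves.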

Try to get the individual regexes correct and then test again!

There should be no problem in combining two regexes like this (as done in your code):

(?:(?:regex1)|(?:regex2))
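One caveat that seems to apply to the combined pattern in the question: the Roman numeral branch `X{0,3}(?:I[V|X]|V?I{0,3})` can match the empty string, so at any word boundary the first alternative may succeed with a zero-length match before the second alternative is ever tried. (Note also that `[V|X]` is a character class, so the `|` inside it is a literal pipe; `[VX]` is probably what was meant.) The sketch below illustrates this. The class name `EmptyAlternativeDemo`, the helper `findAll`, and the `GUARDED` variant with a `(?=[XVI])` lookahead are my own assumptions for illustration, not something from the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyAlternativeDemo {
    // Letter+digit branch, copied from the question.
    static final String LETTER_DIGIT =
        "(?:[0-9]{1,2}|[A-Z])(?:\\s?\\.\\s?(?:[0-9]{1,2}|[A-Z])){0,2}";

    // Combined pattern exactly as posted in the question.
    static final Pattern ORIGINAL = Pattern.compile(
        "(\\b(?:(?:X{0,3}(?:I[V|X]|V?I{0,3}))|(?:" + LETTER_DIGIT + "))\\b)");

    // Assumed variant: the lookahead forces the Roman branch to consume at least one of X, V, I.
    static final Pattern GUARDED = Pattern.compile(
        "\\b(?:(?=[XVI])X{0,3}(?:I[VX]|V?I{0,3})|" + LETTER_DIGIT + ")\\b");

    // Hypothetical helper: collects every substring that find() reports, including empty ones.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Section A.3", "Section III"}) {
            System.out.println(s + " | original: " + findAll(ORIGINAL, s)
                + " | guarded: " + findAll(GUARDED, s));
        }
    }
}
```

With the original pattern, `Section A.3` yields only zero-length matches, which would be consistent with the behaviour described in the question. The guarded variant is just one possible way around this and has not been checked against every test case listed there.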

Also consider using two separate regexes and doing the OR in Java code. That is easier to understand for someone who has to read your code later.
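A rough sketch of that idea is shown here. It assumes the heading token (the text after "Section") has already been isolated and trimmed, which the question does not specify; the class name `SectionHeadings` and the method `isHeading` are hypothetical. Because it uses `matches()`, the whole token must fit one of the patterns, which is stricter than searching with `find()`: a token such as `1.2.5.6` is rejected here rather than partially matched.

```java
import java.util.regex.Pattern;

public class SectionHeadings {
    // Roman numerals up to 39 (XXXIX); note this pattern also accepts the empty string.
    static final Pattern ROMAN = Pattern.compile("X{0,3}(?:I[VX]|V?I{0,3})");

    // One or two digits or a single capital letter, followed by up to two dotted sub-parts.
    static final Pattern LETTER_DIGIT = Pattern.compile(
        "(?:[0-9]{1,2}|[A-Z])(?:\\s?\\.\\s?(?:[0-9]{1,2}|[A-Z])){0,2}");

    // The OR is done in Java instead of inside one large regex.
    static boolean isHeading(String token) {
        if (token.isEmpty()) {
            return false; // guard: ROMAN alone would accept an empty token
        }
        return ROMAN.matcher(token).matches() || LETTER_DIGIT.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String t : new String[] {"A.3", "1 . 2 . 5", "XVI", "AB.34", "CD", "345"}) {
            System.out.println(t + " -> " + isHeading(t));
        }
    }
}
```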
