
Regex to follow pattern except between braces

I am having a tough time figuring out a clean regex (in a JavaScript implementation) that captures as much of a line as it can while following a pattern, except that anything inside braces doesn't need to follow the pattern. I'm not sure of the best way to explain that except by example:

Let's say the pattern is: the line must start with a 0, end at a later 0 (which can be anywhere in the line), and allow only a sequence of 1, 2, or 3 in between, so I use ^(0[123]+0). This should match the first part of these strings:


    
    3123123
    123123031230
    etc.

But I want to be able to insert {gibberish} between braces into the line and have the regex allow it to disrupt the pattern, i.e. ignore the curly braces and everything inside them, but still capture the full string including the {gibberish}. So this would capture everything in bold:


    312{bar}3120123

A 0 inside the braces should not end the capture prematurely, even if the pattern up to that point is correct:


    123

EDIT: Now, I know I could do something like ^0[123]*?(?:{.*})*?[123]*?0, maybe? But that only works if there is a single set of braces, and I have to duplicate my [123] pattern. As that [123] pattern gets more complex, having it appear more than once in the regex quickly becomes incomprehensible. Something like "the best regex trick" seemed promising, but I couldn't figure out how to apply it here. Crazy lookarounds seem like the only way right now, but I would hope there's a cleaner way.

Since you've specified that you want the whole match including the garbage, you can use ^0([123]+(?:{[^}]*}[123]*)*)0 and use $1 to get the part between the 0s. To get everything that matched, use the full match: match[0] from RegExp#exec, or $& in a replacement string (JavaScript has no $0).

https://regex101.com/r/iFSabs/3

Here's the rundown on how the regex works:

  • ^ anchors the match to start at the beginning of the line
  • 0 matches a literal zero character
  • ([123]+(?:{[^}]*}[123]*)*) is a capturing group that captures everything inside of it.
    • [123]+ matches one or more instances of 1 , 2 , or 3
    • (?:{[^}]*}[123]*)* is a non-capturing group, i.e. it is part of the match but does not get its own $n for use in a replacement or in the match result.
      • {[^}]*} matches a literal { followed by any number of non } characters followed by }
      • [123]* matches zero or more instances of 1 , 2 , or 3
      • Then this whole non-capturing group can be matched 0 or more times.
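To make the difference between the whole match and the capturing group concrete, here is a minimal sketch. The input string is made up for illustration and is not from the question; the trailing 123 is deliberately left outside the match.

```javascript
const regex = /^0([123]+(?:{[^}]*}[123]*)*)0/;

// Illustrative input (an assumption, not from the question).
const input = '012{foo 0}31{bar}20123';
const match = regex.exec(input);

const whole = match[0]; // everything matched, including both zeros
const inner = match[1]; // capturing group 1: the part between the zeros

// In a replacement string, "$&" is the whole match and "$1" is group 1.
const replaced = input.replace(regex, '[$&]');

console.log(whole);    // 012{foo 0}31{bar}20
console.log(inner);    // 12{foo 0}31{bar}2
console.log(replaced); // [012{foo 0}31{bar}20]123
```

Note that the 0 inside {foo 0} does not end the match, because [^}]* consumes everything up to the closing brace.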

The technique behind this regex is known as unrolling the loop. http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop gives a good description of it (quoted here with a few typo fixes):

The unrolling the loop technique is based on the hypothesis that in most case[s], you [know] in a [repeated] alternation, which case should be the most usual and which one is exceptional. We will [call] the first one the normal case and the second one the special case. The general syntax of the unrolling the loop technique could then be written as:

normal* ( special normal* )*

Which could [mean] something like, match the normal case, if you find a special case, [match] it [then] match the normal case again. [You'll] notice that part of this syntax could [potentially] lead to a super-linear match.
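Since the question worries about repeating a growing [123] pattern by hand, one possible way to keep it in a single place is to assemble the unrolled regex from two fragments. This is only a sketch: buildUnrolled is a hypothetical helper name, and it follows this answer's variant (normal+ first, so at least one pattern character is required) rather than the generic normal* form quoted above.

```javascript
// Hypothetical helper: builds  ^0( normal+ (?: special normal* )* )0
// from two regex source fragments, so "normal" is written only once.
function buildUnrolled(normal, special) {
  return new RegExp(`^0((?:${normal})+(?:${special}(?:${normal})*)*)0`);
}

const normal = '[123]';        // the "normal case": allowed pattern characters
const special = '\\{[^}]*\\}'; // the "special case": a {braced} section

const unrolled = buildUnrolled(normal, special);

console.log(unrolled.test('01231{anything 0 here}2130')); // true
console.log(unrolled.test('0123{unclosed 2130'));         // false
```

If the normal pattern later becomes more complex, only the normal string has to change.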

Example using RegExp#test and RegExp#exec:

const strings = [
  '0213123123130',
  '012312312312303123123',
  '01231230123123031230',
  '01213123123123{21310030123012301}31231230123',
  '01212121{hello 0}121312',
  '012321212211231{whatever 3 gArBaGe? I want.}1212313123120123',
  '012321212211231{whatever 3 gArBaGe? I want.}121231{extra garbage}3123120123',
];
const regex = /^0([123]+(?:{[^}]*}[123]*)*)0/;

console.log('tests');
console.log(strings.map(string => `'${string}': ${regex.test(string)}`));

console.log('matches');
// match[1] is the captured part between the two zeros
const matches = strings
  .map((string) => regex.exec(string))
  .map((match) => (match ? match[1] : undefined));
console.log(matches);

Robo Robok's answer is the approach I'd go with if you only want to keep the non-braced part, although I'd use a slightly different regex ( {[^}]*} ) to strip the braces, for a bit more performance.

How about the other way around? Checking the string with curly tags removed:

const string = '012321212211231{whatever 3 gArBaGe? I want.}1212313123120123{foo}123';
const stringWithoutTags = string.replace(/\{.*?\}/g, '');

const result = /^(0[123]+0)/.test(stringWithoutTags);
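As a rough sketch, the same strip-then-test idea can be wrapped in a small function. The function name is hypothetical, and the stripping regex here uses the negated character class [^}]* instead of the lazy .*? (an assumption about preference, not a requirement).

```javascript
// Hypothetical helper: removes every {...} section, then checks the plain pattern.
function startsWithPatternIgnoringBraces(text) {
  const withoutTags = text.replace(/\{[^}]*\}/g, '');
  return /^0[123]+0/.test(withoutTags);
}

console.log(startsWithPatternIgnoringBraces('012{foo 0}3120123')); // true
console.log(startsWithPatternIgnoringBraces('0{12}9120'));         // false
```

This only reports whether the pattern holds; it does not return the matched span of the original string with the braced sections still in place.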

You say you need to capture everything, including the gibberish, so I think a simple pattern like this should work:

^(0(?:[123]|{.+?})+0)

That allows a string starting with 0, and then any of your pattern characters (1, 2, or 3), or one of the { gibberish } sections, and allows that to repeat to handle multiple gibberish sections, and finally it must end with a 0.

https://regex101.com/r/K4teGY/2
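One thing worth checking with the lazy {.+?} is backtracking: when the rest of the pattern fails, .+? is allowed to grow past the first } and swallow characters that sit between two braced sections. The sketch below compares it with a negated character class; the input string is made up for illustration.

```javascript
const lazy = /^(0(?:[123]|{.+?})+0)/;
const negated = /^(0(?:[123]|{[^}]+})+0)/;

// Made-up input: the "9" between the two braced sections is not an allowed character.
const input = '01{a}9{b}10';

console.log(lazy.test(input));    // true: {.+?} stretches to cover "{a}9{b}"
console.log(negated.test(input)); // false: [^}]+ cannot cross the first "}"
```

If characters between braced sections must always follow the pattern, the negated class is the safer choice.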

You might use

^0[123]*(?:{[^{}]*}[123]*)*0
  • ^ Start of string
  • 0 Match a zero
  • [123]* Match 0+ times either 1, 2 or 3
  • (?: Non capture group
    • {[^{}]*}[123]* Match from an opening { to the closing }, followed by 0+ times either 1, 2 or 3
  • )* Close group and repeat 0+ times
  • 0 Match a zero
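A quick sketch of how this pattern behaves on a few made-up inputs. Because it uses [123]* rather than [123]+, it also accepts strings with no 1, 2, or 3 at all, and because of [^{}]* it rejects nested braces.

```javascript
const regex = /^0[123]*(?:{[^{}]*}[123]*)*0/;

// Illustrative inputs (assumptions, not from the question).
const samples = ['00', '0{only braces}0', '0123{a}{b}3210', '0123{nested {x}}0'];
const results = samples.map((s) => regex.test(s));

console.log(results); // [ true, true, true, false ]
```

If at least one pattern character is required, the first [123]* can be changed to [123]+.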

