Regex match with lookbehind and lookahead with named groups

I'm trying to capture the parts before and after "matches" into named groups, given the following text:

"abc" matches "b" and field[cba] = "cba" or (field[cba] matches "c") and "cc" = "bb"

I need to match "abc" as ${left} and "b" as ${right}, and then field[cba] / "c" on the second match.

I need to bound ${left} and ${right} as follows:

Left:

  • should be preceded by any of: " and ", " or ", "(" when not inside double quotes (")
  • if none of those is present then it could be the start of the string

Right:

  • should be followed by any of: " and ", " or ", ")" when not inside double quotes (")
  • if none of those is present then it could be the end of the string

The replacement regex pattern I would like to use is:

RegExpMatch(${left}, ${right})

The expected output is:

RegExpMatch("abc","b") and field[cba] = "cba" or (RegExpMatch(field[cba],"c")) and "cc" = "bb"

I tried with:

(?<=^|\(| or | and )(?<left>.*?) matches (?<right>.*?)(?=\)|$| and | or )

This has a couple of issues:

  • using ^ for start of string seems to make the lookbehind greedy: it captures from the start of the string even if there is an " or " or " and " before the left part, which is odd because $ seems to work fine
  • I don't know how to make " or ", " and ", "(" and ")" count as bounds only when they are not inside quotes (in a literal)

Can you please help me figure out the correct regular expression pattern to apply?

The problem is that the lookbehind is satisfied by " and ", and then .*? consumes everything up to " matches", that is: field[cba] = "cba" or (field[cba]. You need a stricter definition of left/right; it can't just be "any character".

(?<=^|\(| or | and )(?<left>\S+) matches (?<right>\S+?)(?=\)|$| and | or )

I changed .*? to \S+, which matches anything but whitespace (roughly [^\r\n\t\f ]). Now it won't consume all the unnecessary characters in the left/right capture groups. \S+ may not be the right definition for you, but it should get you started.

Demo: Regex101
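
To see the difference on the sample text from the question, here is a small C# sketch (top-level statements, so C# 9 or later is assumed). The names lazyPattern, strictPattern and LeftCaptures are mine, not from the question. It only prints what the left group captures for each pattern. Running it also suggests that with \S+ the second left capture still begins with "(", because the position just before the parenthesis is itself preceded by " or ". If that matters for your data, the first character of left probably needs to exclude "(" as well.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question.
string input = @"""abc"" matches ""b"" and field[cba] = ""cba"" or (field[cba] matches ""c"") and ""cc"" = ""bb""";

// Original attempt (.*?) and the stricter variant (\S+).
string lazyPattern   = @"(?<=^|\(| or | and )(?<left>.*?) matches (?<right>.*?)(?=\)|$| and | or )";
string strictPattern = @"(?<=^|\(| or | and )(?<left>\S+) matches (?<right>\S+?)(?=\)|$| and | or )";

// Collect the text captured by the "left" group for every match of a pattern.
string[] LeftCaptures(string pattern) =>
    Regex.Matches(input, pattern)
         .Cast<Match>()
         .Select(m => m.Groups["left"].Value)
         .ToArray();

string[] lazyLefts = LeftCaptures(lazyPattern);
string[] strictLefts = LeftCaptures(strictPattern);

Console.WriteLine("lazy:   " + string.Join(" | ", lazyLefts));
// lazy:   "abc" | field[cba] = "cba" or (field[cba]
Console.WriteLine("strict: " + string.Join(" | ", strictLefts));
// strict: "abc" | (field[cba]
```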

I'm not entirely sure what your data looks like, but I suggest this regex, which is independent of the bounds:

(?:(?<left>"[^"]*")|\b(?<left>\S*)) matches (?:(?<right>"[^"]*")|(?<right>\S*[^)\s]))

I'm exploiting the fact that C# allows captures with the same name here. The left and right parts are almost the same.

(?:            => Non-capture group
  (?<left>     => Left capture begin
    "[^"]*"    => Double quotes, non-quote characters then double quotes
  )            => End left capture 
|              => OR
  \b           => Word boundary
  (?<left>     => Begin other left capture if first failed
    \S*        => Capture non-space characters (if your parts break on multiple lines, you can use [^"]* instead)
  )            => End left capture
)              => End non-capture group

regex101 demo (I changed the named captures because PCRE doesn't support same name capture groups)
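
As a rough check of how this behaves with Regex.Replace in .NET, here is a minimal sketch against the sample text from the question (top-level statements, C# 9 or later assumed). The replacement string is written without a space after the comma, purely so that the result can be compared with the expected output in the question. Both (?<left>...) alternatives feed the same group, so ${left} picks up whichever one took part in the match.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input from the question.
string input = @"""abc"" matches ""b"" and field[cba] = ""cba"" or (field[cba] matches ""c"") and ""cc"" = ""bb""";

// Same pattern as above; quotes are doubled because of the verbatim string.
string pattern = @"(?:(?<left>""[^""]*"")|\b(?<left>\S*)) matches (?:(?<right>""[^""]*"")|(?<right>\S*[^)\s]))";

// ${left} and ${right} refer to the named groups, whichever alternative captured them.
string output = Regex.Replace(input, pattern, "RegExpMatch(${left},${right})");

Console.WriteLine(output);
// RegExpMatch("abc","b") and field[cba] = "cba" or (RegExpMatch(field[cba],"c")) and "cc" = "bb"
```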

If the word boundary is causing problems (e.g. when you have a part that doesn't start with " or a \w character), you might use the following regex instead:

(?:(?<left>"[^"]*")|\s\(?(?<left>\S*)) matches (?:(?<right>"[^"]*")|(?<right>\S*[^)\s]))

This uses \s\(? instead of \b.
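
To illustrate the difference, here is a small sketch with a made-up input in which the left part starts with "$" (the input string and the variable names are assumptions of mine, not from the question). With \b the capture loses the leading "$"; with \s\(? it is kept. Note that in the second case the overall match also includes the leading whitespace (and the optional parenthesis), which matters if the pattern is passed to Regex.Replace, and that \s requires some whitespace before an unquoted left part.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: the left part starts with "$", which is neither a quote nor a \w character.
string input = @"x = 1 and $var matches ""a""";

string boundaryPattern   = @"(?:(?<left>""[^""]*"")|\b(?<left>\S*)) matches (?:(?<right>""[^""]*"")|(?<right>\S*[^)\s]))";
string whitespacePattern = @"(?:(?<left>""[^""]*"")|\s\(?(?<left>\S*)) matches (?:(?<right>""[^""]*"")|(?<right>\S*[^)\s]))";

Match byBoundary = Regex.Match(input, boundaryPattern);
Match byWhitespace = Regex.Match(input, whitespacePattern);

Console.WriteLine(byBoundary.Groups["left"].Value);   // var
Console.WriteLine(byWhitespace.Groups["left"].Value); // $var
Console.WriteLine("[" + byWhitespace.Value + "]");    // [ $var matches "a"]
```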


If you want to stick to the bounds you mentioned, you will have to know what exactly can be in the parts or what cannot. For instance, if

field["abc"] in field matches field["cba"] in field

is valid and the parts are field["abc"] in field and field["cba"] in field respectively, then it's another complication.
