
Recursive regex for matching everything in parenthesis (PCRE)

I am surprised that I could not easily find a similar question with an answer on SO. I would like to match everything inside certain function calls. The idea is to remove the functions that are useless.

foo(some (content)) --> some (content)

So I am trying to match everything inside the function call, which can itself include parentheses. Here is my PCRE regex:

(?<name>\w+)\s*\(\K
(?<e>
     [^()]+
     |
     [^()]*
         \((?&e)\)
     [^()]*
)*
(?=\))

https://regex101.com/r/gfMAIM/1

Unfortunately it doesn't work and I don't really understand why.

Your Group e pattern does not do the right job. The * quantifier sits outside the group, so each recursion into (?&e) runs the group pattern only once and can match at most one nested (...) substring per level. It needs to match as many (...) substrings as are present, so the subroutine call must sit inside a group quantified with * or +. The pattern can even be "simplified" to (?<e>[^()]*(?:\((?&e)\)[^()]*)*) .

Note that the pattern you were aiming for is equivalent to (?<e>[^()]+|\((?&e)\))* . The [^()]* around \((?&e)\) are redundant, since the [^()]+ alternative already consumes the chars other than ( and ) on the current depth level.

Also, you quantified the Group e pattern making it a repeated capturing group that only keeps the text matched during the last iteration.
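The "last iteration only" behaviour of a repeated capturing group is not specific to PCRE. A minimal sketch using Python's built-in re module (chosen here only for illustration; the pattern and input are made up and are not from the question):

```python
import re

# A quantified capturing group: the group is repeated, but only one
# capture slot exists, so each iteration overwrites the previous one.
m = re.fullmatch(r"(?P<e>[a-z]+,?)*", "ab,cd,ef")

print(m.group(0))    # whole match: ab,cd,ef
print(m.group("e"))  # text of the last iteration only: ef
```

To keep the whole text, the repetition has to happen inside the capturing group rather than around it.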

You may use

(?<name>\w+)\s*\(\K(?<e>[^()]*(?:\((?&e)\)[^()]*)*)(?=\))

See the regex demo

Details

  • (?<name>\w+)\s*\(\K - 1+ word chars, 0+ whitespaces and ( that are omitted from the match
  • (?<e> - start of Group e
    • [^()]* - 0+ chars other than ( and )
    • (?: - start of a non-capturing group:
      • \( - a ( char
      • (?&e) - Group e pattern recursed
      • \) - a ) char
      • [^()]* - 0+ chars other than ( and )
    • )* - 0 or more repetitions
  • ) - end of Group e
  • (?=\)) - a ) must be immediately to the right of the current location.
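For readers without a PCRE engine at hand, the same "balanced contents" idea can be sketched in Python. Python's built-in re module supports neither (?&e) recursion nor \K, so this sketch swaps the recursive pattern for an explicit depth counter. The helper name strip_call is hypothetical and only serves the illustration:

```python
import re

def strip_call(text: str, name: str) -> str:
    """Replace the first `name(...)` in text with its balanced contents.

    Hypothetical helper: a depth counter stands in for the regex recursion.
    """
    start = re.search(r"\b" + re.escape(name) + r"\s*\(", text)
    if start is None:
        return text
    depth = 1              # we are just past the opening "("
    i = start.end()
    while i < len(text) and depth:
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
        i += 1
    if depth:              # unbalanced input: leave it untouched
        return text
    inner = text[start.end():i - 1]
    return text[:start.start()] + inner + text[i:]

print(strip_call("foo(some (content))", "foo"))  # some (content)
```

Like the regex above, this sketch ignores quoted strings, so a ) inside a string literal is counted as a closing parenthesis.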

The following regex does the matching without taking extra steps:

(?<name>\w+)\s*(\((?<e>([^()]*+|(?2))+)\))

See live demo here

But it doesn't match the following strings, which contain unbalanced parentheses inside a quoted string:

  • foo(bar = ')')
  • foo(bar(john = "(Doe..."))

So what you should look for is:

(?<name>\w+)\s*(\((?<e>([^()'"]*+|"(?>[^"\\]*+|\\.)*"|'(?>[^'\\]*+|\\.)*'|(?2))+)\))

See live demo here

Regex breakdown:

  • (?<name>\w+)\s* Match the function name and trailing spaces
  • ( Start of capturing group #2
    • \( Match a literal (
    • (?<e> Start of named capturing group e
      • ( Start of an inner capturing group
        • [^()'"]*+ Match anything except ()'"
        • | Or
        • "(?>[^"\\]*+|\\.)*" Match a double-quoted string, allowing backslash escapes
        • | Or
        • '(?>[^'\\]*+|\\.)*' Match a single-quoted string, allowing backslash escapes
        • | Or
        • (?2) Recurse capturing group #2
      • )+ Repeat as many times as possible, at least once
    • ) End of named capturing group e
    • \) Match ) literally
  • ) End of capturing group #2
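The quote-aware idea can also be sketched in Python. Since the built-in re module has no recursion, the sketch below tokenizes with a flat regex that mirrors the alternatives in the breakdown above (quoted strings, parentheses, other characters) and tracks depth with a counter. The names TOKEN and call_arguments are hypothetical, and the atomic groups and possessive quantifiers are replaced with plain groups for portability:

```python
import re

# One token per step: a quoted string (with backslash escapes), a single
# parenthesis, a run of other characters, or any stray character.
TOKEN = re.compile(
    r'''"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'|[()]|[^()'"]+|.''',
    re.S,
)

def call_arguments(text, name):
    """Return the balanced argument text of the first `name(...)`, or None."""
    head = re.search(r"\b" + re.escape(name) + r"\s*\(", text)
    if head is None:
        return None
    depth = 1
    for tok in TOKEN.finditer(text, head.end()):
        if tok.group() == "(":
            depth += 1
        elif tok.group() == ")":
            depth -= 1
            if depth == 0:
                return text[head.end():tok.start()]
    return None  # never balanced

print(call_arguments("foo(bar = ')')", "foo"))              # bar = ')'
print(call_arguments('foo(bar(john = "(Doe..."))', "foo"))  # bar(john = "(Doe...")
```

Parentheses inside a quoted string are consumed as part of the string token, so they never reach the depth counter.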

I have a simple regex without recursion.

(?<=[\w ]{2}\().*(?=\))

As written it copes with unbalanced parentheses, but it does not handle multiple function calls on one line. That can be handled if you know the delimiter between the calls, e.g. ; if this is Java code.

Variant 2 (updated for multiple functions on one line):

(?<=[\w ]\()[^;\n]*(?=\))

Variant 3 (allowing ; in strings):

(?<=[\w ]\()([^;\n]|".*?")*(?=\))    

Variant 4 (escaping strings):

(?<=[\w \n]\()(?:[^;\n"]|(?:"(?:[^"]|\\")*?(?<!\\)"))*(?=\))
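These variants use only fixed-width lookbehinds, so they can be tried with Python's built-in re module. A small sketch comparing them; the sample inputs and the helper name matches are made up for illustration:

```python
import re

# The four variants, copied verbatim from above.
V1 = re.compile(r'(?<=[\w ]{2}\().*(?=\))')
V2 = re.compile(r'(?<=[\w ]\()[^;\n]*(?=\))')
V3 = re.compile(r'(?<=[\w ]\()([^;\n]|".*?")*(?=\))')
V4 = re.compile(r'(?<=[\w \n]\()(?:[^;\n"]|(?:"(?:[^"]|\\")*?(?<!\\)"))*(?=\))')

def matches(pattern, text):
    """Full text of every match (group 0), ignoring capture groups."""
    return [m.group(0) for m in pattern.finditer(text)]

nested = "foo(some (content))"
two_calls = "foo(a); bar(b)"
semicolon = 'foo("a;b")'

print(matches(V1, nested))     # ['some (content)']
print(matches(V1, two_calls))  # ['a); bar(b']  (greedy .* spans both calls)
print(matches(V2, two_calls))  # ['a', 'b']
print(matches(V2, semicolon))  # []  (the ; inside the string blocks the match)
print(matches(V3, semicolon))  # ['"a;b"']
print(matches(V4, semicolon))  # ['"a;b"']
```

None of the variants counts parentheses, so they rely on the ; or newline delimiter rather than on balanced nesting.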
