
Error in substituting '(' with regex in Python

I have the following string:

s = r'aaa (bbb (ccc)) ddd'

and I would like to find and replace the innermost nested parentheses with {}. Desired output:

s = r'aaa (bbb {ccc}) ddd'

Let's start with the nested (. I use the following regex to find the nested parenthesis, which works pretty well:

match = re.search(r'\([^\)]+(\()', s)
print(match.group(1))
(

Then I try to make the substitution:

re.sub(match.group(1), r'\{', s)

but I get the following error:

error: missing ), unterminated subpattern at position 0

I really don't understand what's wrong.

You can use

import re
s = r'aaa (bbb (ccc)) ddd'
print( re.sub(r'\(([^()]*)\)', r'{\1}', s) )
# => aaa (bbb {ccc}) ddd

See the Python demo.

Details:

  • \( - a ( char
  • ([^()]*) - Group 1 ( \1 ): any zero or more chars other than ( and )
  • \) - a ) char.

The replacement is a Group 1 value wrapped with curly braces.
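As a quick check of how this pattern behaves on other inputs, here is a minimal sketch. The names INNERMOST and brace_innermost are hypothetical and only used for illustration. Because the pattern matches every parenthesized group that contains no other parentheses, it also rewrites groups that are not nested inside anything.

```python
import re

# Matches a '(' ... ')' pair whose contents contain no parentheses.
INNERMOST = re.compile(r'\(([^()]*)\)')


def brace_innermost(text):
    """Wrap the contents of every innermost group in curly braces."""
    return INNERMOST.sub(r'{\1}', text)


print(brace_innermost('aaa (bbb (ccc)) ddd'))    # aaa (bbb {ccc}) ddd
print(brace_innermost('(aaa) (bbb (ccc)) ddd'))  # {aaa} (bbb {ccc}) ddd
```

If top-level groups such as (aaa) must stay untouched, this pattern alone is not enough.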

Based on your samples and attempts, please try the following code, written and tested in Python 3.x. An online demo is also available for the regex used in the code.

import re
var = r'aaa (bbb (ccc)) ddd'
print( re.sub(r'(^.*?\([^(]*)\(([^)]*)\)(.*)', r'\1{\2}\3', var) )

Output for the sample shown:

aaa (bbb {ccc}) ddd

Explanation of Python code:

  • The code uses Python's re library for the regex.
  • It creates a variable named var holding the value aaa (bbb (ccc)) ddd.
  • It then prints the value returned by re.sub, which performs the substitution that produces the required output.

Explanation of the re.sub call: the regex (^.*?\([^(]*)\(([^)]*)\)(.*) (explained below) creates three capturing groups. The 1st group captures everything before the ( that precedes ccc, the 2nd group holds ccc, and the 3rd group holds the rest of the string. The replacement \1{\2}\3 puts the groups back together with ccc wrapped in {..}.

Explanation of regex:

(^.*?\([^(]*)  ##1st capturing group: a lazy match from the start of the string to the first occurrence of (,
               ##followed by everything before the next occurrence of (
\(             ##Literal ( matched outside any capturing group, since we do NOT want it in the output.
([^)]*)        ##2nd capturing group: everything before the next occurrence of ).
\)             ##Literal ) matched outside any capturing group, since we do NOT want it in the output.
(.*)           ##3rd capturing group: the rest of the string.
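The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE. This is only a sketch of that idea; the names PATTERN and brace_second_group are hypothetical and do not come from the answer.

```python
import re

# Same pattern as above, written with re.VERBOSE so that whitespace
# and '#' comments inside the pattern are ignored by the regex engine.
PATTERN = re.compile(r'''
    (^.*?\([^(]*)   # group 1: start of string through the first "(" and the text after it
    \(              # the next "(" (not captured)
    ([^)]*)         # group 2: text up to the next ")"
    \)              # that ")" (not captured)
    (.*)            # group 3: rest of the string
''', re.VERBOSE)


def brace_second_group(text):
    """Re-emit the three groups with group 2 wrapped in curly braces."""
    return PATTERN.sub(r'\1{\2}\3', text)


print(brace_second_group('aaa (bbb (ccc)) ddd'))  # aaa (bbb {ccc}) ddd
```

Note that group 1 always starts from the first ( in the string, so the pattern fits the sample shown but not every input: '(aaa) (bbb (ccc)) ddd' comes out as '(aaa) {bbb (ccc}) ddd'.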

The first argument to re.sub is compiled as a regular expression, not used as literal text:

sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl. repl can be either a string or a callable; if a string, backslash escapes in it are processed. If it is a callable, it's passed the Match object and must return a replacement string to be used.

The pattern comes first, and because you've given it match.group(1), re.sub sees '(' as the pattern, which is an unmatched, unescaped parenthesis.

I think what you are after is something like:

re.sub(r'\([^\)]+(\()', r'\1{', s)
'aaa ({ccc)) ddd'
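To make the cause of the error concrete, here is a small sketch that reproduces it and then escapes the captured text with re.escape. It only illustrates the failure: escaping makes the call valid, but it then replaces every ( in the string, so it is not a solution on its own. The variable names message and escaped are mine.

```python
import re

s = r'aaa (bbb (ccc)) ddd'
match = re.search(r'\([^\)]+(\()', s)

# The captured text is a bare '(' and cannot be compiled as a pattern.
try:
    re.sub(match.group(1), '{', s)
except re.error as exc:
    message = str(exc)
print(message)

# re.escape turns '(' into a pattern that matches a literal '(',
# but the substitution then hits every '(' in the string.
escaped = re.sub(re.escape(match.group(1)), '{', s)
print(escaped)  # aaa {bbb {ccc)) ddd
```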

As indicated by your reply to my comment on the question, the following example strings are to be transformed as indicated:

'(aaa) (bbb (ccc)) ddd'                => '(aaa) (bbb {ccc}) ddd'
'(aaa (eee)) (bbb ccc) ddd'            => '(aaa {eee}) (bbb ccc) ddd'
'(aaa) (ee (ff (gg))) (bbb (ccc)) ddd' => '(aaa) (ee (ff {gg})) (bbb (ccc)) ddd'

We cannot obtain those results with a single regular expression but we can do so by executing a sequence of the regular expressions

r'\(([^()]*)\)(?=(?:[^()]*\)){n})'

for n = 0, 1, ... and substituting matches with

r'{\1}'

If n = N is the smallest value of n for which there is no match, the desired substitution is given by the string produced by n = N-1.
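The procedure can be sketched as a loop that builds the pattern for each n and keeps the last result that had at least one match. This is a minimal sketch assuming balanced parentheses; the function name brace_deepest is hypothetical.

```python
import re


def brace_deepest(text):
    """Try n = 0, 1, ... and return the result for the last n that matched."""
    result = text  # returned unchanged if even n = 0 finds nothing
    n = 0
    while True:
        # An innermost group followed by at least n ')' before any '('.
        pattern = r'\(([^()]*)\)(?=(?:[^()]*\)){%d})' % n
        candidate, count = re.subn(pattern, r'{\1}', text)
        if count == 0:
            return result
        result = candidate
        n += 1


print(brace_deepest('(aaa) (bbb (ccc)) ddd'))      # (aaa) (bbb {ccc}) ddd
print(brace_deepest('(aaa (eee)) (bbb ccc) ddd'))  # (aaa {eee}) (bbb ccc) ddd
```

The loop always ends, because n eventually exceeds the number of ) characters in the string and the lookahead can no longer succeed.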


I have assumed the string has balanced parentheses.

The strings 'a(b X c)' and 'a(b(c(d X e)f)g)' have balanced parentheses; 'a(b(c(d X e)fg)' and 'a(b))cd X ((efg)' do not.

The nesting level of any character in a balanced string equals the number of left parentheses before it that have not yet been closed (equivalently, the number of right parentheses after it whose left parenthesis comes before it). For an innermost group, the regular expression above estimates this by counting the right parentheses that follow the group before the next left parenthesis. The nesting levels of 'X' in the following strings are as shown:

String          Nesting level
_____________________________
a X b                 0
a(b X c)              1
a(b(c X d)e)f         2
a(b(c(d X e)f)(g))    3
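The table, and the balanced and unbalanced examples before it, can be checked with a running depth counter. Note that this is a different technique from the lookahead counting used in the regular expression, and the function names is_balanced and nesting_level are hypothetical.

```python
def is_balanced(text):
    """True if every ')' closes an earlier '(' and nothing is left open."""
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:  # a ')' with no matching '('
                return False
    return depth == 0


def nesting_level(text, marker='X'):
    """Number of '(' still open when the first marker is reached."""
    depth = 0
    for ch in text:
        if ch == marker:
            return depth
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
    raise ValueError('marker not found')


for sample in ['a X b', 'a(b X c)', 'a(b(c X d)e)f', 'a(b(c(d X e)f)(g))']:
    print(sample, nesting_level(sample))
```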

Consider the string

'(aaa) (ee (ff (gg))) (bbb (ccc)) ddd'

We first set n = 0 to obtain

r'\(([^()]*)\)(?=(?:[^()]*\)){0})'

Demo 0 shows that the substitution of matches produces the string

{aaa} (ee (ff {gg})) (bbb {ccc}) ddd

Now set n = 1 to produce the regular expression

\(([^()]*)\)(?=(?:[^()]*\)){1})

Demo 1 shows that the substitution of matches produces the string

(aaa) (ee (ff {gg})) (bbb {ccc}) ddd

Next set n = 2 to produce the regular expression

\(([^()]*)\)(?=(?:[^()]*\)){2})

Demo 2 and Python demo show that the substitution of matches produces the string

(aaa) (ee (ff {gg})) (bbb (ccc)) ddd

Next set n = 3 to produce the regular expression

\(([^()]*)\)(?=(?:[^()]*\)){3})

Demo 3 shows that there are no matches. We therefore conclude that n = 2 is the largest value of n that produces a match, so the desired string is the one produced when n = 2:

(aaa) (ee (ff {gg})) (bbb (ccc)) ddd

Demo 4 illustrates that there may be ties.
