How to handle nested parentheses with regex?

Question

I came up with a regex string that parses the given text into 3 categories:

in parentheses
in brackets
neither.

Like this:

\[.+?\]|\(.+?\)|[\w+ ?]+

My intention is to split on the outermost delimiters only. So, given a(b[c]d)e, the split should be:

a || (b[c]d) || e

It works fine given parentheses inside brackets, or brackets inside parentheses, but breaks down when there are brackets inside brackets or parentheses inside parentheses. For example, a[b[c]d]e is split as

a || [b[c] || d || ] || e.

Is there any way to handle this using regex alone, without resorting to code that counts the number of open/closed parentheses? Thanks!

Answer 1

Standard¹ regular expressions are not sophisticated enough to match nested structures like that. The best way to approach this is probably to traverse the string and keep track of opening/closing bracket pairs.

¹ I said standard, but not all regular expression engines are standard. You might be able to do this with Perl, for instance, by using recursive regular expressions. For example:

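use strict;
use warnings;

my $str = "[hello [world]] abc [123] [xyz jkl]";

# (?R) recurses into the whole pattern, so a bracketed group may contain further groups.
my @matches = $str =~ /[^\[\]\s]+ | \[ (?: (?R) | [^\[\]]+ )+ \] /gx;

foreach (@matches) {
    print "$_\n";
}

This prints:

[hello [world]]
abc
[123]
[xyz jkl]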

EDIT: I see you're using Python; check out pyparsing.
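For reference, the traversal suggested at the top of this answer might look roughly like the sketch below in plain Python. It is only an illustration, not the only way to do it: the function name split_outermost is made up here, and it assumes the input uses just ( ) and [ ] as delimiters and raises an error on unbalanced input.

```python
def split_outermost(text):
    """Split text into outermost bracketed groups and the plain runs between them.

    Hypothetical helper: assumes only ( ) and [ ] act as delimiters.
    """
    pairs = {"(": ")", "[": "]"}
    closers = set(pairs.values())
    parts = []   # finished pieces, in order
    stack = []   # closing characters still expected, innermost last
    start = 0    # index where the current piece began

    for i, ch in enumerate(text):
        if ch in pairs:
            if not stack and i > start:
                # A plain run ends where an outermost opener begins.
                parts.append(text[start:i])
                start = i
            stack.append(pairs[ch])
        elif ch in closers:
            if not stack or stack.pop() != ch:
                raise ValueError(f"unbalanced {ch!r} at index {i}")
            if not stack:
                # An outermost group has just closed.
                parts.append(text[start:i + 1])
                start = i + 1

    if stack:
        raise ValueError("unclosed bracket")
    if start < len(text):
        parts.append(text[start:])
    return parts


print(split_outermost("a(b[c]d)e"))
print(split_outermost("a[b[c]d]e"))
```

Because the stack remembers which closer is expected, this also rejects mismatched pairs such as "(a]", which a pure depth counter would not notice.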

Answer 2

Well, once you abandon the idea that parsing nested expressions must work at unlimited depth, you can use regular expressions just fine by specifying a maximum depth in advance. Here is how:

def nested_matcher (n):
    # poor man's matched paren scanning, gives up after n+1 levels.
    # Matches any string with balanced parens or brackets inside; add
    # the outer parens yourself if needed.  Nongreedy.  Does not
    # distinguish parens and brackets as that would cause the
    # expression to grow exponentially rather than linearly in size.
    return "[^][()]*?(?:[([]"*n+"[^][()]*?"+"[])][^][()]*?)*?"*n

import re

p = re.compile('[^][()]+|[([]' + nested_matcher(10) + '[])]')
# print() calls so the example runs on Python 3
print(p.findall('a(b[c]d)e'))
print(p.findall('a[b[c]d]e'))
print(p.findall('[hello [world]] abc [123] [xyz jkl]'))

This will output

['a', '(b[c]d)', 'e']
['a', '[b[c]d]', 'e']
['[hello [world]]', ' abc ', '[123]', ' ', '[xyz jkl]']
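One thing worth checking is what happens when the input is nested more deeply than the chosen limit. The sketch below reuses the nested_matcher construction from this answer; the helper compile_for and the sample string are made up for illustration. As far as I can tell, the pattern does not raise an error in that case, it just fails to match the outer pair.

```python
import re


def nested_matcher(n):
    # Same construction as above: balanced parens/brackets, up to n levels.
    return "[^][()]*?(?:[([]" * n + "[^][()]*?" + "[])][^][()]*?)*?" * n


def compile_for(n):
    # Hypothetical convenience wrapper: the outer pair adds one more level,
    # so the compiled pattern handles n + 1 levels in total.
    return re.compile('[^][()]+|[([]' + nested_matcher(n) + '[])]')


sample = '[a[b[c]]]'       # three levels deep
shallow = compile_for(1)   # handles 2 levels
deep = compile_for(2)      # handles 3 levels

# Too shallow: the whole string is not recognised as one group ...
print(shallow.fullmatch(sample) is None)      # True
# ... and findall silently skips the outermost brackets.
print(shallow.findall(sample))                # ['a', '[b[c]]']
# Deep enough: the whole string is a single outermost group.
print(deep.findall(sample))                   # ['[a[b[c]]]']
```

So if the maximum depth of the input is not known in advance, it is probably safer to pick n generously, as the value 10 above does.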

2 answers

solution1
10 ACCEPTED 2013-06-29 20:44:02

solution2
1 2014-04-02 12:43:27
