I came up with a regex string that parses the given text into 3 categories:
Like this:
\[.+?\]|\(.+?\)|[\w+ ?]+
My intention is to use the outermost operator only. So, given a(b[c]d)e
, the split is going to be:
a || (b[c]d) || e
It works fine given parentheses inside brackets, or brackets inside parentheses, but breaks down when there are brackets inside brackets and parentheses inside parentheses. For example, a[b[c]d]e
is split as
a || [b[c] || d || ] || e.
Is there any way to handle this using regex alone, not resorting to using code to count number of open/closed parentheses? Thanks!
Standard 1 regular expressions are not sophisticated enough to match nested structures like that. The best way to approach this is probably to traverse the string and keep track of opening / closing bracket pairs.
1 I said standard , but not all regular expression engines are indeed standard. You might be able to this with Perl, for instance, by using recursive regular expressions. For example:
$str = "[hello [world]] abc [123] [xyz jkl]";
my @matches = $str =~ /[^\[\]\s]+ | \[ (?: (?R) | [^\[\]]+ )+ \] /gx;
foreach (@matches) {
print "$_\n";
}
[hello [world]] abc [123] [xyz jkl]
EDIT: I see you're using Python; check out pyparsing
.
Well, once you abandon the idea that parsing nested expressions should work at unlimited depth, one can use regular expressions just fine by specifying a maximum depth in advance. Here is how:
def nested_matcher (n):
# poor man's matched paren scanning, gives up after n+1 levels.
# Matches any string with balanced parens or brackets inside; add
# the outer parens yourself if needed. Nongreedy. Does not
# distinguish parens and brackets as that would cause the
# expression to grow exponentially rather than linearly in size.
return "[^][()]*?(?:[([]"*n+"[^][()]*?"+"[])][^][()]*?)*?"*n
import re
p = re.compile('[^][()]+|[([]' + nested_matcher(10) + '[])]')
print p.findall('a(b[c]d)e')
print p.findall('a[b[c]d]e')
print p.findall('[hello [world]] abc [123] [xyz jkl]')
This will output
['a', '(b[c]d)', 'e']
['a', '[b[c]d]', 'e']
['[hello [world]]', ' abc ', '[123]', ' ', '[xyz jkl]']
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.