简体   繁体   中英

Regular expression curvy brackets in Python

I have a string like this:

a = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'

Currently I am using this re statement to get desired result:

filter(None, re.split("\\{(.*?)\\}", a))

But this gives me:

['CGPoint={CGPoint=d{CGPoint=dd', '}}', 'CGSize=dd', 'dd', 'CSize=aa']

which is incorrect for my current situation, I need a list like this:

['CGPoint={CGPoint=d{CGPoint=dd}}', 'CGSize=dd', 'dd', 'CSize=aa']

As @m.buettner points out in the comments, Python's implementation of regular expressions can't match pairs of symbols nested to an arbitrary degree. (Other languages can, notably current versions of Perl.) The Pythonic thing to do when you have text that regexs can't parse is to use a recursive-descent parser.

There's no need to reinvent the wheel by writing your own, however; there are a number of easy-to-use parsing libraries out there. I recommend pyparsing which lets you define a grammar directly in your code and easily attach actions to matched tokens. Your code would look something like this:

import pyparsing

lbrace = Literal('{')
rbrace = Literal('}')  
contents = Word(printables)
expr = Forward()
expr << Combine(Suppress(lbrace) + contents + Suppress(rbrace) + expr)

for line in lines:
    results = expr.parseString(line)

There's an alternative regex module for Python I really like that supports recursive patterns: https://pypi.python.org/pypi/regex

pip install regex

Then you can use a recursive pattern in your regex as demonstrated in this script:

import regex
from pprint import pprint


thestr = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'

theregex = r'''
    (
        {
            (?<match>
                [^{}]*
                (?:
                    (?1)
                    [^{}]*
                )+
                |
                [^{}]+
            )
        }
        |
        (?<match>
            [^{}]+
        )
    )
'''

matches = regex.findall(theregex, thestr, regex.X)

print 'all matches:\n'
pprint(matches)

print '\ndesired matches:\n'
print [match[1] for match in matches]

This outputs:

all matches:

[('{CGPoint={CGPoint=d{CGPoint=dd}}}', 'CGPoint={CGPoint=d{CGPoint=dd}}'),
 ('{CGSize=dd}', 'CGSize=dd'),
 ('dd', 'dd'),
 ('{CSize=aa}', 'CSize=aa')]

desired matches:

['CGPoint={CGPoint=d{CGPoint=dd}}', 'CGSize=dd', 'dd', 'CSize=aa']

pyparsing has a nestedExpr function for matching nested expressions:

import pyparsing as pp

ident = pp.Word(pp.alphanums)
expr = pp.nestedExpr("{", "}") | ident

thestr = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
for result in expr.searchString(thestr):
    print(result)

yields

[['CGPoint=', ['CGPoint=d', ['CGPoint=dd']]]]
[['CGSize=dd']]
['dd']
[['CSize=aa']]

Here is some pseudo code. It creates a stack of strings and pops them when a close brace is encountered. Some extra logic to handle the fact that the first braces encountered are not included in the array.

String source = "{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}";
Array results;
Stack stack;

foreach (match in source.match("[{}]|[^{}]+")) {
    switch (match) {
        case '{':
           if (stack.size == 0) stack.push(new String()); // add new empty string
           else stack.push('{'); // child, so include matched brace.
        case '}':
           if (stack.size == 1) results.add(stack.pop()) // clear stack add to array
           else stack.last += stack.pop() + '}"; // pop from stack and concatenate to previous
        default:
           if (stack.size == 0) results.add(match); // loose text, add to results
           else stack.last += match;  // append to latest member.
    }
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM