Python regexp matching or tokenizing

I have a dump of a data structure that I'm trying to convert into XML. The structure contains a number of nested structures. I'm not sure how to start, because every regular expression I can think of fails on nested expressions.

For example, let's say there is a structure dump like this:

abc = (  
        bcd = (efg = 0, ghr = 5, lmn = 10), 
        ghd = 5, 
        zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))

and I want to produce output like this:

<abc>
  <bcd>
    <efg>0</efg>
    <ghr>5</ghr>
    <lmn>10</lmn>
  </bcd>
.....
</abc>

So what would be a good approach here: tokenizing the expression, a clever regex, or using a stack?

Use pyparsing.

$ cat parsing.py 
from pyparsing import nestedExpr

abc = """(  
        bcd = (efg = 0, ghr = 5, lmn 10), 
        ghd = 5, 
        zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))"""
print(nestedExpr().parseString(abc).asList())

$ python parsing.py
[['bcd', '=', ['efg', '=', '0,', 'ghr', '=', '5,', 'lmn', '10'], ',', 'ghd', '=', '5,', 'zde', '=', ['dfs', '=', '10,', 'fge', '=20,', 'dfg', '=', ['sdf', '=', '3,', 'ert', '=', '5'], ',', 'juh', '=', '0']]]

Here is an alternate answer that uses pyparsing more idiomatically. Because it provides a detailed grammar for what inputs may be seen and what results should be returned, the parsed data is not "messy." Thus toXML() needn't work as hard, nor do any real cleanup.

print("\n----- ORIGINAL -----\n")

dump = """
abc = (  
        bcd = (efg = 0, ghr = 5, lmn 10), 
        ghd = 5, 
        zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))
""".strip()

print(dump)


print("\n----- PARSED INTO LIST -----\n")

from pyparsing import Word, alphas, nums, Optional, Forward, delimitedList, Group, Suppress

def Syntax():
    """Define grammar and parser."""

    # building blocks
    name   = Word(alphas)
    number = Word(nums)
    _equals = Optional(Suppress('='))
    _lpar   = Suppress('(')
    _rpar   = Suppress(')')

    # larger constructs
    expr = Forward()
    value = number | Group( _lpar + delimitedList(expr) + _rpar )
    expr << name + _equals + value

    return expr

parsed = Syntax().parseString(dump)
print(parsed)


print("\n----- SERIALIZED INTO XML ----\n")


def toXML(part, level=0):

    xml = ""
    indent = "    " * level
    while part:
        tag     = part.pop(0)
        payload = part.pop(0)

        insides = payload if isinstance(payload, str) \
                          else "\n" + toXML(payload, level+1) + indent

        xml += "{indent}<{tag}>{insides}</{tag}>\n".format(**locals())

    return xml

print(toXML(parsed))

The input and XML output are the same as in my other answer. The data returned by parseString() is the only real change:

----- PARSED INTO LIST -----

['abc', ['bcd', ['efg', '0', 'ghr', '5', 'lmn', '10'], 'ghd', '5', 'zde',
['dfs', '10', 'fge', '20', 'dfg', ['sdf', '3', 'ert', '5'], 'juh', '0']]]
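
Because the cleaned list is just alternating names and values, it can also be handed to an XML library instead of being formatted by hand. The following is a minimal sketch, assuming the parsed result has been converted to a plain nested list (for example with asList()); the helper names build and to_element are made up for illustration and are not part of pyparsing or the answer above.

```python
import xml.etree.ElementTree as ET

def build(parent, pairs):
    """Attach alternating name/value items from `pairs` to `parent`."""
    for tag, payload in zip(pairs[0::2], pairs[1::2]):
        child = ET.SubElement(parent, tag)
        if isinstance(payload, str):
            child.text = payload      # leaf: the number, kept as text
        else:
            build(child, payload)     # nested group: recurse

def to_element(parsed):
    """Turn a two-item list like ['abc', [...]] into an Element."""
    name, payload = parsed
    root = ET.Element(name)
    if isinstance(payload, str):
        root.text = payload
    else:
        build(root, payload)
    return root

# The list printed above, written out as a plain Python list.
parsed = ['abc', ['bcd', ['efg', '0', 'ghr', '5', 'lmn', '10'], 'ghd', '5', 'zde',
          ['dfs', '10', 'fge', '20', 'dfg', ['sdf', '3', 'ert', '5'], 'juh', '0']]]

root = to_element(parsed)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

This variant produces unindented XML, but it leaves escaping and well-formedness to ElementTree and gives back a tree that can be queried, for example with root.find("zde/dfg/ert").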

I don't think regexps are the best approach here, but for those curious, it can be done like this:

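import re

def expr(m):
    # m.group(1) is the text inside one innermost pair of parentheses
    out = []
    for item in m.group(1).split(','):
        a, b = map(str.strip, item.split('='))
        out.append('<%s>%s</%s>' % (a, b, a))
    return '\n'.join(out)

# data holds the dump text, with the whole expression wrapped in parentheses
rr = r'\(([^()]*)\)'
while re.search(rr, data):
    data = re.sub(rr, expr, data)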

Basically, we repeatedly replace the innermost parentheses (those containing no other parentheses) with chunks of XML until no parentheses remain. For simplicity, I also wrapped the main expression in parentheses; if yours is not, just do data='(%s)' % data before parsing.
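
To see the loop work from the inside out, here is a self-contained sketch that runs it on the dump from the question and counts the passes. The passes counter is added purely for illustration, and split('=', 1) is a small defensive change to the loop above so that an item is only ever split at its first equals sign.

```python
import re

def expr(m):
    # Convert one flat "a = 1, b = 2" group into sibling XML elements.
    out = []
    for item in m.group(1).split(','):
        a, b = map(str.strip, item.split('=', 1))
        out.append('<%s>%s</%s>' % (a, b, a))
    return '\n'.join(out)

data = """abc = (
        bcd = (efg = 0, ghr = 5, lmn = 10),
        ghd = 5,
        zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))"""

rr = r'\(([^()]*)\)'
data = '(%s)' % data      # wrap so the top-level "abc = (...)" is converted too
passes = 0
while re.search(rr, data):
    data = re.sub(rr, expr, data)   # replaces every innermost group at once
    passes += 1

print(passes)
print(data)
```

Each pass removes one level of nesting, so the loop runs once per level, counting the added outer wrapper.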

I like Igor Chubin's "use pyparsing" answer, because in general, regexps handle nested structures very poorly (though thg435's iterative replacement solution is a clever workaround).

But once pyparsing has done its thing, you still need a routine to walk the list and emit XML. It needs to be intelligent about the imperfections of pyparsing's results. For example, fge =20, doesn't yield the ['fge', '=', '20'] you'd like, but ['fge', '=20,']. Commas are sometimes also added in places that are unhelpful. Here's how I did it:

from pyparsing import nestedExpr

dump = """
abc = (  
        bcd = (efg = 0, ghr = 5, lmn 10), 
        ghd = 5, 
        zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))
"""

dump = dump.strip()

print("\n----- ORIGINAL -----\n")
print(dump)

wrapped = dump if dump.startswith('(') else "({})".format(dump)
parsed = nestedExpr().parseString(wrapped).asList()

print("\n----- PARSED INTO LIST -----\n")
print(parsed)

def toXML(part, level=0):

    def grab_tag():
        return part.pop(0).lstrip(",")

    def grab_payload():
        payload = part.pop(0)
        if isinstance(payload, str):
            payload = payload.lstrip("=").rstrip(",")
        return payload

    xml = ""
    indent = "    " * level
    while part:
        tag     = grab_tag() or grab_tag()
        payload = grab_payload() or grab_payload()
        # grab twice, possibly, if '=' or ',' is in the way of what you're grabbing

        insides = payload if isinstance(payload, str) \
                          else "\n" + toXML(payload, level+1) + indent

        xml += "{indent}<{tag}>{insides}</{tag}>\n".format(**locals())

    return xml

print("\n----- SERIALIZED INTO XML ----\n")
print(toXML(parsed[0]))

Resulting in:

----- ORIGINAL -----

abc = (  
        bcd = (efg = 0, ghr = 5, lmn 10), 
        ghd = 5, 
        zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))

----- PARSED INTO LIST -----

[['abc', '=', ['bcd', '=', ['efg', '=', '0,', 'ghr', '=', '5,', 'lmn', '10'], ',', 'ghd', '=', '5,', 'zde', '=', ['dfs', '=', '10,', 'fge', '=20,', 'dfg', '=', ['sdf', '=', '3,', 'ert', '=', '5'], ',', 'juh', '=', '0']]]]

----- SERIALIZED INTO XML ----

<abc>
    <bcd>
        <efg>0</efg>
        <ghr>5</ghr>
        <lmn>10</lmn>
    </bcd>
    <ghd>5</ghd>
    <zde>
        <dfs>10</dfs>
        <fge>20</fge>
        <dfg>
            <sdf>3</sdf>
            <ert>5</ert>
        </dfg>
        <juh>0</juh>
    </zde>
</abc>

You can use the re module to parse nested expressions (though it is not recommended):

import re

def repl_flat(m):
    return "\n".join("<{0}>{1}</{0}>".format(*map(str.strip, s.partition('=')[::2]))
                     for s in m.group(1).split(','))

def eval_nested(expr):
    val, n = re.subn(r"\(([^)(]+)\)", repl_flat, expr)
    return val if n == 0 else eval_nested(val)

Example

print(eval_nested("(%s)" % (data,)))

Output

<abc><bcd><efg>0</efg>
<ghr>5</ghr>
<lmn>10</lmn></bcd>
<ghd>5</ghd>
<zde><dfs>10</dfs>
<fge>20</fge>
<dfg><sdf>3</sdf>
<ert>5</ert></dfg>
<juh>0</juh></zde></abc>
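
The question also mentions tokenizing with a stack, which can be done with the standard library alone. The following is a minimal sketch, assuming names are plain identifiers, values are unsigned integers, and the input is well formed; the TOKEN pattern and the dump_to_xml function are illustrative names, not taken from any of the answers.

```python
import re

# Names, integers, and the four punctuation characters used by the dump.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[()=,]")

def dump_to_xml(text, indent="    "):
    """Convert 'name = value' / 'name = ( ... )' text into indented XML."""
    lines = []    # finished output lines
    stack = []    # names of the elements that are currently open
    name = None   # most recent name, still waiting for its value
    for tok in TOKEN.findall(text):
        pad = indent * len(stack)
        if tok == "(":
            lines.append("%s<%s>" % (pad, name))
            stack.append(name)
        elif tok == ")":
            closed = stack.pop()
            lines.append("%s</%s>" % (indent * len(stack), closed))
        elif tok in ("=", ","):
            continue              # separators carry no data
        elif tok.isdigit():
            lines.append("%s<%s>%s</%s>" % (pad, name, tok, name))
        else:
            name = tok            # remember the name until its value arrives
    return "\n".join(lines)

data = """abc = (
        bcd = (efg = 0, ghr = 5, lmn = 10),
        ghd = 5,
        zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))"""

xml = dump_to_xml(data)
print(xml)
```

Because the equals sign is simply skipped, this sketch also accepts input such as lmn 10 where it is missing. It does no error checking: an unbalanced closing parenthesis raises IndexError, and characters outside the token pattern are silently ignored.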
