python: replacing regex with BNF or pyparsing

I am parsing a relatively simple text, where each line describes a game unit. I have little knowledge of parsing techniques, so I used the following ad hoc solution:

import collections
import re

class ParserError(Exception):
    # raised when part of the input cannot be parsed
    pass

class Unit:
    # rules is an ordered sequence of (tag, regex) pairs, intended to be applied in the given order
    # the group named V corresponds to the value (if any) for that particular tag
    rules = (
        ('Level', r'Lv\. (?P<V>\d+)'),
        ('DPS', r'DPS: (?P<V>\d+)'),
        ('Type', r'(?P<V>Tank|Infantry|Artillery)'),
        # the XXX will be expanded into an alternation of the valid traits
        # note: (XXX| )* wouldn't work; it will match the first space it finds,
        # and stop at that if it's in front of something other than a trait
        ('Traits', r'(?P<V>(XXX)(XXX| )*)'),
        # flavor text, if any, ends with a dot
        ('FlavorText', r'(?P<V>.*\."?$)'),
        )
    rules = collections.OrderedDict(rules)
    traits = '|'.join(['All-Terrain', 'Armored', 'Anti-Aircraft', 'Motorized'])
    rules['Traits'] = re.sub('XXX', traits, rules['Traits'])

    for x in rules:
        # rename the placeholder group V to the tag name, then compile
        rules[x] = re.sub('<V>', '<'+x+'>', rules[x])
        rules[x] = re.compile(rules[x])
    def __init__(self, data):
        # data looks like this:
        # Lv. 5 Tank DPS: 55 Motorized Armored
        for field, regex in Unit.rules.items():
            # cut the first match out of data; parse() stores its value
            data = regex.sub(self.parse, data, count=1)
        # anything other than whitespace left over was not recognized
        if data.strip():
            raise ParserError('Could not parse part of the input: ' + data)

    def parse(self, m):
        if len(m.groupdict()) != 1:
            raise Exception('Expected a single named group')
        field, value = m.groupdict().popitem()
        setattr(self, field, value)
        return ''

It works fine, but I feel I have reached the limit of what regex can do. Specifically, in the case of Traits, the value ends up being a string that I need to split and convert into a list at a later point: e.g., obj.Traits would be set to 'Motorized Armored' in this code, but changed to ('Motorized', 'Armored') in a later function.

I'm thinking of converting this code to use either an EBNF or a pyparsing grammar, or something like that. My goals are:

  • make this code neater and less error-prone
  • avoid the ugly treatment of the case with a list of values (where I need to do a replacement inside the regex first, and later post-process the result to convert a string into a list)

What would be your suggestions about what to use, and how to rewrite the code?

P.S. I skipped some parts of the code to avoid clutter; if I introduced any errors in the process, sorry - the original code does work :)

I started to write up a coaching guide for pyparsing, but looking at your rules, they translate pretty easily into pyparsing elements themselves, without dealing with EBNF, so I just cooked up a quick sample:

from pyparsing import Word, nums, oneOf, Group, OneOrMore, Regex, Optional

integer = Word(nums)
level = "Lv." + integer("Level")
dps = "DPS:" + integer("DPS")
type_ = oneOf("Tank Infantry Artillery")("Type")
traits = Group(OneOrMore(oneOf("All-Terrain Armored Anti-Aircraft Motorized")))("Traits")
flavortext = Regex(r".*\.$")("FlavorText")

rule = (Optional(level) & Optional(dps) & Optional(type_) & 
        Optional(traits) & Optional(flavortext))

I included the Regex example so you could see how a regular expression can be dropped into an existing pyparsing grammar. Composing rule with '&' operators means that the individual items can be found in any order (so the grammar takes care of iterating over all the rules, instead of you doing it in your own code). Pyparsing uses operator overloading to build up complex parsers from simple ones: '+' for sequence, '|' and '^' for alternatives (first-match or longest-match), and so on.
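To make the operators concrete, here is a small sketch that can be run on its own. The expressions and input strings are made up for illustration and are not part of the grammar above; the "Anti" / "Anti-Aircraft" pair is chosen only because one literal is a prefix of the other, which is where '|' and '^' differ.

```python
from pyparsing import Literal, Word, nums, oneOf, Optional

# '+' builds a sequence: the literal "DPS:" followed by an integer.
dps = "DPS:" + Word(nums)
print(dps.parseString("DPS: 55").asList())  # ['DPS:', '55']

# '|' takes the first alternative that matches, in the order written.
first_match = Literal("Anti") | Literal("Anti-Aircraft")
# '^' tries every alternative and keeps the longest match.
longest_match = Literal("Anti") ^ Literal("Anti-Aircraft")

print(first_match.parseString("Anti-Aircraft").asList())    # ['Anti']
print(longest_match.parseString("Anti-Aircraft").asList())  # ['Anti-Aircraft']

# '&' matches its parts in whatever order they appear in the input.
unit_type = oneOf("Tank Infantry Artillery")
any_order = Optional(dps) & Optional(unit_type)
print(any_order.parseString("Tank DPS: 55").asList())  # ['Tank', 'DPS:', '55']
print(any_order.parseString("DPS: 55 Tank").asList())  # ['DPS:', '55', 'Tank']
```

Note that oneOf already reorders its alternatives so that longer strings are tried first, which is why the traits list in the grammar does not need '^'.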

Here is how the parsed results would look - note that I added results names, just as you used named groups in your regexen:

data = "Lv. 5 Tank DPS: 55 Motorized Armored"

parsed_data = rule.parseString(data)
print(parsed_data.dump())
print(parsed_data.DPS)
print(parsed_data.Type)
print(' '.join(parsed_data.Traits))

prints:

['Lv.', '5', 'Tank', 'DPS:', '55', ['Motorized', 'Armored']]
- DPS: 55
- Level: 5
- Traits: ['Motorized', 'Armored']
- Type: Tank
55
Tank
Motorized Armored
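As a rough sketch of how this could be folded back into your Unit class, something like the following might work. The class layout, the ParserError exception, and the int-converting parse action are my assumptions about what you want, not something taken from your original code; the grammar itself is the same as in the sample above.

```python
from pyparsing import (Word, nums, oneOf, Group, OneOrMore, Regex,
                       Optional, ParseException)

# Same grammar as in the sample above, except that a parse action on
# `integer` converts the matched digits to int at parse time (an addition).
integer = Word(nums).setParseAction(lambda tokens: int(tokens[0]))
level = "Lv." + integer("Level")
dps = "DPS:" + integer("DPS")
type_ = oneOf("Tank Infantry Artillery")("Type")
traits = Group(OneOrMore(oneOf("All-Terrain Armored Anti-Aircraft Motorized")))("Traits")
flavortext = Regex(r".*\.$")("FlavorText")
rule = (Optional(level) & Optional(dps) & Optional(type_) &
        Optional(traits) & Optional(flavortext))


class ParserError(Exception):
    """Hypothetical stand-in for the exception raised in the question."""


class Unit:
    def __init__(self, data):
        try:
            # parseAll=True rejects input that has unparsed text left over,
            # which replaces the manual "is anything left in data?" check.
            parsed = rule.parseString(data, parseAll=True)
        except ParseException as exc:
            raise ParserError('Could not parse the input: ' + str(exc)) from exc
        # Fields that were not present in the line come back as None.
        self.Level = parsed.get("Level")
        self.DPS = parsed.get("DPS")
        self.Type = parsed.get("Type")
        # Group() already collected the traits, so no string splitting is needed.
        self.Traits = tuple(parsed.get("Traits", ()))
        self.FlavorText = parsed.get("FlavorText")


unit = Unit("Lv. 5 Tank DPS: 55 Motorized Armored")
print(unit.Level, unit.Type, unit.DPS, unit.Traits)
```

With this layout the trait list is defined in one place, and Traits arrives as a sequence straight from the parser, which covers both of the goals you listed.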

Please stop by the wiki and see the other examples. You can install pyparsing with easy_install, but if you download the source distribution from SourceForge, there is a lot of additional documentation.
