Parsing nested structures with pyparsing

I'm trying to parse a particular syntax for positions in biological sequences. The positions can have forms like:

12           -- a simple position in the sequence
12+34        -- a complex position as a base (12) and offset(+34)
12_56        -- a range, from 12 to 56
12+34_56-78  -- a range as a start to end, where either or both may be simple or complex

I'd like to have these parsed as dicts, roughly like this:

12          -> { 'start': { 'base': 12, 'offset': 0 },  'end': None }
12+34       -> { 'start': { 'base': 12, 'offset': 34 }, 'end': None }
12_56       -> { 'start': { 'base': 12, 'offset': 0 },
                   'end': { 'base': 56, 'offset': 0 } }
12+34_56-78 -> { 'start': { 'base': 12, 'offset': 34 },
                   'end': { 'base': 56, 'offset': -78 } }

I've made several stabs using pyparsing. Here's one:

from pyparsing import *
integer = Word(nums)
signed_integer = Word('+-', nums)
underscore = Suppress('_')
position = integer.setResultsName('base') + Or(signed_integer,Empty).setResultsName('offset')
interval = position.setResultsName('start') + Or(underscore + position,Empty).setResultsName('end')

The results are close to what I want:

In [20]: hgvspyparsing.interval.parseString('12-34_56+78').asDict()
Out[20]: 
{'base': '56',
'end': (['56', '+78'], {'base': [('56', 0)], 'offset': [((['+78'], {}), 1)]}),
'offset': (['+78'], {}),
'start': (['12', '-34'], {'base': [('12', 0)], 'offset': [((['-34'], {}), 1)]})}

Two questions:

  1. asDict() only worked on the root ParseResults object. Is there a way to cajole pyparsing into returning a nested dict (and only that)?

  2. How do I get the optionality of the end of a range and the offset of a position? The Or() in the position rule doesn't cut it. (I tried similarly for the end of the range.) Ideally, I'd treat all positions as special cases of the most complex form (i.e., { start: {base, offset}, end: {base, offset} }), where the simpler cases use 0 or None.

Thanks!

Some general pyparsing tips:

Or(expr, empty) is better written as Optional(expr). Also, your Or expression was trying to create an Or with the class Empty; you probably meant to write Empty() or empty for the second argument.

expr.setResultsName("name") can now be written as expr("name").

If you want to apply structure to your results, use Group.

Use dump() instead of asDict() to better view the structure of your parsed results.

Here is how I would build up your expression:

from pyparsing import Word, nums, oneOf, Combine, Group, Optional

integer = Word(nums)

sign = oneOf("+ -")
signedInteger = Combine(sign + integer)

integerExpr = Group(integer("base") + Optional(signedInteger, default="0")("offset"))

integerRange = integerExpr("start") + Optional('_' + integerExpr("end"))


tests = """\
12
12+34
12_56
12+34_56-78""".splitlines()

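for t in tests:
    result = integerRange.parseString(t)
    print(t)
    # dump() shows both the token list and the named results
    print(result.dump())
    print(result.asDict())
    print(result.start.base, result.start.offset)
    # result.end is empty (falsy) when the input is not a range
    if result.end:
        print(result.end.base, result.end.offset)
    print()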

Prints:

12
[['12', '0']]
- start: ['12', '0']
  - base: 12
  - offset: 0
{'start': (['12', '0'], {'base': [('12', 0)], 'offset': [('0', 1)]})}
12 0

12+34
[['12', '+34']]
- start: ['12', '+34']
  - base: 12
  - offset: +34
{'start': (['12', '+34'], {'base': [('12', 0)], 'offset': [('+34', 1)]})}
12 +34

12_56
[['12', '0'], '_', ['56', '0']]
- end: ['56', '0']
  - base: 56
  - offset: 0
- start: ['12', '0']
  - base: 12
  - offset: 0
{'start': (['12', '0'], {'base': [('12', 0)], 'offset': [('0', 1)]}), 'end': (['56', '0'], {'base': [('56', 0)], 'offset': [('0', 1)]})}
12 0
56 0

12+34_56-78
[['12', '+34'], '_', ['56', '-78']]
- end: ['56', '-78']
  - base: 56
  - offset: -78
- start: ['12', '+34']
  - base: 12
  - offset: +34
{'start': (['12', '+34'], {'base': [('12', 0)], 'offset': [('+34', 1)]}), 'end': (['56', '-78'], {'base': [('56', 0)], 'offset': [('-78', 1)]})}
12 +34
56 -78
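
To get exactly the nested dict from the question (integer values, and None when there is no end), one option is to post-process the named results. The following is only a minimal sketch under the same grammar as above; to_position_dict and its inner convert function are hypothetical helpers written for illustration, not part of pyparsing, and the underscore is suppressed here because it carries no information in the result.

```python
from pyparsing import Word, nums, oneOf, Combine, Group, Optional, Suppress

integer = Word(nums)
signedInteger = Combine(oneOf("+ -") + integer)

# A position is a base plus an optional signed offset (defaulting to "0").
integerExpr = Group(integer("base") + Optional(signedInteger, default="0")("offset"))

# A range is a start position plus an optional "_" and end position.
integerRange = integerExpr("start") + Optional(Suppress("_") + integerExpr("end"))


def to_position_dict(text):
    """Hypothetical helper: parse text and return plain nested dicts."""
    result = integerRange.parseString(text, parseAll=True)

    def convert(pos):
        # int() accepts a leading sign, so "+34" becomes 34 and "-78" becomes -78.
        return {"base": int(pos["base"]), "offset": int(pos["offset"])}

    end = convert(result["end"]) if "end" in result else None
    return {"start": convert(result["start"]), "end": end}


for t in ["12", "12+34", "12_56", "12+34_56-78"]:
    print(t, "->", to_position_dict(t))
```

Passing parseAll=True makes the parser reject trailing garbage instead of silently stopping at the first character it cannot match.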

Is the actual syntax more complicated than your examples? Because the parsing can be done fairly easily in pure Python:

bases = ["12", "12+34", "12_56", "12+34", "12+34_56-78"]

def parse_base(base_string):

    def parse_single(s):
        if '-' in s:
            offset_start = s.find("-")
            base, offset = int(s[:offset_start]), int(s[offset_start:])
        elif '+' in s:
            offset_start = s.find("+")
            base, offset = int(s[:offset_start]), int(s[offset_start:])
        else:
            base = int(s)
            offset = 0
        return {'base': base, 'offset': offset}

    range_split = base_string.split('_')
    if len(range_split) == 1:
        start = range_split[0]
        return {'start': parse_single(start), 'end': None}
    elif len(range_split) == 2:
        start, end = range_split
        return {'start': parse_single(start),
                'end': parse_single(end)}

Output:

for b in bases:
    print(parse_base(b))

{'start': {'base': 12, 'offset': 0}, 'end': None}
{'start': {'base': 12, 'offset': 34}, 'end': None}
{'start': {'base': 12, 'offset': 0}, 'end': {'base': 56, 'offset': 0}}
{'start': {'base': 12, 'offset': 34}, 'end': None}
{'start': {'base': 12, 'offset': 34}, 'end': {'base': 56, 'offset': -78}}
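
The same pure-Python idea can also be sketched with the standard re module, which has the side benefit of rejecting malformed input rather than returning None. This assumes the grammar is exactly the four forms shown in the question; POSITION_RE and parse_position are hypothetical names chosen for this example.

```python
import re

# base, optional signed offset, then optionally "_" followed by another
# base and optional signed offset.
POSITION_RE = re.compile(
    r"(?P<start_base>\d+)(?P<start_offset>[+-]\d+)?"
    r"(?:_(?P<end_base>\d+)(?P<end_offset>[+-]\d+)?)?"
)


def parse_position(text):
    """Hypothetical helper: return the nested dict form, or raise ValueError."""
    match = POSITION_RE.fullmatch(text)
    if match is None:
        raise ValueError("not a valid position: %r" % text)

    def build(prefix):
        base = match.group(prefix + "_base")
        if base is None:
            # The whole optional end part was absent.
            return None
        offset = match.group(prefix + "_offset")
        # A missing offset is treated as 0; int() handles the leading sign.
        return {"base": int(base), "offset": int(offset or 0)}

    return {"start": build("start"), "end": build("end")}


for b in ["12", "12+34", "12_56", "12+34_56-78"]:
    print(parse_position(b))
```

If the real syntax turns out to be richer than these examples, a grammar-based parser such as pyparsing is likely to scale better than a single regular expression.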
