
Python: how to do this complex multiline regex involving escapes?

I have a file that looks like this:

...

- family:
  - home: house
    location: 53rd street|Austin|Texas|U.S
    type: old
original entry: '544'
  issues:
  - plumbing: fixed
    ref:
    - id: 28
      cost: 23 USD

- family:
  - home: house
    location: 53rd street|Austin|Texas|U.S
    type: old
original entry: '545'
  issues:
  - plumbing: fixed
    ref:
    - id: 1081
      cost: 33 USD

 ...

This file has hundreds of similar entries for other families.

I want to make it look like this:

- family:
  - home: house
    location: 53rd street|Austin|Texas|U.S
    type: old
original entry: '544'
  issues:
  - plumbing: fixed
    ref:
    - id: 28
      cost: 23 USD
    - id: 1081
      cost: 33 USD

I tried writing a multiline regex that finds the text in the middle and replaces it with nothing. Here is the pattern I attempted:

pattern = "r'\s- family:\n\s+- home: house\n\s+tag: 53rd street|Austin|Texas|U.S\n\s+type: old\n\original entry: \'554\'\n\s+issues:\n\s+- plumbing: fixed\n\s+ref:'"

This did not seem to work. I then tried an online regex tool, which suggested:

pattern = "r'\s- family:\n\s+- home: house\n\s+tag: 53rd street\|Austin\|Texas\|U.S\n\s+type: old\n\original entry: '554'\n\s+issues:\n\s+- plumbing: fixed\n\s+ref:'"

This did not appear to work either. I have used my multiline regex function on simpler cases without a problem, so I know the regex code itself works. The tricky part is getting a pattern that matches.

I figure some characters are either not being escaped correctly or are being escaped too much. Also, this strategy does not seem to handle both of the original entry numbers, one after the other.

Is there a way this can be done? I guess one could use the entire two blocks as the pattern and the desired result as the replacement text, but that seems even bulkier and more difficult...
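A few things in the attempted patterns above look like likely causes, though this is a reading of the posted text rather than a confirmed diagnosis: the `r` prefix sits inside the quotes, so it becomes part of the pattern; an unescaped `|` means alternation; `\o` is not a valid regex escape; and the patterns say `tag:` and `'554'` where the file has `location:` and `'544'`/`'545'`. A minimal sketch of the "remove the middle" idea, assuming the blocks are separated by exactly one empty line (the names `block_one`, `block_two`, and `repeated_header` are made up for illustration):

```python
import re

# The two sample blocks from the question, separated by a blank line.
block_one = (
    "- family:\n"
    "  - home: house\n"
    "    location: 53rd street|Austin|Texas|U.S\n"
    "    type: old\n"
    "original entry: '544'\n"
    "  issues:\n"
    "  - plumbing: fixed\n"
    "    ref:\n"
    "    - id: 28\n"
    "      cost: 23 USD\n"
)
block_two = (
    block_one.replace("'544'", "'545'")
    .replace("id: 28", "id: 1081")
    .replace("23 USD", "33 USD")
)
text = block_one + "\n" + block_two

location = "location: 53rd street|Austin|Texas|U.S"

# Unescaped, '|' is alternation, so only the first alternative is matched.
partial = re.search(location, text).group()
# re.escape() makes every metacharacter in the string literal.
full = re.search(re.escape(location), text).group()

# The r prefix goes outside the quotes, and \d+ accepts any entry number.
# The leading \n is the blank line, so the first block is left alone.
repeated_header = re.compile(
    r"\n- family:\n"
    r"\s+- home: house\n"
    r"\s+" + re.escape(location) + r"\n"
    r"\s+type: old\n"
    r"original entry: '\d+'\n"
    r"\s+issues:\n"
    r"\s+- plumbing: fixed\n"
    r"\s+ref:\n"
)
merged = repeated_header.sub("", text)
print(merged)
```

Because the header is hard-coded, this only merges entries for this one family; a file with hundreds of different families would need the header captured rather than spelled out.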

A parser for this using pyparsing is uncomplicated. Here it is bound to the name p. Each line is defined as everything up to the end of the line, followed by the newline itself, and the entire file consists of OneOrMore of these. Since pyparsing skips whitespace by default, the empty lines disappear (and, as the output shows, so does the leading indentation).

>>> import pyparsing as pp
>>> theFile = open('temp.txt').read()
>>> p = pp.OneOrMore(pp.Combine(pp.restOfLine+pp.Suppress('\n')))
>>> for item in p.parseString(theFile):
...     item
... 
'- family:'
'- home: house'
'location: 53rd street|Austin|Texas|U.S'
'type: old'
"original entry: '544'"
'issues:'
'- plumbing: fixed'
'ref:'
'- id: 28'
'cost: 23 USD'
'- family:'
'- home: house'
'location: 53rd street|Austin|Texas|U.S'
'type: old'
"original entry: '545'"
'issues:'
'- plumbing: fixed'
'ref:'
'- id: 1081'
'cost: 33 USD'
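The listing above only splits the file into lines; it does not yet combine the two entries. As a rough sketch of the merge step in plain Python (not pyparsing), assuming blocks are separated by exactly one empty line, every block contains a `ref:` line, and two blocks belong together when everything up to `ref:` matches apart from the `original entry` line (`merge_blocks` and `sample` are hypothetical names):

```python
def merge_blocks(text):
    """Merge blocks that share a header, keeping the first 'original entry'."""
    merged = {}  # header key -> (header lines, accumulated ref lines)
    for block in text.strip("\n").split("\n\n"):
        lines = block.splitlines()
        # Everything up to and including 'ref:' is treated as the header.
        ref_at = next(i for i, line in enumerate(lines) if line.strip() == "ref:")
        header, refs = lines[: ref_at + 1], lines[ref_at + 1 :]
        # The entry number differs between duplicates, so leave it out of the key.
        key = tuple(line for line in header if not line.startswith("original entry:"))
        if key in merged:
            merged[key][1].extend(refs)
        else:
            merged[key] = (header, list(refs))
    blocks = ["\n".join(header + refs) for header, refs in merged.values()]
    return "\n\n".join(blocks) + "\n"


first = (
    "- family:\n"
    "  - home: house\n"
    "    location: 53rd street|Austin|Texas|U.S\n"
    "    type: old\n"
    "original entry: '544'\n"
    "  issues:\n"
    "  - plumbing: fixed\n"
    "    ref:\n"
    "    - id: 28\n"
    "      cost: 23 USD\n"
)
second = (
    first.replace("'544'", "'545'")
    .replace("id: 28", "id: 1081")
    .replace("23 USD", "33 USD")
)
sample = first + "\n" + second
result = merge_blocks(sample)
print(result)
```

This keeps the original indentation because it works on the raw lines, and it relies on dictionaries preserving insertion order (Python 3.7+) to keep the families in file order.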


 