python regex: match words in a multiline pattern

I have text that contains several XML blocks, each preceded by metadata, like this:

Block 1
2017-02-01 12:00
<?xml version="1.0" encoding="UTF-8"?>
<block>
 <elt>text</elt>
 <elt>more text</elt>
 <block>
  <elt>words</elt>
 </block>
</block>

Block 2
2017-02-01 12:15
<?xml version="1.0" encoding="UTF-8"?>
<block>
 <block>
  <elt>text</elt>
  <block>
   <elt>words</elt>
  </block>
  <elt>more text</elt>
 </block>
 <elt>word</elt>
</block>

I need to pull out the XML text and skip over the metadata. I can do it iteratively like this:

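messages = []
while True:
    # Find the start of the next XML declaration; stop when none remain.
    start = xml.find('<?xml')
    if start == -1:
        break
    xml = xml[start:]
    # A blank line marks the end of the current block.
    end = xml.find('\n\n')
    if end == -1:
        # Last block: it runs to the end of the string.
        messages.append(xml)
        break
    messages.append(xml[:end])
    xml = xml[end:]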

But I'd like to use a regular expression instead. The problem is that I need to match either two consecutive line breaks ( \n\n ) or the end of the string ( \Z ), and I'm having trouble there. I've tried this:

re.findall('<\?xml.*?[\n\n|\Z]', xml, re.DOTALL)

but I just get ['<?xml version="1.0" encoding="UTF-8"?>\n', '<?xml version="1.0" encoding="UTF-8"?>\n'] .

I've used \b in the past to match words, but that changes nothing:

>>> re.findall('<\?xml.*?[(\b\n\n\b)|\Z]', xml, re.DOTALL)
['<?xml version="1.0" encoding="UTF-8"?>\n', '<?xml version="1.0" encoding="UTF-8"?>\n']

I can't figure out how to make it work.

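You're trying to match the end of the string OR two newlines inside a character class [] . That doesn't work: a character class matches exactly one character from its set, and | inside it is a literal pipe rather than alternation, so the lazy match stops at the first newline.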
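To make the difference concrete, here is a small sketch on a made-up string (the `sample` value is an assumption for illustration, not the question's data). It contrasts `|` inside a character class with `|` inside a group:

```python
import re

# Made-up input: two consecutive newlines between 'a' and 'b'.
sample = "a\n\nb"

# Inside [], '|' is a literal and repeated members are redundant,
# so [\n\n|] means "exactly one newline or one pipe character".
in_class = re.findall(r"a[\n\n|]", sample)

# Inside a non-capturing group, '|' is alternation,
# so \n\n is matched as a two-character sequence.
in_group = re.findall(r"a(?:\n\n|\Z)", sample)

print(in_class)  # ['a\n']
print(in_group)  # ['a\n\n']
```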

I'd match them in a lookahead instead. Unlike standard grouping parentheses, a lookahead neither consumes text nor creates a group, so findall returns the whole match:

re.findall(r'<\?xml.*?(?=\n\n|\Z)', xml, re.DOTALL)
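As a quick check, the lookahead version can be run against a shortened copy of the sample text. The `text` variable below is an abbreviated stand-in for the question's input, not the full original data, and the pattern is written as a raw string so the backslashes reach the regex engine unchanged:

```python
import re

# Abbreviated stand-in for the question's input: metadata lines, then an XML block.
text = (
    "Block 1\n"
    "2017-02-01 12:00\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<block>\n"
    " <elt>text</elt>\n"
    "</block>\n"
    "\n"
    "Block 2\n"
    "2017-02-01 12:15\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<block>\n"
    " <elt>word</elt>\n"
    "</block>"
)

# (?=\n\n|\Z) asserts "blank line or end of string" without consuming it,
# so each match stops right after the block's last character.
messages = re.findall(r"<\?xml.*?(?=\n\n|\Z)", text, re.DOTALL)

for message in messages:
    print(message)
    print("---")
```

Note that `\Z` matches only at the very end of the string, so if the input ends with a single trailing newline, that newline becomes part of the last match; calling `.strip()` on each message is one way to drop it.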

Another good workaround would be to match up to the outermost closing </block> , which is the only one that starts at the beginning of a line:

re.findall(r'<\?xml.*?\n</block>', xml, re.DOTALL)
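A minimal sketch of this variant, again on an abbreviated stand-in for the question's data (the `text` contents are an assumption for illustration). It depends on the formatting shown in the question: nested closing tags are indented, so a newline followed immediately by `</block>` can only be the outermost one:

```python
import re

# Abbreviated stand-in: each XML block contains one nested, indented <block>.
text = (
    "Block 1\n"
    "2017-02-01 12:00\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<block>\n"
    " <block>\n"
    "  <elt>words</elt>\n"
    " </block>\n"
    "</block>\n"
    "\n"
    "Block 2\n"
    "2017-02-01 12:15\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<block>\n"
    " <elt>word</elt>\n"
    "</block>\n"
)

# The lazy .*? skips " </block>" (newline, then a space) and stops at the
# first "\n</block>", which is the unindented, outermost closing tag.
messages = re.findall(r"<\?xml.*?\n</block>", text, re.DOTALL)

print(len(messages))
```

This would break if the XML were not indented, or if the root element had a different name, in which case the lookahead version is the safer choice.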
