Python regex: match words in multiline mode

Question

I have text that contains several XML blocks, each with metadata above it, like this:

Block 1
2017-02-01 12:00
<?xml version="1.0" encoding="UTF-8"?>
<block>
 <elt>text</elt>
 <elt>more text</elt>
 <block>
  <elt>words</elt>
 </block>
</block>

Block 2
2017-02-01 12:15
<?xml version="1.0" encoding="UTF-8"?>
<block>
 <block>
  <elt>text</elt>
  <block>
   <elt>words</elt>
  </block>
  <elt>more text</elt>
 </block>
 <elt>word</elt>
</block>

I need to pull out the XML text and skip over the metadata. I can do it iteratively like this:

messages = []
while True:
 start = xml.find('<?xml')
 if start == -1:
  break
 xml = xml[start:]
 end = xml.find('\n\n')
 if end == -1:
  messages.append(xml)
  break
 else:
  messages.append(xml[:end])
  xml = xml[end:]

But I'd like to use a regular expression instead. The problem is that I need to match either two consecutive line breaks ( \n\n ) or the end of the string ( \Z ), and that is where I'm stuck. I've tried this:

re.findall('<\?xml.*?[\n\n|\Z]', xml, re.DOTALL)

but I just get ['<?xml version="1.0" encoding="UTF-8"?>\n', '<?xml version="1.0" encoding="UTF-8"?>\n'] .

I've used \b in the past to match words, but it makes no difference here:

>>> re.findall('<\?xml.*?[(\b\n\n\b)|\Z]', xml, re.DOTALL)
['<?xml version="1.0" encoding="UTF-8"?>\n', '<?xml version="1.0" encoding="UTF-8"?>\n']

I can't figure out how to make it work.

Answer 1

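You're trying to match end of string OR 2 newlines inside a character class [] . That doesn't work: a character class matches exactly one character from its set, so | and multi-character sequences have no special meaning inside it.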
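To make the difference concrete, here is a minimal sketch. The sample string text is made up for illustration, and \Z is left out of the class to keep the sketch small. Inside the brackets, \n\n| is just the set "newline or literal |", so the lazy match stops at the first newline; alternation outside a class is what expresses "two newlines OR end of string".

```python
import re

# Hypothetical sample: two chunks separated by a blank line.
text = "<?xml a?>\n<b/>\n\n<?xml c?>\n<d/>"

# [\n\n|] is a set of single characters (newline or '|'),
# so the lazy .*? stops right after the first line.
in_class = re.findall(r'<\?xml.*?[\n\n|]', text, re.DOTALL)

# Alternation in a (non-capturing) group means
# "two newlines OR end of string".
alternation = re.findall(r'<\?xml.*?(?:\n\n|\Z)', text, re.DOTALL)

print(in_class)      # ['<?xml a?>\n', '<?xml c?>\n']
print(alternation)   # ['<?xml a?>\n<b/>\n\n', '<?xml c?>\n<d/>']
```

Note that the plain group still consumes the two newlines, which is why they show up at the end of the first result.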

I'd match them in a lookahead instead. Unlike ordinary grouping parentheses, a lookahead neither consumes text nor creates a group, so findall returns the whole match:

re.findall(r'<\?xml.*?(?=\n\n|\Z)', xml, re.DOTALL)
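As a quick check, the lookahead pattern can be run against a shortened version of the sample from the question. This is only a sketch: the input is abbreviated, and the variable name messages is borrowed from the iterative code in the question.

```python
import re

# Abbreviated version of the sample input from the question.
xml = """Block 1
2017-02-01 12:00
<?xml version="1.0" encoding="UTF-8"?>
<block>
 <elt>text</elt>
</block>

Block 2
2017-02-01 12:15
<?xml version="1.0" encoding="UTF-8"?>
<block>
 <elt>word</elt>
</block>"""

# Lazy match from '<?xml' up to (but not including) a blank line,
# or up to the end of the string for the last block.
messages = re.findall(r'<\?xml.*?(?=\n\n|\Z)', xml, re.DOTALL)

for message in messages:
    print(message)
    print('---')
```

If the input ends with a trailing newline, that newline is part of the last match, since \Z only matches at the very end of the string; calling rstrip() on each result is one simple way to drop it.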

Another good workaround would be to match up to the last </block> , the one that starts on a new line:

re.findall(r'<\?xml.*?\n</block>', xml, re.DOTALL)
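A short sketch of this second pattern, again on an abbreviated copy of the question's sample. It assumes, as in that sample, that nested closing tags are indented and only the outermost </block> starts at the beginning of a line; if that assumption does not hold for the real data, the match would end too early.

```python
import re

# Abbreviated sample with one nested block per message.
xml = """Block 1
2017-02-01 12:00
<?xml version="1.0" encoding="UTF-8"?>
<block>
 <elt>text</elt>
 <block>
  <elt>words</elt>
 </block>
</block>

Block 2
2017-02-01 12:15
<?xml version="1.0" encoding="UTF-8"?>
<block>
 <elt>word</elt>
</block>"""

# '\n</block>' only matches a closing tag at the start of a line,
# so the indented inner ' </block>' does not end the match.
messages = re.findall(r'<\?xml.*?\n</block>', xml, re.DOTALL)

print(len(messages))
```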

Answer 1 (accepted), 2017-02-01 20:02:47
