python regex: match words in a multiline pattern
I have text that contains several xml blocks with metadata above it, like this:
Block 1
2017-02-01 12:00
<?xml version="1.0" encoding="UTF-8"?>
<block>
    <elt>text</elt>
    <elt>more text</elt>
    <block>
        <elt>words</elt>
    </block>
</block>

Block 2
2017-02-01 12:15
<?xml version="1.0" encoding="UTF-8"?>
<block>
    <block>
        <elt>text</elt>
        <block>
            <elt>words</elt>
        </block>
        <elt>more text</elt>
    </block>
    <elt>word</elt>
</block>
I need to pull out the xml text and skip over the metadata. I can do it iteratively like this:
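messages = []
while True:
    # Skip the metadata by jumping to the next XML declaration.
    start = xml.find('<?xml')
    if start == -1:
        break
    xml = xml[start:]
    # A blank line marks the end of the current XML block.
    end = xml.find('\n\n')
    if end == -1:
        messages.append(xml)
        break
    else:
        messages.append(xml[:end])
        xml = xml[end:]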
But I'd like to use a regular expression instead. The problem I'm having is that I need to be able to match either 2 consecutive line breaks (\n\n) or the end of the string (\Z). I'm having trouble there. I've tried this:
re.findall('<\?xml.*?[\n\n|\Z]', xml, re.DOTALL)
but I just get ['<?xml version="1.0" encoding="UTF-8"?>\n', '<?xml version="1.0" encoding="UTF-8"?>\n'].
I've used \b in the past to match words, but that gives no change:
>>> re.findall('<\?xml.*?[(\b\n\n\b)|\Z]', xml, re.DOTALL)
['<?xml version="1.0" encoding="UTF-8"?>\n', '<?xml version="1.0" encoding="UTF-8"?>\n']
I can't figure out how to make it work.
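You're trying to match end of string OR 2 newlines in a character class []. That doesn't work: a character class matches exactly one character from the listed set, and inside it | and parentheses are literal characters, not alternation or grouping. That is why each match stops right after the first newline.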
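To illustrate the difference, here is a minimal sketch on a made-up sample string (the string and variable names are assumptions for illustration, not the original data). It compares the same alternatives written inside a character class and inside a non-capturing group:

```python
import re

sample = "ab\n\ncd|ef"  # hypothetical sample text

# Inside a character class, '|' is a literal and each match is ONE character,
# so this finds single newlines and the pipe, never the pair "\n\n".
in_class = re.findall(r'[\n\n|]', sample)

# In a non-capturing group, '|' is real alternation between sub-patterns:
# either two consecutive newlines, or the (empty) end of the string.
in_group = re.findall(r'(?:\n\n|\Z)', sample)

print(in_class)  # ['\n', '\n', '|']
print(in_group)  # ['\n\n', '']
```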
I'd match them in a lookahead (unlike standard grouping parentheses, it neither consumes characters nor creates a group, so findall returns the whole match):
re.findall(r'<\?xml.*?(?=\n\n|\Z)', xml, re.DOTALL)
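A self-contained sketch of the lookahead approach, assuming a shortened sample in which blocks are separated by one blank line (the sample text and the names text, pattern and messages are illustrative, not from the question):

```python
import re

# Hypothetical sample: metadata lines followed by XML, blocks separated by a blank line.
text = (
    'Block 1\n'
    '2017-02-01 12:00\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<block>\n'
    '  <elt>text</elt>\n'
    '</block>\n'
    '\n'
    'Block 2\n'
    '2017-02-01 12:15\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<block>\n'
    '  <elt>word</elt>\n'
    '</block>'
)

# Lazily match from the XML declaration up to (but not including)
# either a blank line or the end of the string.
pattern = re.compile(r'<\?xml.*?(?=\n\n|\Z)', re.DOTALL)
messages = pattern.findall(text)

for message in messages:
    print(message)
    print('---')
```

Because the lookahead consumes nothing, the blank line is not part of any match. Note that \Z matches only at the very end of the string, so if the input ends with a single trailing newline, the last match will include it.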
Another good workaround for this would be to match the last </block>, starting on a new line:
re.findall(r'<\?xml.*?\n</block>', xml, re.DOTALL)
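A sketch of this second approach, under two assumptions that are not guaranteed by the question: nested closing tags are indented, and the root element is always named block. The sample text and the names closing_tag and blocks are made up for illustration:

```python
import re

# Hypothetical sample: only the root </block> starts at the beginning of a line.
text = (
    'Block 1\n'
    '2017-02-01 12:00\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<block>\n'
    '  <elt>text</elt>\n'
    '  <block>\n'
    '    <elt>words</elt>\n'
    '  </block>\n'
    '</block>\n'
    '\n'
    'Block 2\n'
    '2017-02-01 12:15\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<block>\n'
    '  <elt>word</elt>\n'
    '</block>\n'
)

# Stop at the first </block> that immediately follows a newline,
# which skips the indented (nested) closing tags.
closing_tag = re.compile(r'<\?xml.*?\n</block>', re.DOTALL)
blocks = closing_tag.findall(text)

print(len(blocks))  # 2
```

Unlike the lookahead version, this one does not depend on blank lines between blocks, but it would cut a block short if a nested closing tag were not indented.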