
Python regex: match words in a multiline pattern

I have text that contains several xml blocks, each preceded by metadata, like this:

Block 1
2017-02-01 12:00
<?xml version="1.0" encoding="UTF-8"?>
<block>
 <elt>text</elt>
 <elt>more text</elt>
 <block>
  <elt>words</elt>
 </block>
</block>

Block 2
2017-02-01 12:15
<?xml version="1.0" encoding="UTF-8"?>
<block>
 <block>
  <elt>text</elt>
  <block>
   <elt>words</elt>
  </block>
  <elt>more text</elt>
 </block>
 <elt>word</elt>
</block>

I need to pull out the xml text and skip over the metadata. I can do it iteratively like this:

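messages = []
while True:
    # Skip the metadata by jumping to the next XML declaration.
    start = xml.find('<?xml')
    if start == -1:
        break
    xml = xml[start:]
    # A blank line marks the end of the current block.
    end = xml.find('\n\n')
    if end == -1:
        # Last block: it runs to the end of the string.
        messages.append(xml)
        break
    else:
        messages.append(xml[:end])
        xml = xml[end:]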

But I'd like to use a regular expression instead. The problem is that I need to match either two consecutive line breaks (\n\n) or the end of the string (\Z), and I can't get that to work. I've tried this:

re.findall('<\?xml.*?[\n\n|\Z]', xml, re.DOTALL)

but I just get ['<?xml version="1.0" encoding="UTF-8"?>\n', '<?xml version="1.0" encoding="UTF-8"?>\n'].

I've used \b in the past to match words, but adding it changes nothing:

>>> re.findall('<\?xml.*?[(\b\n\n\b)|\Z]', xml, re.DOTALL)
['<?xml version="1.0" encoding="UTF-8"?>\n', '<?xml version="1.0" encoding="UTF-8"?>\n']

I can't figure out how to make it work.

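You're trying to match the end of the string OR two newlines inside a character class ([]). That doesn't work: a character class matches a single character from its set, and | is not alternation inside it, so the lazy .*? stops at the first newline it finds.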
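Here is a minimal sketch of the difference. The sample string is made up for illustration and shortened from the question's data, and I've dropped \Z from the class to keep the focus on the newline case:

```python
import re

# Hypothetical sample, shortened from the question's data.
sample = '<?xml version="1.0"?>\n<block>\n</block>\n\nBlock 2'

# Inside [...] the "|" is a literal pipe and "\n\n" is just "\n" listed twice,
# so the lazy ".*?" stops at the first single newline.
char_class = re.findall(r'<\?xml.*?[\n|]', sample, re.DOTALL)

# Outside a character class, "|" is real alternation between two alternatives.
alternation = re.findall(r'<\?xml.*?(?:\n\n|\Z)', sample, re.DOTALL)

print(char_class)
print(alternation)
```

Note that the alternation version consumes the blank line, so the two newlines end up in the match.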

I'd match them in a lookahead instead. A lookahead doesn't consume characters or create a capturing group (unlike ordinary grouping parentheses), so findall returns the whole match:

re.findall(r'<\?xml.*?(?=\n\n|\Z)', xml, re.DOTALL)
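As a quick check, here is a self-contained sketch of the lookahead on input shaped like the question's. The text variable and its contents are assumptions for illustration, not the asker's real data:

```python
import re

# Hypothetical input in the same shape as the question's data.
text = (
    'Block 1\n2017-02-01 12:00\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n<block>\n <elt>text</elt>\n</block>\n'
    '\n'
    'Block 2\n2017-02-01 12:15\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n<block>\n <elt>word</elt>\n</block>'
)

# The lookahead asserts "blank line or end of string" without consuming it,
# so each match ends right after the closing tag.
messages = re.findall(r'<\?xml.*?(?=\n\n|\Z)', text, re.DOTALL)

for message in messages:
    print(repr(message))
```

If the input ends with a trailing newline, the last match will include it, so you may want to call .rstrip() on the results.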

Another workaround is to match up to the outermost closing </block>, which in your sample is the only one that starts at the beginning of a line (the nested ones are indented):

re.findall(r'<\?xml.*?\n</block>', xml, re.DOTALL)
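A small sketch of that second approach, again on made-up input. It assumes nested closing tags are always indented and the outer one never is, so treat it as fragile if the formatting can vary:

```python
import re

# Hypothetical input: nested </block> tags are indented, the outer one is not.
text = (
    'Block 1\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<block>\n'
    ' <block>\n'
    '  <elt>words</elt>\n'
    ' </block>\n'
    '</block>\n'
    '\n'
    'Block 2\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<block>\n'
    ' <elt>word</elt>\n'
    '</block>\n'
)

# "\n</block>" only matches a closing tag in column 0, so the lazy ".*?"
# runs past the indented inner closing tags.
blocks = re.findall(r'<\?xml.*?\n</block>', text, re.DOTALL)

for block in blocks:
    print(repr(block))
```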
