Unable to match XML element using Python regular expression

Question

I have an XML document with the following structure:

> <?xml version="1.0" encoding="UTF-8"?> <!-- generated by CLiX/Wiki2XML
> [MPI-Inf, MMCI@UdS] $LastChangedRevision: 93 $ on 17.04.2009
> 12:50:48[mciao0826] --> <!DOCTYPE article SYSTEM "../article.dtd">
> <article xmlns:xlink="http://www.w3.org/1999/xlink"> <header>
> <title>Postmodern art</title> <id>192127</id> <revision>
> <id>244517133</id> <timestamp>2008-10-11T05:26:50Z</timestamp>
> <contributor> <username>FairuseBot</username> <id>1022055</id>
> </contributor> </revision> <categories> <category>Contemporary
> art</category> <category>Modernism</category> <category>Art
> movements</category> <category>Postmodern art</category> </categories>
> </header> <bdy> Postmodernism preceded by Modernism '' Postmodernity
> Postchristianity Postmodern philosophy Postmodern architecture
> Postmodern art Postmodernist film Postmodern literature Postmodern
> music Postmodern theater Critical theory Globalization Consumerism
> </bdy>

I am interested in capturing the text contained within ... and for that I wrote the following Python 3 regex code:

import re

# Read the whole XML file into a single string
with open("sample_xml.xml", "r") as file:
    xml_doc = file.read()

body_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc)

But 'body_text' is always an empty list. However, when I try to capture the text for the tags ... using this code:

category_text = re.findall(r'(.+)', xml_doc)

This does the job. Any idea(s) as to why the ... XML element code is not working?

Thanks!

Answer 1

By default, the special character . matches any character except a newline, so that regex cannot match text that spans multiple lines.
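To see the difference concretely, here is a minimal sketch. The `sample` string is a hypothetical, shortened stand-in for the document in the question, and it assumes the pattern that worked targeted the `<category>` tags, whose content sits on a single line.

```python
import re

# Hypothetical miniature of the question's document: the <category> content
# is on one line, while the <bdy> content spans several lines.
sample = (
    "<category>Modernism</category>\n"
    "<bdy> Postmodernism preceded by Modernism\n"
    "Postmodern art\n"
    "</bdy>\n"
)

# Without DOTALL, '.' stops at '\n', so only the single-line element matches.
single_line = re.findall(r"<category>(.+)</category>", sample)
multi_line = re.findall(r"<bdy>(.+)</bdy>", sample)

print(single_line)  # ['Modernism']
print(multi_line)   # []
```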

You can change this behavior by specifying the DOTALL flag, which makes . match newlines as well. To set that flag inline, put this at the start of your regular expression: (?s)
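As a rough illustration of the inline flag, the sketch below runs the question's pattern with `(?s)` prepended. The `sample` string is a made-up two-line body, not the real file.

```python
import re

# Made-up multi-line body standing in for the real document.
sample = "<bdy> first line\nsecond line\n</bdy>"

# (?s) at the very start of the pattern switches on DOTALL for the whole
# expression, so '.' now matches '\n' too.
inline = re.findall(r"(?s)<bdy>(.+)</bdy>", sample)

# Passing re.DOTALL as the flags argument should give the same result.
flagged = re.findall(r"<bdy>(.+)</bdy>", sample, re.DOTALL)

print(inline)             # [' first line\nsecond line\n']
print(inline == flagged)  # True
```

The inline form is handy when the pattern is passed around as a plain string and there is no place to supply a flags argument.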

More information on Python's regular expression syntax can be found here: https://docs.python.org/3/library/re.html#regular-expression-syntax

Answer 2

You can use re.DOTALL:

category_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc, re.DOTALL)

Output:

[" Postmodernism preceded by Modernism '' Postmodernity\n> Postchristianity Postmodern philosophy Postmodern architecture\n> Postmodern art Postmodernist film Postmodern literature Postmodern\n> music Postmodern theater Critical theory Globalization Consumerism\n> "]
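One caveat that may matter if a file holds more than one `<bdy>` element: with DOTALL, the greedy `.+` can run from the first opening tag to the last closing tag. The sketch below uses a hypothetical two-element string (the question's file has only one `<bdy>`) and compares it with the non-greedy `.+?`.

```python
import re

# Hypothetical input with two <bdy> elements; not taken from the question.
sample = "<bdy>one\n</bdy>\n<bdy>two\n</bdy>"

# Greedy: '.+' extends to the last '</bdy>', swallowing the tags in between.
greedy = re.findall(r"<bdy>(.+)</bdy>", sample, re.DOTALL)

# Non-greedy: '.+?' stops at the first '</bdy>' after each '<bdy>'.
lazy = re.findall(r"<bdy>(.+?)</bdy>", sample, re.DOTALL)

print(greedy)  # ['one\n</bdy>\n<bdy>two\n']
print(lazy)    # ['one\n', 'two\n']
```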
