Python: extract text from tag inside tag in XML Tree
I am currently parsing a Wikipedia dump, trying to extract some useful information. The dump is XML, and I want to extract only the text content of each page. I am wondering how to find all the text inside a tag that is itself inside another tag. I searched for similar questions, but only found ones dealing with a single tag. Here is an example of what I want to achieve:
<revision>
<timestamp>2001-01-15T13:15:00Z</timestamp>
<contributor>
<username>Foobar</username>
<id>65536</id>
</contributor>
<comment>I have just one thing to say!</comment>
<text>A bunch of [[text]] here.</text>
<minor />
</revision>
<example_tag>
<timestamp>2001-01-15T13:15:00Z</timestamp>
<contributor>
<username>Foobar</username>
<id>65536</id>
</contributor>
<comment>I have just one thing to say!</comment>
<text>A bunch of [[text]] here.</text>
<minor />
</example_tag>
How can I extract the text inside the text tag, but only when it is inside the revision tree?
You can use the xml.etree.ElementTree module for that, with an XPath query (ElementTree supports a limited subset of XPath):
import xml.etree.ElementTree as ET

root = ET.fromstring(the_xml_string)
for content in root.findall('.//revision/othertag'):
    # ... process content, for instance
    print(content.text)
(where the_xml_string is a string containing the XML code).
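As a minimal sketch, here is the same query run against a trimmed version of the sample from the question, with othertag replaced by text. The <root> wrapper and the changed wording inside example_tag are assumptions added for illustration: the wrapper is needed because ET.fromstring expects a single top-level element, and the different wording makes it visible which element matched.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: <root> is added only so that the two sample
# elements from the question form a single well-formed document.
the_xml_string = """
<root>
  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <text>A bunch of [[text]] here.</text>
  </revision>
  <example_tag>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <text>Text outside any revision.</text>
  </example_tag>
</root>
"""

root = ET.fromstring(the_xml_string)

# Only <text> elements that are direct children of a <revision> match;
# the <text> inside <example_tag> is skipped.
revision_texts = [content.text for content in root.findall('.//revision/text')]
print(revision_texts)  # ['A bunch of [[text]] here.']
```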
Or obtain a list of the text of each matching element with a list comprehension:
import xml.etree.ElementTree as ET
texts = [content.text for content in ET.fromstring(the_xml_string).findall('.//revision/othertag')]
So the .text attribute holds the inner text. Note that you will have to replace othertag with the actual tag name (for instance text). If that tag can be arbitrarily deep inside the revision tag, use .//revision//othertag as the XPath query instead.
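The difference between the two queries can be sketched as follows. The <page> and <wrapper> elements are hypothetical, invented here only to place text one level deeper than a direct child of revision.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <wrapper> is invented so that <text> is a
# grandchild of <revision> rather than a direct child.
nested_xml = """
<page>
  <revision>
    <wrapper>
      <text>Nested text.</text>
    </wrapper>
  </revision>
</page>
"""

root = ET.fromstring(nested_xml)

# A single slash matches direct children of <revision> only.
direct = [e.text for e in root.findall('.//revision/text')]

# A double slash matches descendants of <revision> at any depth.
anywhere = [e.text for e in root.findall('.//revision//text')]

print(direct)    # []
print(anywhere)  # ['Nested text.']
```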