Python: extract text from tag inside tag in XML Tree

I am currently parsing a Wikipedia dump, trying to extract some useful information. The dump is XML, and I want to extract only the text content of each page. How can I find all the text inside a tag that is itself nested inside another tag? I searched for similar questions, but only found ones dealing with a single tag. Here is an example of what I want to achieve:

  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </revision>

  <example_tag>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </example_tag>

How can I extract the text inside the text tag, but only when it is inside the revision element?

You can use the xml.etree.ElementTree module for that, with an XPath query (ElementTree supports a limited subset of XPath):

import xml.etree.ElementTree as ET

root = ET.fromstring(the_xml_string)
for content in root.findall('.//revision/othertag'):
    # ... process content, for instance
    print(content.text)

(where the_xml_string is a string containing the XML code).

Or obtain a list of the text contents with a list comprehension:

import xml.etree.ElementTree as ET

texts = [content.text for content in ET.fromstring(the_xml_string).findall('.//revision/othertag')]

The .text attribute holds the element's inner text. Note that you will have to replace othertag with the actual tag name (for instance text). If that tag can be arbitrarily deep inside the revision tag, use .//revision//othertag as the XPath query instead.
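To make the difference between the two queries concrete, here is a minimal sketch with othertag replaced by text. The sample string is an assumption: it wraps a shortened version of the question's fragments in a hypothetical page root so that the input is well-formed, and the meta element is made up purely to show the "arbitrarily deep" case.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a shortened version of the question's fragments,
# wrapped in a <page> root so the string is a well-formed document.
# The <meta> element is invented to demonstrate deeper nesting.
the_xml_string = """
<page>
  <revision>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <meta><text>nested deeper</text></meta>
  </revision>
  <example_tag>
    <text>Not inside a revision.</text>
  </example_tag>
</page>
"""

root = ET.fromstring(the_xml_string)

# Direct children only: <text> elements whose parent is a <revision>.
direct = [el.text for el in root.findall('.//revision/text')]

# Any depth: every <text> element somewhere below a <revision>.
nested = [el.text for el in root.findall('.//revision//text')]

print(direct)  # ['A bunch of [[text]] here.']
print(nested)  # ['A bunch of [[text]] here.', 'nested deeper']
```

In both cases the text under example_tag is skipped. Keep in mind that .//revision only matches descendants of root, so if the parsed root is itself the revision element, query 'text' (or './/text') on it directly.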
