Traversing an XML tree without namespace in Python

I am parsing a large XML file, which essentially contains a table. The nodes in the XML don't always have names. Nested deep within several tags is what is basically an HTML-like table, with <TD> elements containing raw (numeric) data inside row (<TR>) tags. Before I can iterate through to the table, there is a whole bunch of metadata tags that I'm not interested in. For instance:

<?xml version="1.0" ?>
<soap:Envelope xmlns:soap="--ommitted--" xmlns:xsi="--ommitted--">
    <soap:Body>
        <FetchReportResponse xmlns="URL1">
            <FetchReportResult xmlns="URL2">
                <REPORT>
                    <TITLE>CROSS VISITING REPORT</TITLE>
                    <SUBTITLE/>
                    <SUMMARY>
                        <GEOGRAPHY>--ommitted--</GEOGRAPHY>
                        <LOCATION>--ommitted--</LOCATION>
                        <TIMEPERIOD>--ommitted--</TIMEPERIOD>
                        <TARGET>--ommitted--</TARGET>
                        <MEDIA>--ommitted--</MEDIA>
                        <DATE>--ommitted--</DATE>
                        <USER>--ommitted--</USER>
                    </SUMMARY>
                    <TABLE>
                        <THEAD>
                            <TR>
                              <TH>--ommitted--</TH>
                              <TD>--ommitted--</TD>
                              <TD>--ommitted--</TD>
                              <TD>--ommitted--</TD>
                              <TD>--ommitted--</TD>
                              <TD>--ommitted--</TD>
                              <TD>--ommitted--</TD>

I am new to XML parsing, so I'm following this. I have the following code to read an XML file and create an ElementTree object.

import xml.etree.ElementTree as ET

tree = ET.parse('./../filename.xml')
root = tree.getroot()  # the top-level <soap:Envelope> element
print(root)

This understandably prints the following:

<Element '{http://schemas.xmlsoap.org/soap/envelope/}Envelope' at 0x00000230CAC23318>

However, when I try to use the XPath convention to traverse the tree from here on, I'm unable to. For instance,

print(root.find("./Body"))

prints None, even though <Body> is clearly nested inside <Envelope>.

EDIT: Following Mark Tolonen's answer, I was able to get to the Body tag, but how do I get beyond that? More specifically, I want to reach the <TABLE> tag.

You need the fully qualified name. Since the element is soap:Body, qualify Body with the xmlns:soap value, which (implied by your Envelope example) is:

print(root.find("./{http://schemas.xmlsoap.org/soap/envelope/}Body"))
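To see why the unqualified lookup fails, it can help to print the tags as ElementTree stores them. The following is a minimal sketch, assuming a trimmed-down stand-in for the document in the question (the SAMPLE string and the variable names are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the file in the question.
SAMPLE = """<?xml version="1.0" ?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
    <soap:Body>
        <FetchReportResponse xmlns="URL1"/>
    </soap:Body>
</soap:Envelope>"""

root = ET.fromstring(SAMPLE)

# ElementTree keeps every tag in "{namespace-uri}localname" form,
# so the prefix "soap:" never appears in element.tag.
child_tags = [child.tag for child in root]
print(root.tag)
print(child_tags)

SOAP = "http://schemas.xmlsoap.org/soap/envelope/"
unqualified = root.find("./Body")           # no namespace given, so no match
qualified = root.find(f"./{{{SOAP}}}Body")  # fully qualified name matches
print(unqualified, qualified.tag)
```

Because the stored tag includes the URI, a path has to include it too, either inline in curly braces as here or through a prefix mapping passed to find().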

In addition to the XPath section, you also need to pay attention to the Namespaces section of the documentation, since your XML contains various namespaces, with and without a prefix (the latter is known as the default namespace). Notice that the TABLE element inherits its namespace from the nearest ancestor that declares a default namespace: FetchReportResult. So to find TABLE you need to use the default namespace URI "URL2", either with the curly-brace syntax or with a prefix-to-URI dictionary:

ns = { "u2": "URL2" }
tables = root.findall(".//u2:TABLE", ns)
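Once TABLE is found, the same mapping has to be passed for every nested lookup, because TR and TD inherit the "URL2" default namespace as well. Below is a rough end-to-end sketch under some assumptions: the sample document is a made-up miniature of the one in the question, the cell values are invented, and the TBODY element is a guess, since the original sample is cut off after THEAD.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the report; values and TBODY are assumptions.
SAMPLE = """<?xml version="1.0" ?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
    <soap:Body>
        <FetchReportResponse xmlns="URL1">
            <FetchReportResult xmlns="URL2">
                <REPORT>
                    <TITLE>CROSS VISITING REPORT</TITLE>
                    <TABLE>
                        <THEAD>
                            <TR><TH>name</TH><TD>1</TD><TD>2</TD></TR>
                        </THEAD>
                        <TBODY>
                            <TR><TH>row</TH><TD>3</TD><TD>4</TD></TR>
                        </TBODY>
                    </TABLE>
                </REPORT>
            </FetchReportResult>
        </FetchReportResponse>
    </soap:Body>
</soap:Envelope>"""

root = ET.fromstring(SAMPLE)

# "u2" is an arbitrary prefix; only the URI has to match the document.
ns = {"u2": "URL2"}

table = root.find(".//u2:TABLE", ns)

# Collect the text of each TD, one list per TR, in document order.
rows = []
for tr in table.iterfind(".//u2:TR", ns):
    rows.append([td.text for td in tr.findall("u2:TD", ns)])
print(rows)

# On Python 3.8 or later, "{*}" matches a tag in any namespace,
# which avoids spelling out the URI at all.
table_any_ns = root.find(".//{*}TABLE")
```

The wildcard form is convenient for quick exploration, while the explicit mapping is safer when the same local name could occur in more than one namespace.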
