How can I parse an XML document into a Python object?

I'm trying to consume an XML API. I'd like to have some Python objects that represent the XML data. I have several XSD files and some example API responses from the documentation.

Here's one example XML response:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<serial:serialHeaderType xmlns:isan="http://www.isan.org/ISAN/isan"
                         xmlns:title="http://www.isan.org/schema/v1.11/common/title"
                         xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
                         xmlns:externalid="http://www.isan.org/schema/v1.11/common/externalid"
                         xmlns:common="http://www.isan.org/schema/v1.11/common/common"
                         xmlns:participant="http://www.isan.org/schema/v1.11/common/participant"
                         xmlns:language="http://www.isan.org/schema/v1.11/common/language"
                         xmlns:country="http://www.isan.org/schema/v1.11/common/country">
    <common:status>
        <common:DataType>SERIAL_HEADER_TYPE</common:DataType>
        <common:ISAN root="0000-0002-3B9F"/>
        <common:WorkStatus>ACTIVE</common:WorkStatus>
    </common:status>
    <serial:SerialHeaderId root="0000-0002-3B9F"/>
    <serial:MainTitles>
        <title:TitleDetail>
            <title:Title>Braquo</title:Title>
            <title:Language>
                <language:LanguageLabel>French</language:LanguageLabel>
                <language:LanguageCode>
                    <language:CodingSystem>ISO639_2</language:CodingSystem>
                    <language:ISO639_2Code>FRE</language:ISO639_2Code>
                </language:LanguageCode>
            </title:Language>
            <title:TitleKind>ORIGINAL</title:TitleKind>
        </title:TitleDetail>
    </serial:MainTitles>
    <serial:TotalEpisodes>11</serial:TotalEpisodes>
    <serial:TotalSeasons>0</serial:TotalSeasons>
    <serial:MinDuration>
        <common:TimeUnit>MIN</common:TimeUnit>
        <common:TimeValue>45</common:TimeValue>
    </serial:MinDuration>
    <serial:MaxDuration>
        <common:TimeUnit>MIN</common:TimeUnit>
        <common:TimeValue>144</common:TimeValue>
    </serial:MaxDuration>
    <serial:MinYear>2009</serial:MinYear>
    <serial:MaxYear>2009</serial:MaxYear>
    <serial:MainParticipantList>
        <participant:Participant>
            <participant:FirstName>Frédéric</participant:FirstName>
            <participant:LastName>Schoendoerffer</participant:LastName>
            <participant:RoleCode>DIR</participant:RoleCode>
        </participant:Participant>
        <participant:Participant>
            <participant:FirstName>Karole</participant:FirstName>
            <participant:LastName>Rocher</participant:LastName>
            <participant:RoleCode>ACT</participant:RoleCode>
        </participant:Participant>
    </serial:MainParticipantList>
    <serial:CompanyList>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>R.T.B.F.</common:CompanyName>
        </common:Company>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>Capa Drama</common:CompanyName>
        </common:Company>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>Marathon</common:CompanyName>
        </common:Company>
    </serial:CompanyList>
</serial:serialHeaderType>

I tried simply ignoring the XSD and using lxml.objectify on the XML I'd get from the API, but I ran into a problem with namespaces. Having to refer to every child node with its explicit namespace was a real pain and doesn't make for readable code.

from lxml import objectify
# `response` holds the XML body returned by the API
obj = objectify.fromstring(response)
print(obj.MainTitles.TitleDetail)
# This will fail to find the element because you need to specify the namespace
print(obj.MainTitles['{http://www.isan.org/schema/v1.11/common/title}TitleDetail'])
# Or something like that, I couldn't get it to work, and I'd much rather use attributes and not specify the namespace

So then I tried generateDS to create some Python class definitions for me. I've lost the error messages that this attempt gave me, but I couldn't get it to work. It would generate a module for each XSD that I gave it, but it wouldn't parse the example XML.

I'm now trying pyxb, and this seems much nicer so far. It generates nicer definitions than generateDS (splitting them into multiple, reusable modules), but it won't parse the XML:

from models import serial
obj = serial.CreateFromDocument(response)

Traceback (most recent call last):
  ...
  File "/vagrant/isan/isan.py", line 58, in lookup
    return serial.CreateFromDocument(resp.content)
  File "/vagrant/isan/models/serial.py", line 69, in CreateFromDocument
    instance = handler.rootObject()
  File "/home/vagrant/venv/lib/python2.7/site-packages/pyxb/binding/saxer.py", line 285, in rootObject
    raise pyxb.UnrecognizedDOMRootNodeError(self.__rootObject)
UnrecognizedDOMRootNodeError: <pyxb.utils.saxdom.Element object at 0x2b53664dc850>

The unrecognised node is the <serial:serialHeaderType> node from the example. Looking at the pyxb source, it seems that this error comes about "if the top-level element got processed as a DOM instance", but I don't know what this means or how to prevent it.

I've run out of steam trying to explore this, and I don't know what to do next.

I have had a lot of luck parsing XML into Python using Beautiful Soup. It is extremely straightforward, and the project provides pretty strong documentation. Check it out here: http://www.crummy.com/software/BeautifulSoup/ and http://www.crummy.com/software/BeautifulSoup/bs4/doc/

UnrecognizedDOMRootNodeError indicates that PyXB could not locate the element in a namespace for which it has bindings registered. In your case it fails on the first element, which is {http://www.isan.org/schema/v1.21/common/serial}serialHeaderType.
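To see the expanded name that a binding would have to match, the root tag can be inspected with the standard library. This is only a minimal sketch: `SAMPLE` is a trimmed-down stand-in for the API response, and `split_expanded_name` is a hypothetical helper, not part of PyXB.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the API response (illustrative only).
SAMPLE = (
    '<serial:serialHeaderType '
    'xmlns:serial="http://www.isan.org/schema/v1.21/common/serial">'
    '<serial:TotalEpisodes>11</serial:TotalEpisodes>'
    '</serial:serialHeaderType>'
)


def split_expanded_name(tag):
    """Split an ElementTree tag in '{namespace}local' form into its parts."""
    if tag.startswith('{'):
        namespace, local = tag[1:].split('}', 1)
        return namespace, local
    return None, tag


root = ET.fromstring(SAMPLE)
namespace, local = split_expanded_name(root.tag)
print(namespace)  # namespace URI bound to the "serial" prefix
print(local)      # local name of the root element
```

The namespace and local name printed here are the pair that must correspond to a registered element binding for the document to be accepted.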

The schema for that namespace defines a complexType named SerialHeaderType but does not define an element with the name serialHeaderType. In fact, it defines no top-level elements. So PyXB can't recognize it, and the XML does not validate.

Either there's an additional schema for the namespace, which you'll need to locate, that provides elements, or the message really doesn't validate. That may be because somebody's expecting an implicit mapping from a complex type to an element with that type, or because it's a fragment that would normally be found within some other element where that QName is a member element name.
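One way to check whether a given schema file declares any top-level elements is to list the direct children of its xs:schema root. The sketch below assumes a hypothetical miniature schema (`MINI_XSD`) that mimics the situation described here; it is not the real ISAN serial.xsd, and `top_level_names` is an illustrative helper.

```python
import xml.etree.ElementTree as ET

XS = 'http://www.w3.org/2001/XMLSchema'

# Hypothetical miniature schema: a named complexType but no
# top-level element declaration (not the real ISAN schema).
MINI_XSD = '''<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.isan.org/schema/v1.21/common/serial">
  <xs:complexType name="SerialHeaderType">
    <xs:sequence>
      <xs:element name="TotalEpisodes" type="xs:int"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>'''


def top_level_names(xsd_text):
    """Return (global element names, global complexType names) of a schema."""
    schema = ET.fromstring(xsd_text)
    # findall() with a bare tag matches direct children only, so nested
    # (local) element declarations are not counted as top-level ones.
    elements = [e.get('name') for e in schema.findall('{%s}element' % XS)]
    types = [t.get('name') for t in schema.findall('{%s}complexType' % XS)]
    return elements, types


elements, types = top_level_names(MINI_XSD)
print('top-level elements:', elements)
print('complex types:', types)
```

An empty list of top-level elements means no document root can be validated against that schema on its own, which matches the behaviour described above.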

UPDATE: You can hand-craft an element in that namespace by adding the following to the generated bindings in serial.py:

serialHeaderType = pyxb.binding.basis.element(pyxb.namespace.ExpandedName(Namespace, 'serialHeaderType'), SerialHeaderType)
Namespace.addCategoryObject('elementBinding', serialHeaderType.name().localName(), serialHeaderType)

If you do that, you won't get the UnrecognizedDOMRootNodeError, but you will get an IncompleteElementContentError at:

<common:status>
    <common:DataType>SERIAL_HEADER_TYPE</common:DataType>
    <common:ISAN root="0000-0002-3B9F"/>
    <common:WorkStatus>ACTIVE</common:WorkStatus>
</common:status>

which provides the following details:

The containing element {http://www.isan.org/schema/v1.11/common/common}status is defined at common.xsd[243:3].
The containing element type {http://www.isan.org/schema/v1.11/common/common}StatusType is defined at common.xsd[289:1]
The {http://www.isan.org/schema/v1.11/common/common}StatusType automaton is not in an accepting state.
Any accepted content has been stored in instance
The following element and wildcard content would be accepted:
    An element {http://www.isan.org/schema/v1.11/common/common}ActiveISAN per common.xsd[316:3]
    An element {http://www.isan.org/schema/v1.11/common/common}MatchingISANs per common.xsd[317:3]
    An element {http://www.isan.org/schema/v1.11/common/common}Description per common.xsd[318:3]
No content remains unconsumed

Reviewing the schema confirms that, at a minimum, a {http://www.isan.org/schema/v1.11/common/common}Description element is missing but required.

So it seems these documents are not meant to be validated, and PyXB is probably the wrong technology to use.
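If validation is off the table, a non-validating approach may be enough. The following is a rough sketch, not something taken from PyXB or lxml: it wraps standard-library ElementTree elements in a hypothetical `Node` class that looks up children by local name and ignores namespaces. `SAMPLE` is a shortened version of the response from the question. Note that the helper names `text`, `attr` and `all` would shadow child elements with the same names.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' part from an ElementTree tag."""
    return tag.rsplit('}', 1)[-1]


class Node(object):
    """Attribute-style view of an element that ignores namespaces (hypothetical helper)."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Only called when normal lookup fails; never resolve private names
        # here, to avoid recursion if _element is not set yet.
        if name.startswith('_'):
            raise AttributeError(name)
        for child in self._element:
            if local_name(child.tag) == name:
                return Node(child)  # first matching child wins
        raise AttributeError(name)

    def all(self, name):
        """Return every direct child whose local name matches."""
        return [Node(c) for c in self._element if local_name(c.tag) == name]

    def attr(self, name):
        """Return an XML attribute value, or None if absent."""
        return self._element.get(name)

    @property
    def text(self):
        return (self._element.text or '').strip()


# Shortened copy of the example response from the question.
SAMPLE = '''<serial:serialHeaderType
    xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
    xmlns:title="http://www.isan.org/schema/v1.11/common/title"
    xmlns:participant="http://www.isan.org/schema/v1.11/common/participant">
  <serial:SerialHeaderId root="0000-0002-3B9F"/>
  <serial:MainTitles>
    <title:TitleDetail>
      <title:Title>Braquo</title:Title>
    </title:TitleDetail>
  </serial:MainTitles>
  <serial:MainParticipantList>
    <participant:Participant>
      <participant:LastName>Schoendoerffer</participant:LastName>
    </participant:Participant>
    <participant:Participant>
      <participant:LastName>Rocher</participant:LastName>
    </participant:Participant>
  </serial:MainParticipantList>
</serial:serialHeaderType>'''

obj = Node(ET.fromstring(SAMPLE))
print(obj.MainTitles.TitleDetail.Title.text)
print(obj.SerialHeaderId.attr('root'))
print([p.LastName.text for p in obj.MainParticipantList.all('Participant')])
```

The trade-off is that nothing is checked against the XSD: values stay as strings, and two children with the same local name in different namespaces would be indistinguishable through this wrapper.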
