Good Python XML parser to work with namespace-heavy documents

Python's ElementTree seems unusable with namespaces. What are my alternatives? BeautifulSoup is pretty rubbish with namespaces too. I don't want to strip them out.

Examples of how a particular Python library gets namespaced elements and their collections are all +1.

Edit: Could you provide code to deal with this real-world use case using your library of choice?

How would you go about getting the strings 'Line Break' and '2.6' and the list ['PYTHON', 'XML', 'XML-NAMESPACES'] from the following document?

<?xml version="1.0" encoding="UTF-8"?>
<zs:searchRetrieveResponse
    xmlns="http://unilexicon.com/vocabularies/"
    xmlns:zs="http://www.loc.gov/zing/srw/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:lom="http://ltsc.ieee.org/xsd/LOM">
    <zs:records>
        <zs:record>
            <zs:recordData>
                <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema">
                    <name>Line Break</name>
                    <dc:title>Processing XML namespaces using Python</dc:title>
                    <dc:description>How to get contents string from an element,
                        how to get a collection in a list...</dc:description>
                    <lom:metaMetadata>
                        <lom:identifier>
                            <lom:catalog>Python</lom:catalog>
                            <lom:entry>2.6</lom:entry>
                        </lom:identifier>
                    </lom:metaMetadata>
                    <lom:classification>
                        <lom:taxonPath>
                            <lom:taxon>
                                <lom:id>PYTHON</lom:id>
                            </lom:taxon>
                        </lom:taxonPath>
                    </lom:classification>
                    <lom:classification>
                        <lom:taxonPath>
                            <lom:taxon>
                                <lom:id>XML</lom:id>
                            </lom:taxon>
                        </lom:taxonPath>
                    </lom:classification>
                    <lom:classification>
                        <lom:taxonPath>
                            <lom:taxon>
                                <lom:id>XML-NAMESPACES</lom:id>
                            </lom:taxon>
                        </lom:taxonPath>
                    </lom:classification>
                </srw_dc:dc>
            </zs:recordData>
        </zs:record>
        <!-- ... more records ... -->
    </zs:records>
</zs:searchRetrieveResponse>

lxml is namespace-aware.

>>> from lxml import etree
>>> et = etree.XML("""<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz /></bar></root>""")
>>> etree.tostring(et, encoding=str) # encoding=str only needed in Python 3, to avoid getting bytes
'<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz/></bar></root>'
>>> et.xpath("f:bar", namespaces={"b":"bar", "f": "foo"})
[<Element {foo}bar at ...>]
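
As a complement to the XPath call above, here is a minimal sketch of two other ways to name a namespaced element: Clark notation ({uri}tag) and a prefix map passed to find. The prefixes f and s are arbitrary choices for this sketch; only the URIs have to match the document. The import fallback is an assumption made for convenience: the standard-library xml.etree.ElementTree accepts the same find(..., namespaces=...) calls, so the sketch should also run where lxml is not installed.

```python
try:
    from lxml import etree  # the library this answer recommends
except ImportError:
    # Assumption: the stdlib module is adequate for the find() calls used here.
    import xml.etree.ElementTree as etree

root = etree.XML(
    b'<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz/></bar></root>'
)

# Prefixes are local to the query; they need not match the document's prefixes.
ns = {"f": "foo", "s": "bar"}

# An unprefixed name means "no namespace", so it misses the
# default-namespace <bar> element.
unqualified = root.find("bar")

# Clark notation spells the namespace URI out in braces.
by_clark = root.find("{foo}bar")

# A prefix map reaches the same element with shorter paths.
by_prefix = root.find("f:bar", namespaces=ns)
baz = root.find("f:bar/s:baz", namespaces=ns)

print(unqualified)    # None
print(by_clark.tag)   # {foo}bar
print(by_prefix.tag)  # {foo}bar
print(baz.tag)        # {bar}baz
```

The point to notice is that an element in a default namespace still needs a prefix (or Clark notation) in the query, even though it has none in the document.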

Edit: On your example:

from lxml import etree

# the b prefix is needed in Python 3, where lxml rejects str input that
# carries an encoding declaration ("Unicode strings with encoding
# declaration are not supported."); it can be removed in Python 2
et = etree.XML(b"""...""")

ns = {
    'lom': 'http://ltsc.ieee.org/xsd/LOM',
    'zs': 'http://www.loc.gov/zing/srw/',
    'dc': 'http://purl.org/dc/elements/1.1/',
    'voc': 'http://unilexicon.com/vocabularies/',  # the document's default namespace
    'srw_dc': 'info:srw/schema/1/dc-schema'
}

# per the lxml docs, .xpath always returns a list when querying for elements;
# .find returns a single element, but only supports a subset of XPath
record = et.xpath("zs:records/zs:record", namespaces=ns)[0]
# in this example we know there is only one record;
# otherwise, apply the following to every element the query above returns

# the leading "." keeps the search inside this record
# (a bare "//" would search the whole document)
name = record.xpath(".//voc:name", namespaces=ns)[0].text
print("name:", name)

lom_entry = record.xpath("zs:recordData/srw_dc:dc/"
                         "lom:metaMetadata/lom:identifier/"
                         "lom:entry",
                         namespaces=ns)[0].text

print('lom_entry:', lom_entry)

lom_ids = [el.text for el in
           record.xpath("zs:recordData/srw_dc:dc/"
                        "lom:classification/lom:taxonPath/"
                        "lom:taxon/lom:id",
                        namespaces=ns)]

print("lom_ids:", lom_ids)

Output:

name: Line Break
lom_entry: 2.6
lom_ids: ['PYTHON', 'XML', 'XML-NAMESPACES']
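
For responses with more than one record, the same lookups can be applied per record. The following is a rough sketch under stated assumptions, not part of the original answer: extract_records and the NS prefix map are hypothetical names, the SAMPLE document is a shortened version of the question's XML with an invented second record, and the import fallback assumes the standard-library ElementTree is adequate for the find, findall and findtext calls used.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib ElementTree is adequate for find/findall/findtext.
    import xml.etree.ElementTree as etree

NS = {
    "zs": "http://www.loc.gov/zing/srw/",
    "srw_dc": "info:srw/schema/1/dc-schema",
    "lom": "http://ltsc.ieee.org/xsd/LOM",
    "voc": "http://unilexicon.com/vocabularies/",  # the default namespace
}


def extract_records(xml_bytes):
    """Return one dict per zs:record (hypothetical helper for illustration)."""
    root = etree.XML(xml_bytes)
    results = []
    for record in root.findall("zs:records/zs:record", namespaces=NS):
        dc = record.find("zs:recordData/srw_dc:dc", namespaces=NS)
        if dc is None:
            continue  # skip records without the expected payload
        ids = dc.findall(
            "lom:classification/lom:taxonPath/lom:taxon/lom:id", namespaces=NS
        )
        results.append({
            # findtext returns None when the element is missing
            "name": dc.findtext("voc:name", namespaces=NS),
            "entry": dc.findtext(
                "lom:metaMetadata/lom:identifier/lom:entry", namespaces=NS
            ),
            "ids": [el.text for el in ids],
        })
    return results


# Shortened sample; the second record is invented for illustration.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<zs:searchRetrieveResponse
    xmlns="http://unilexicon.com/vocabularies/"
    xmlns:zs="http://www.loc.gov/zing/srw/"
    xmlns:lom="http://ltsc.ieee.org/xsd/LOM">
  <zs:records>
    <zs:record><zs:recordData>
      <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema">
        <name>Line Break</name>
        <lom:metaMetadata><lom:identifier>
          <lom:entry>2.6</lom:entry>
        </lom:identifier></lom:metaMetadata>
        <lom:classification><lom:taxonPath><lom:taxon>
          <lom:id>PYTHON</lom:id>
        </lom:taxon></lom:taxonPath></lom:classification>
        <lom:classification><lom:taxonPath><lom:taxon>
          <lom:id>XML</lom:id>
        </lom:taxon></lom:taxonPath></lom:classification>
      </srw_dc:dc>
    </zs:recordData></zs:record>
    <zs:record><zs:recordData>
      <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema">
        <name>Second</name>
      </srw_dc:dc>
    </zs:recordData></zs:record>
  </zs:records>
</zs:searchRetrieveResponse>"""

records = extract_records(SAMPLE)
for rec in records:
    print(rec)
```

Using findtext rather than indexing into a result list means a record that lacks an element yields None instead of raising IndexError, which may or may not be what you want for your data.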

libxml (http://xmlsoft.org/) is the best and fastest library for XML parsing. There are Python implementations of it.
