Good Python XML parser for namespace-heavy documents

Question

Python's ElementTree seems unusable with namespaces. What are my alternatives? BeautifulSoup is pretty rubbish with namespaces too. I don't want to strip them out.

Examples of how a particular Python library gets namespaced elements and their collections are all +1.

Edit: Could you provide code to deal with this real-world use case using your library of choice?

How would you go about getting the strings 'Line Break' and '2.6' and the list ['PYTHON', 'XML', 'XML-NAMESPACES'] from this document?

<?xml version="1.0" encoding="UTF-8"?>
<zs:searchRetrieveResponse
    xmlns="http://unilexicon.com/vocabularies/"
    xmlns:zs="http://www.loc.gov/zing/srw/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:lom="http://ltsc.ieee.org/xsd/LOM">
    <zs:records>
        <zs:record>
            <zs:recordData>
                <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema">
                    <name>Line Break</name>
                    <dc:title>Processing XML namespaces using Python</dc:title>
                    <dc:description>How to get contents string from an element,
                        how to get a collection in a list...</dc:description>
                    <lom:metaMetadata>
                        <lom:identifier>
                            <lom:catalog>Python</lom:catalog>
                            <lom:entry>2.6</lom:entry>
                        </lom:identifier>
                    </lom:metaMetadata>
                    <lom:classification>
                        <lom:taxonPath>
                            <lom:taxon>
                                <lom:id>PYTHON</lom:id>
                            </lom:taxon>
                        </lom:taxonPath>
                    </lom:classification>
                    <lom:classification>
                        <lom:taxonPath>
                            <lom:taxon>
                                <lom:id>XML</lom:id>
                            </lom:taxon>
                        </lom:taxonPath>
                    </lom:classification>
                    <lom:classification>
                        <lom:taxonPath>
                            <lom:taxon>
                                <lom:id>XML-NAMESPACES</lom:id>
                            </lom:taxon>
                        </lom:taxonPath>
                    </lom:classification>
                </srw_dc:dc>
            </zs:recordData>
        </zs:record>
        <!-- ... more records ... -->
    </zs:records>
</zs:searchRetrieveResponse>

Answer 1

lxml is namespace-aware.

>>> from lxml import etree
>>> et = etree.XML("""<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz /></bar></root>""")
>>> etree.tostring(et, encoding=str) # encoding=str only needed in Python 3, to avoid getting bytes
'<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz/></bar></root>'
>>> et.xpath("f:bar", namespaces={"b":"bar", "f": "foo"})
[<Element {foo}bar at ...>]

Edit: On your example:

from lxml import etree

# The b prefix is needed in Python 3, where lxml rejects str input that
# carries an encoding declaration:
# "Unicode strings with encoding declaration are not supported."
# It can be dropped in Python 2.
et = etree.XML(b"""...""")

ns = {
    'lom': 'http://ltsc.ieee.org/xsd/LOM',
    'zs': 'http://www.loc.gov/zing/srw/',
    'dc': 'http://purl.org/dc/elements/1.1/',
    # prefix chosen here for the document's default (unprefixed) namespace
    'voc': 'http://unilexicon.com/vocabularies/',
    'srw_dc': 'info:srw/schema/1/dc-schema'
}

# According to the docs, .xpath() always returns a list when querying for elements.
# .find() returns a single element, but only supports a subset of XPath.
record = et.xpath("zs:records/zs:record", namespaces=ns)[0]
# In this example we know there is only one record;
# otherwise, apply the following to every element the query above returns.

# ".//" keeps the search inside this record; a bare "//" would search the whole document
name = record.xpath(".//voc:name", namespaces=ns)[0].text
print("name:", name)

lom_entry = record.xpath("zs:recordData/srw_dc:dc/"
                         "lom:metaMetadata/lom:identifier/"
                         "lom:entry",
                         namespaces=ns)[0].text

print('lom_entry:', lom_entry)

lom_ids = [el.text for el in
           record.xpath("zs:recordData/srw_dc:dc/"
                        "lom:classification/lom:taxonPath/"
                        "lom:taxon/lom:id",
                        namespaces=ns)]

print("lom_ids:", lom_ids)

Output:

name: Line Break
lom_entry: 2.6
lom_ids: ['PYTHON', 'XML', 'XML-NAMESPACES']
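
For comparison, the same prefix-to-URI mapping can be passed to find(), findtext() and iterfind(), which lxml.etree shares with the standard library's ElementTree API. The sketch below is not part of the original answer: it uses xml.etree.ElementTree so that it runs without installing anything, it works on a trimmed copy of the question's document, and the names XML_DOC and NS as well as the 'voc' prefix are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question (assumption: dc:title and
# dc:description are left out because they are not needed here).
XML_DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<zs:searchRetrieveResponse
    xmlns="http://unilexicon.com/vocabularies/"
    xmlns:zs="http://www.loc.gov/zing/srw/"
    xmlns:lom="http://ltsc.ieee.org/xsd/LOM">
  <zs:records><zs:record><zs:recordData>
    <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema">
      <name>Line Break</name>
      <lom:metaMetadata><lom:identifier>
        <lom:catalog>Python</lom:catalog>
        <lom:entry>2.6</lom:entry>
      </lom:identifier></lom:metaMetadata>
      <lom:classification><lom:taxonPath><lom:taxon>
        <lom:id>PYTHON</lom:id>
      </lom:taxon></lom:taxonPath></lom:classification>
      <lom:classification><lom:taxonPath><lom:taxon>
        <lom:id>XML</lom:id>
      </lom:taxon></lom:taxonPath></lom:classification>
      <lom:classification><lom:taxonPath><lom:taxon>
        <lom:id>XML-NAMESPACES</lom:id>
      </lom:taxon></lom:taxonPath></lom:classification>
    </srw_dc:dc>
  </zs:recordData></zs:record></zs:records>
</zs:searchRetrieveResponse>"""

# The prefixes are local to this mapping; only the URIs have to match the document.
# 'voc' is a made-up prefix for the document's default namespace.
NS = {
    "zs": "http://www.loc.gov/zing/srw/",
    "lom": "http://ltsc.ieee.org/xsd/LOM",
    "voc": "http://unilexicon.com/vocabularies/",
}

root = ET.fromstring(XML_DOC)
record = root.find("zs:records/zs:record", NS)

# findtext() returns the text of the first match (or None if nothing matches).
name = record.findtext(".//voc:name", namespaces=NS)
lom_entry = record.findtext(".//lom:identifier/lom:entry", namespaces=NS)

# iterfind() yields every match in document order.
lom_ids = [el.text for el in record.iterfind(".//lom:taxon/lom:id", NS)]

# Parsed tags carry the namespace in Clark notation: {uri}localname.
record_tag = record.tag

print("name:", name)
print("lom_entry:", lom_entry)
print("lom_ids:", lom_ids)
print("record tag:", record_tag)
```

Unlike .xpath(), these methods only understand the limited ElementPath subset of XPath, which is enough for plain child and descendant lookups like the ones above.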

Answer 2

How about pyexpat:

http://docs.python.org/library/pyexpat.html
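
The answer gives no code, so here is a minimal sketch of how namespace handling looks with the standard library's xml.parsers.expat module. It assumes namespace processing is switched on by passing namespace_separator to ParserCreate; the sample document is the small one from Answer 1, and the handler name start_element and the list seen are illustrative.

```python
from xml.parsers import expat

# Small sample document reused from Answer 1.
XML_DOC = b'<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz/></bar></root>'

seen = []  # (namespace URI, local name) for each start tag, in document order


def start_element(name, attrs):
    # With namespace processing enabled, expat reports a namespaced element
    # as "<uri><separator><localname>"; an element without a namespace
    # arrives as the bare local name, so uri ends up as "".
    uri, _, local = name.rpartition(" ")
    seen.append((uri, local))


# A space is a safe separator because it cannot occur in a valid URI.
parser = expat.ParserCreate(namespace_separator=" ")
parser.StartElementHandler = start_element
parser.Parse(XML_DOC, True)  # True marks this as the final chunk of input

for uri, local in seen:
    print(uri, local)
```

expat is an event-driven parser, so collecting values such as the lom:id list means tracking state across handler calls yourself; there is no tree to query afterwards.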

Answer 3

libxml (http://xmlsoft.org/) is the best and fastest library for XML parsing. There are bindings for Python.
