How can I make my code more readable and DRYer when working with XML namespaces in Python?

Python's built-in xml.etree package supports parsing XML files with namespaces, but namespace prefixes get expanded to the full URI enclosed in curly braces. So in the example file in the official documentation:

<actors xmlns:fictional="http://characters.example.com"
    xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    ...

The actor tag gets expanded to {http://people.example.com}actor and fictional:character to {http://characters.example.com}character.

I can see how this makes everything very explicit and reduces ambiguity (the file could use the same namespace with a different prefix, etc.), but it is very cumbersome to work with. The Element.find() method and others accept a dict mapping prefixes to namespace URIs, so I can still do element.find('fictional:character', nsmap), but to my knowledge there is nothing similar for tag attributes. This leads to annoying stuff like element.attrib['{{{}}}attrname'.format(nsmap['prefix'])].
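To make the problem concrete, here is a minimal sketch using only the standard library. The fictional:rating attribute, the people prefix, and the expand() helper are assumptions added for illustration; they are not part of the documentation sample or of ElementTree.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the documentation sample; the fictional:rating
# attribute is a hypothetical addition for this sketch.
XML = """<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor fictional:rating="5">
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
    </actor>
</actors>"""

# Prefix-to-URI map defined in code; "people" is our own name for the
# document's default namespace.
NSMAP = {
    "people": "http://people.example.com",
    "fictional": "http://characters.example.com",
}

def expand(name, nsmap=NSMAP):
    """Hypothetical helper: turn 'prefix:local' into '{uri}local'."""
    if name.startswith("{") or ":" not in name:
        return name  # already expanded, or not prefixed at all
    prefix, local = name.split(":", 1)
    return "{%s}%s" % (nsmap[prefix], local)

root = ET.fromstring(XML)
actor = root.find("people:actor", NSMAP)       # find() accepts the map
tag = actor.tag                                # stored in expanded form
character = actor.find("fictional:character", NSMAP).text
rating = actor.get(expand("fictional:rating")) # attributes need manual expansion
print(tag, character, rating)
```

The helper only hides the string formatting. The map still has to be passed or referenced on every lookup, which is the repetition the question is about.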

The popular lxml package provides the same API with a few extensions, one of which is an nsmap property on elements, which they inherit from the document. However, none of the methods seem to actually make use of it, so I still have to do element.find('fictional:character', element.nsmap), which is unnecessarily repetitive to type out every time. It also still doesn't work with attributes.

Luckily lxml supports subclassing ElementBase, so I just made a subclass with a p (for prefix) property that exposes the same API but automatically resolves namespace prefixes using the element's nsmap (Edit: it is likely best to assign a custom nsmap defined in code). So I just do element.p.find('fictional:character') or element.p.attrib['prefix:attrname'], which is much less repetitive and, I think, way more readable.
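The question does not show the subclass itself, so the following is only a guess at what such a p property could look like, built on lxml's etree.ElementBase and a custom parser lookup. PrefixView, PrefixElement, PREFIXES, expand(), and the fictional:rating attribute are hypothetical names invented for this sketch, and it exposes get() rather than a full attrib mapping to stay short.

```python
from lxml import etree

# Prefix-to-URI map defined in code, as the edit note in the question suggests.
PREFIXES = {
    "people": "http://people.example.com",
    "fictional": "http://characters.example.com",
}

def expand(name):
    """Turn 'prefix:local' into '{uri}local'; leave other names untouched."""
    if name.startswith("{") or ":" not in name:
        return name
    prefix, local = name.split(":", 1)
    return "{%s}%s" % (PREFIXES[prefix], local)

class PrefixView:
    """Hypothetical wrapper that resolves prefixes for one element."""

    def __init__(self, element):
        self._element = element

    def find(self, path):
        return self._element.find(path, namespaces=PREFIXES)

    def findall(self, path):
        return self._element.findall(path, namespaces=PREFIXES)

    def get(self, name, default=None):
        return self._element.get(expand(name), default)

class PrefixElement(etree.ElementBase):
    """Custom element class; no __init__, as lxml recommends for subclasses."""

    @property
    def p(self):
        return PrefixView(self)

# Tell the parser to build PrefixElement instances for every element.
parser = etree.XMLParser()
parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=PrefixElement))

XML = """<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor fictional:rating="5">
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
</actors>"""

root = etree.fromstring(XML, parser)
actor = root.p.find("people:actor")
characters = [c.text for c in actor.p.findall("fictional:character")]
rating = actor.p.get("fictional:rating")
print(characters, rating)
```

Using a map defined in code rather than element.nsmap keeps the prefixes stable even if a document declares the same namespaces under different prefixes.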

I just feel like I'm missing something, though; it feels like this should already be a feature of lxml, if not of the builtin etree package. Am I somehow doing this wrong?

Is it possible to get rid of the namespace mapping?

Do you need to pass it as a parameter into each function call? One option would be to store the prefixes to use in a property on the XML document.

That's fine until you pass the XML document into a 3rd party function. That function wants to use prefixes as well, so it sets the property to something else, because it does not know what you set it to.

By the time you get the XML document back, it has been modified, so your prefixes don't work any more.

All in all: no, it's not safe, and therefore the current design is good as it is.

This design does not only exist in Python, it also exists in .NET. The SelectNodes() [MSDN] method can be used if you don't need prefixes. But as soon as a prefix is present, it throws an exception. Therefore, you have to use the overloaded SelectNodes() [MSDN], which takes an XmlNamespaceManager as a parameter.

XPath as a solution

I suggest learning XPath (lxml specific link), where you can use prefixes. Since this may be version specific, let me say I ran this code with Python 2.7 x64 and lxml 3.6.0 (I'm not too familiar with Python, so this may not be the cleanest code, but it serves well as a demonstration):

from lxml import etree as ET
from pprint import pprint
data = """<?xml version="1.0"?>
<d:data xmlns:d="dns">
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor d:name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</d:data>"""
root = ET.fromstring(data)
my_namespaces = {'x':'dns'}
xp=root.xpath("/x:data/country/neighbor/@x:name", namespaces=my_namespaces)
pprint(xp)
xp=root.xpath("//@x:name", namespaces=my_namespaces)
pprint(xp)
xp=root.xpath("/x:data/country/neighbor/@name", namespaces=my_namespaces)
pprint(xp)

The output is

C:\Python27x64\python.exe E:/xpath.py
['Austria']
['Austria']
['Switzerland', 'Malaysia']

Process finished with exit code 0

Note how well XPath resolved the mapping from the x prefix in the namespace table to the d prefix in the XML document.

This eliminates the really awful-to-read element.attrib['{{{}}}attrname'.format(nsmap['prefix'])].
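If passing namespaces= on every call is still too repetitive, lxml's etree.XPathEvaluator can bind the prefix table once. This is a sketch on a trimmed copy of the data above; the variable names are my own.

```python
from lxml import etree

# Trimmed copy of the sample document used in this answer.
DATA = """<d:data xmlns:d="dns">
    <country name="Liechtenstein">
        <neighbor d:name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <neighbor name="Malaysia" direction="N"/>
    </country>
</d:data>"""

root = etree.fromstring(DATA)

# Bind the prefix table once; every later call reuses it.
xp = etree.XPathEvaluator(root, namespaces={"x": "dns"})

namespaced = xp("//@x:name")                 # attributes in the "dns" namespace
plain = xp("/x:data/country/neighbor/@name") # attributes without a namespace
print(namespaced, plain)
```

The evaluator is tied to the element it was created for, so the prefix table stays local to your own code and cannot be overwritten by a 3rd party function.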

Short (and incomplete) XPath introduction

To select an element, write /element, optionally with a prefix.

xp=root.xpath("/x:data", namespaces=my_namespaces)

To select an attribute, write /@attribute, optionally with a prefix.

#See example above

To navigate down, concatenate several elements. Use // if you don't know the items in between. To move up, use /.. (the parent step). An attribute must be the last step unless it is followed by /.. to move back up to its element.

xp=root.xpath("/x:data/country/neighbor/@x:name/..", namespaces=my_namespaces)

To use a condition, write it in square brackets. /element[@attribute] means: select all elements that have this attribute. /element[@attribute='value'] means: select all elements that have this attribute with a specific value. /element[./subelement] means: select all elements that have a subelement with a specific name. Prefixes can optionally be used anywhere.

xp=root.xpath("/x:data/country[./neighbor[@name='Switzerland']]/@name", namespaces=my_namespaces)

There's much more to discover, like text(), various ways of selecting siblings, and even functions.
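A small sketch of those three features on a trimmed copy of the sample data, using lxml's xpath() method; the variable names are my own.

```python
from lxml import etree

# Trimmed copy of the sample document used in this answer.
DATA = """<d:data xmlns:d="dns">
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <neighbor d:name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</d:data>"""

root = etree.fromstring(DATA)
ns = {"x": "dns"}

# text() selects the text nodes of the matched elements.
ranks = root.xpath("/x:data/country/rank/text()", namespaces=ns)

# following-sibling:: walks to later siblings of each rank element.
years = root.xpath("//rank/following-sibling::year/text()", namespaces=ns)

# Functions such as count() return a number (a float in lxml).
neighbor_count = root.xpath("count(//neighbor)", namespaces=ns)

print(ranks, years, neighbor_count)
```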

About the 'why'

The original question title was

Why does working with XML namespaces seem so difficult in Python?

Some users simply don't understand the concept. If the user does understand it, maybe the developer didn't. Or perhaps it was just one option out of many, and the decision was made to go in that direction. The only person who could answer the "why" part in such a case would be the developer himself.

If you need to avoid repeating nsmap parameters when using ElementTree in Python, consider transforming your XML with XSLT to remove namespaces and return local element names. Python's lxml can run XSLT 1.0 scripts.

For background, XSLT is a special-purpose declarative language (in the same family as XPath, but operating on whole documents) used specifically to transform XML sources. In fact, XSLT scripts are themselves well-formed XML documents! And removing namespaces is a task end users often need.

Consider the following, with the XML and XSLT embedded as strings (each could also be parsed from a file). Once transformed, you can run .findall(), .iter(), and .xpath() on the new tree object without defining namespace prefixes:

Script

import lxml.etree as ET

# LOAD XML AND XSL STRINGS
xmlStr = '''
         <actors xmlns:fictional="http://characters.example.com"
                 xmlns="http://people.example.com">
             <actor>
                 <name>John Cleese</name>
                 <fictional:character>Lancelot</fictional:character>
                 <fictional:character>Archie Leach</fictional:character>
             </actor>
         </actors>
         '''
dom = ET.fromstring(xmlStr)

xslStr = '''
        <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
        <xsl:strip-space elements="*"/>

          <xsl:template match="@*|node()">
            <xsl:element name="{local-name()}">
              <xsl:apply-templates select="@*|node()"/>
            </xsl:element>
          </xsl:template>

          <xsl:template match="text()">
            <xsl:copy/>
          </xsl:template>

        </xsl:transform>
        '''
xslt = ET.fromstring(xslStr)

# TRANSFORM XML
transform = ET.XSLT(xslt)
newdom = transform(dom)

# OUTPUT AND PARSE
print(str(newdom))

for i in newdom.findall('//character'):
    print(i.text)

for i in newdom.iter('character'):
    print(i.text)

for i in newdom.xpath('//character'):
    print(i.text)

Output

<?xml version="1.0"?>
<actors>
  <actor>
    <name>John Cleese</name>
    <character>Lancelot</character>
    <character>Archie Leach</character>
  </actor>
</actors>

Lancelot
Archie Leach
Lancelot
Archie Leach
Lancelot
Archie Leach
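For comparison, the same namespace stripping can be sketched directly in Python without XSLT. This swaps in a different technique from the transform above: it rewrites each tag and attribute name in place using lxml's etree.QName and then calls etree.cleanup_namespaces. The strip_namespaces() function is a hypothetical helper, and it assumes that no two names collide once their namespaces are dropped.

```python
from lxml import etree

XML = """<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
</actors>"""

def strip_namespaces(root):
    """Hypothetical helper: rewrite tags and attributes to local names in place."""
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments and processing instructions
        element.tag = etree.QName(element).localname
        for name in list(element.attrib):
            local = etree.QName(name).localname
            if local != name:
                # Assumes no unprefixed attribute with the same local name exists.
                element.attrib[local] = element.attrib.pop(name)
    # Drop namespace declarations that are no longer used.
    etree.cleanup_namespaces(root)
    return root

root = strip_namespaces(etree.fromstring(XML))
names = [c.text for c in root.findall(".//character")]
print(root.tag, names)
```

Unlike the XSLT version, this modifies the parsed tree itself, so keep a copy if the namespaced original is still needed.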

