
python - lxml: enforcing a specific order for attributes

I have an XML writing script that outputs XML for a specific 3rd party tool.

I've used the original XML as a template to make sure that I'm building all the correct elements, but the final XML does not appear like the original.

I write the attributes in the same order, but lxml is writing them in its own order.

I'm not sure, but I suspect that the 3rd-party tool expects attributes to appear in a specific order. I'd like to resolve this so I can tell whether it's the attribute order that is making it fail, or something else.

Source element:

<FileFormat ID="1" Name="Development Signature" PUID="dev/1" Version="1.0" MIMEType="text/x-test-signature"> 

My source script:

sig.fileformat = etree.SubElement(sig.fileformats, "FileFormat", ID = str(db.ID), Name = db.name, PUID="fileSig/{}".format(str(db.ID)), Version = "", MIMEType = "")

My resultant XML:

<FileFormat MIMEType="" PUID="fileSig/19" Version="" Name="Printer Info File" ID="19">

Is there a way of constraining the order they are written?

OrderedDict of attributes

As of lxml 3.3.3 (perhaps also in earlier versions) you can pass an OrderedDict of attributes to the lxml.etree.(Sub)Element constructor and the order will be preserved when using lxml.etree.tostring(root) :

sig.fileformat = etree.SubElement(sig.fileformats, "FileFormat", OrderedDict([("ID",str(db.ID)), ("Name",db.name), ("PUID","fileSig/{}".format(str(db.ID))), ("Version",""), ("MIMEType","")]))

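Note that the ElementTree API (xml.etree.ElementTree) does not preserve attribute order even if you provide an OrderedDict to the xml.etree.ElementTree.(Sub)Element constructor! (This describes the Python versions current when this was written; since Python 3.8, xml.etree.ElementTree serializes attributes in the order they were set.)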

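UPDATE: Also note that using the **extra parameter of the lxml.etree.(Sub)Element constructor for specifying attributes does not preserve attribute order. Before Python 3.6, keyword arguments reached the callee as an unordered dict, so their source order was lost before lxml ever saw them; on newer Python and lxml versions the result of the second call below may differ: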

>>> from lxml.etree import Element, tostring
>>> from collections import OrderedDict
>>> root = Element("root", OrderedDict([("b","1"),("a","2")])) # attrib parameter
>>> tostring(root)
b'<root b="1" a="2"/>' # preserved
>>> root = Element("root", b="1", a="2") # **extra parameter
>>> tostring(root)
b'<root a="2" b="1"/>' # not preserved

It looks like lxml serializes attributes in the order you set them:

>>> from lxml import etree as ET
>>> x = ET.Element("x")
>>> x.set('a', '1')
>>> x.set('b', '2')
>>> ET.tostring(x)
'<x a="1" b="2"/>'
>>> y= ET.Element("y")
>>> y.set('b', '2')
>>> y.set('a', '1')
>>> ET.tostring(y)
'<y b="2" a="1"/>'

Note that when you pass attributes as keyword arguments to ET.SubElement(), Python builds a dictionary of those keyword arguments and passes that dictionary to lxml. On the Python versions this was written for, that loses any ordering you had in the source file, because dictionaries were unordered (or, rather, their order was determined by string hash values, which may differ from platform to platform or, in fact, from execution to execution). Since Python 3.7, dictionaries are guaranteed to keep insertion order, so this caveat applies mainly to older interpreters.
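
The set() behaviour shown above suggests one way to build the element from the question. This is only a sketch: the values 19 and "Printer Info File" stand in for db.ID and db.name, and the FileFormats parent is a placeholder for sig.fileformats.

```python
from lxml import etree

# Placeholder parent; in the question this would be sig.fileformats.
fileformats = etree.Element("FileFormats")

# Stand-in values for db.ID and db.name (assumptions for illustration).
db_id = 19
db_name = "Printer Info File"

# Attributes listed in the order the third-party tool appears to expect.
ordered_attrs = [
    ("ID", str(db_id)),
    ("Name", db_name),
    ("PUID", "fileSig/{}".format(db_id)),
    ("Version", ""),
    ("MIMEType", ""),
]

fileformat = etree.SubElement(fileformats, "FileFormat")
for attr_name, attr_value in ordered_attrs:
    # Each set() call adds the attribute after those already present.
    fileformat.set(attr_name, attr_value)

serialized = etree.tostring(fileformat)
print(serialized)
```

Keeping the pairs in a list makes the intended order explicit and independent of how any particular Python version treats keyword arguments.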

Attribute ordering and readability

As the commenters have mentioned, attribute order has no semantic significance in XML, which is to say it doesn't change the meaning of an element:

<tag attr1="val1" attr2="val2"/>

<!-- means the same thing as: -->

<tag attr2="val2" attr1="val1"/>

There is an analogous characteristic in SQL, where column order doesn't change the meaning of a table definition. XML attributes and SQL columns each form a set (not an ordered set), so all that can "officially" be said about any one of them is whether it is present in the set.
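
One way to see this in practice is to parse both spellings and compare the resulting attribute mappings. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml, purely to keep it dependency-free; that choice is an assumption of this example, not something the question requires.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring('<tag attr1="val1" attr2="val2"/>')
second = ET.fromstring('<tag attr2="val2" attr1="val1"/>')

# Dictionary equality ignores insertion order, so a consumer that only
# looks at the parsed attributes cannot tell the two documents apart.
same_attributes = first.attrib == second.attrib
print(same_attributes)
```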

That said, the order in which these things appear definitely makes a difference to human readability. In situations where constructs like this are authored by hand, appear in text (e.g. source code), and must be interpreted by people, a careful ordering makes a lot of sense to me.

Typical parser behavior

Any XML parser that treated attribute order as significant would be out of compliance with the XML standard. That doesn't mean it can't happen, but in my experience it is certainly unusual. Still, depending on the provenance of the tool you mention, it's a possibility that may be worth testing.

As far as I know, lxml has no mechanism for specifying the order attributes appear in serialized XML, and I would be surprised if it did.

In order to test the behavior I'd be strongly inclined to just write a text-based template to generate enough XML to test it out:

id = 1
name = 'Development Signature'
puid = 'dev/1'
version = '1.0'
mimetype = 'text/x-test-signature'

template = ('<FileFormat ID="%d" Name="%s" PUID="%s" Version="%s" '
            'MIMEType="%s">')

xml = template % (id, name, puid, version, mimetype)
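
One caveat with plain string formatting is that a value containing a character such as & or a double quote would produce malformed XML. A hedged variant of the template above, assuming the standard library's xml.sax.saxutils.quoteattr is acceptable for a quick test, escapes and quotes each value. The name "Tools & Signatures" is made up for illustration.

```python
from xml.sax.saxutils import quoteattr

file_id = 1
name = 'Tools & Signatures'  # made-up value containing a character that needs escaping
puid = 'dev/1'
version = '1.0'
mimetype = 'text/x-test-signature'

# quoteattr() returns the value already wrapped in quotes,
# so the template itself contains no quote characters.
template = '<FileFormat ID=%s Name=%s PUID=%s Version=%s MIMEType=%s>'

values = (str(file_id), name, puid, version, mimetype)
xml = template % tuple(quoteattr(v) for v in values)
print(xml)
```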

I have seen order matter where the consumer of the XML is expecting canonicalized XML. Canonical XML specifies that the attributes be sorted:

in increasing lexicographic order with namespace URI as the primary key and local name as the secondary key (an empty namespace URI is lexicographically least). (section 2.6 of https://www.w3.org/TR/xml-c14n2/ )

So if your application is expecting the kind of order you would get out of canonical XML, lxml does support output in canonical form using the method= argument to tostring() (see the C14N heading at https://lxml.de/api.html).

For example:

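from lxml import etree as ET
element = ET.Element('Test', B='beta', Z='omega', A='alpha')
# method="c14n" serializes in canonical form, which sorts the attributes
val = ET.tostring(element, method="c14n")
print(val)  # b'<Test A="alpha" B="beta" Z="omega"></Test>'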
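
The ordering rule quoted above can be sketched as a plain sort key. This is an illustration only, not lxml's implementation: c14n_attribute_key is a hypothetical helper, it assumes attribute names in Clark notation ("{namespace-uri}local"), and it does not cover namespace declarations, which canonical XML handles separately.

```python
def c14n_attribute_key(name):
    """Return (namespace URI, local name) for an attribute name in Clark notation."""
    if name.startswith("{"):
        uri, _, local = name[1:].partition("}")
    else:
        # No namespace: the empty URI sorts before every non-empty URI.
        uri, local = "", name
    return (uri, local)


# Example names; the example.com namespaces are placeholders.
names = ["b", "{http://example.com/z}a", "a", "{http://example.com/a}z"]
ordered = sorted(names, key=c14n_attribute_key)
print(ordered)
```

Un-namespaced attributes come first in name order, followed by namespaced ones grouped by URI.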

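You can wrap each string in a small class that compares by an explicit index but behaves like the wrapped string when it is printed, hashed, or concatenated. Sorting then follows the index rather than the text; if every wrapper has the same index, as in the example below, a stable sort leaves the items in their original order.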

Here is an example:

class S:
    def __init__(self, _idx, _obj):
        self._obj = (_idx, _obj)

    def get_idx(self):
        return self._obj[0]

    def __le__(self, other):
        return self._obj[0] <= other.get_idx()

    def __lt__(self, other):
        return self._obj[0] < other.get_idx()

    def __str__(self):
        return self._obj[1].__str__()

    def __repr__(self):
        return self._obj[1].__repr__()

    def __eq__(self, other):
        if isinstance(other, str):
            return self._obj[1] == other
        elif isinstance(other, S):
            return (self._obj[0] == other.get_idx()
                    and self.__str__() == other.__str__())
        else:
            # Other types have no get_idx(); defer to Python's default handling.
            return NotImplemented

    def __add__(self, other):
        return self._obj[1] + other

    def __hash__(self):
        return self._obj[1].__hash__()

    def __getitem__(self, item):
        return self._obj[1].__getitem__(item)

    def __radd__(self, other):
        return other + self._obj[1]

list_sortable = ['c', 'b', 'a']
list_not_sortable = [S(0, 'c'), S(0, 'b'), S(0, 'a')]
print("list_sortable ---- Before sort ----")
for ele in list_sortable:
    print(ele)
print("list_not_sortable ---- Before sort ----")
for ele in list_not_sortable:
    print(ele)
list_sortable.sort()
list_not_sortable.sort()
print("list_sortable ---- After sort ----")
for ele in list_sortable:
    print(ele)
print("list_not_sortable ---- After sort ----")
for ele in list_not_sortable:
    print(ele)

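Running result (the dict_sortable and dict_not_sortable lines come from a dictionary test that is not included in the snippet above):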

list_sortable ---- Before sort ----
c
b
a
list_not_sortable ---- Before sort ----
c
b
a
list_sortable ---- After sort ----
a
b
c
list_not_sortable ---- After sort ----
c
b
a
dict_sortable ---- After sort ----
a 3
b 2
c 1
dict_not_sortable ---- After sort ----
c 1
b 2
a 3

lxml uses libxml2 under the hood. It preserves attribute order, which means for an individual element you can sort them like this:

x = etree.XML('<x a="1" b="2" d="4" c="3"><y></y></x>')
sorted_attrs = sorted(x.attrib.items())
x.attrib.clear()
x.attrib.update(sorted_attrs)

Not very helpful if you want them all sorted though. If you want them all sorted you can use the c14n2 output method (XML Canonicalisation Version 2):

>>> x = etree.XML('<x a="1" b="2" d="4" c="3"><y></y></x>')
>>> etree.tostring(x, method="c14n2")
b'<x a="1" b="2" c="3" d="4"><y></y></x>'

That will sort the attributes. Unfortunately it has the downside of ignoring pretty_print, which isn't great if you want human-readable XML.

If you use c14n2 then lxml will use custom Python serialisation code to write the XML, which calls sorted(x.attrib.items()) itself for all attributes. If you don't, it will instead call into libxml2's xmlNodeDumpOutput() function, which doesn't support sorting attributes but does support pretty-printing.

Therefore, if you want both sorted attributes and pretty-printing, the only solution is to walk the XML tree manually and sort the attributes of every element, like this:

from lxml import etree

x = etree.XML('<x a="1" b="2" d="4" c="3"><y z="1" a="2"><!--comment--></y></x>')
for el in x.iter(etree.Element):
    sorted_attrs = sorted(el.attrib.items())
    el.attrib.clear()
    el.attrib.update(sorted_attrs)

etree.tostring(x, pretty_print=True)

# b'<x a="1" b="2" c="3" d="4">\n  <y a="2" z="1">\n    <!--comment-->\n  </y>\n</x>\n'
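
If the tool turns out to want the specific order from the original file rather than alphabetical order, the same walk can use a custom sort key. This is a sketch only: reorder_attributes and PREFERRED_ORDER are hypothetical names, and the order itself is copied from the source element in the question.

```python
from lxml import etree

# Assumed target order, taken from the source element in the question.
PREFERRED_ORDER = ["ID", "Name", "PUID", "Version", "MIMEType"]


def reorder_attributes(root, preferred):
    """Rewrite every element's attributes so preferred names come first, in order."""
    rank = {attr: position for position, attr in enumerate(preferred)}
    for el in root.iter(etree.Element):
        # Names not listed in `preferred` go last, sorted alphabetically.
        items = sorted(
            el.attrib.items(),
            key=lambda pair: (rank.get(pair[0], len(rank)), pair[0]),
        )
        el.attrib.clear()
        el.attrib.update(items)
    return root


doc = etree.XML(
    '<FileFormats><FileFormat MIMEType="" PUID="fileSig/19" Version="" '
    'Name="Printer Info File" ID="19"/></FileFormats>'
)
reorder_attributes(doc, PREFERRED_ORDER)
result = etree.tostring(doc)
print(result)
```

Because the tree itself is reordered, the result still goes through the normal serializer, so pretty_print=True remains available.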
