
Distinguishing between <foo/> and <foo></foo> in Python XML parsing and generation

I have been using Python's ElementTree to create an XML document, and so far so good. However, due to project requirements, the document must contain both kinds of empty element: some written with explicit start and end tags (<foo></foo>) and others written as self-closing tags (<foo/>). My current implementation can only do one or the other. By default, every empty element is written as a self-closing tag, so the ones that need start/end tags come out wrong. If I force start/end tags for empty elements, the ones that should stay self-closing are converted to start/end tags as well, which is also wrong.

Can someone please help me out and point me to a possible solution? Any and all suggestions are welcome. I need to use Python 2.7. Thank you.

As far as the XML standard is concerned, an empty element written with start and end tags (<foo></foo>) means exactly the same thing as a self-closing tag (<foo/>).

So, first, this probably isn't a good idea.

And second, most XML libraries probably aren't going to let you distinguish between the two.

But if you need to do this, you can always patch any library you want. Since you're already using ElementTree, that seems like the obvious choice to patch.


In the latest versions of ElementTree (including the version that comes with Python 3.4+; on older Pythons you'll need to install the latest externally-maintained version), you can actually control this globally, with the short_empty_elements argument to write and related functions. But, as you say, this isn't what you actually want; you need some elements to be self-closing and some not.
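To make the global switch concrete, here is a minimal sketch using the standard-library ElementTree on Python 3.4 or later (it does not run on a stock Python 2.7). The element names are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Build a tiny tree with two empty children (names are placeholders).
root = ET.Element("root")
ET.SubElement(root, "a")
ET.SubElement(root, "b")

# Default behaviour: every empty element is written in the short form.
short_form = ET.tostring(root, encoding="unicode")

# Global switch: every empty element gets explicit start and end tags.
long_form = ET.tostring(root, encoding="unicode", short_empty_elements=False)

print(short_form)
print(long_form)
```

Either way the choice applies to the whole document, which is why it cannot produce a mix of the two forms on its own.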

I think you'd be better off starting from the externally-maintained version of ElementTree, rather than the version that comes built in with Python 2.7. But I'm not sure where its official repo is, so I'm going to link to the Python 3.4 code instead. Hopefully that gives you enough to take it from there.

The key function is _serialize_xml (note the leading underscore; it is a private helper in xml/etree/ElementTree.py). I think that function isn't C-accelerated, so you only need to change the pure Python version. In which case it's just one line:

if text or len(elem) or not short_empty_elements:

Change it to:

if text or len(elem) or not getattr(elem, 'short_empty', short_empty_elements):

And now, if you set node.short_empty = True or node.short_empty = False on an empty node, it will override the global setting for short_empty_elements.
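The effect of that one-line change can be tried out without touching the library. The following is only a sketch: serialize is a deliberately simplified stand-in for the real serializer (no namespaces, comments or processing instructions), and FlaggedElement is a hypothetical subclass used here so that a Python attribute can be set on the node whichever Element implementation is in use.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


class FlaggedElement(ET.Element):
    """Hypothetical subclass; its instances accept extra Python attributes."""


def serialize(elem, short_empty_elements=True):
    """Simplified stand-in for the library serializer, for illustration only."""
    attrs = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in elem.attrib.items())
    out = "<%s%s" % (elem.tag, attrs)
    # The modified condition: a per-element flag overrides the global default.
    if elem.text or len(elem) or not getattr(elem, "short_empty", short_empty_elements):
        out += ">" + escape(elem.text or "")
        out += "".join(serialize(child, short_empty_elements) for child in elem)
        out += "</%s>" % elem.tag
    else:
        out += " />"
    return out + escape(elem.tail or "")


root = FlaggedElement("root")
a = FlaggedElement("a")      # no flag: follows the global default (short form)
b = FlaggedElement("b")
b.short_empty = False        # per-element override: force <b></b>
root.append(a)
root.append(b)

result = serialize(root)
print(result)
```

Elements without the flag fall back to the global short_empty_elements value, so existing callers keep their current output.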


Except… I think if you're using the C accelerator, you can't add attributes (I mean Python attributes, like node.short_empty, not XML attributes) to an Element. Which means you'll either need to patch Element to allow this (which is partly in C: you'll have to not disable the __dict__ and modify the else to call PyObject_GenericSetAttr instead of raising), or fake it by, e.g., using some fake XML attribute, which you strip out when serializing.

Of course if you're using ElementTree rather than cElementTree in 2.7, you're not using the C accelerator, so you probably don't need to worry about this part.
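The fake-XML-attribute idea might look roughly like the sketch below. The marker name _short_empty and its "true"/"false" values are assumptions made up for this example, and serialize is again a simplified stand-in rather than the library's own function (no namespaces, comments or processing instructions).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical marker attribute; pick a name that cannot clash with real data.
MARKER = "_short_empty"


def serialize(elem, short_empty_elements=True):
    """Simplified serializer that consumes the marker instead of writing it."""
    attrib = dict(elem.attrib)           # copy, so the tree itself is untouched
    flag = attrib.pop(MARKER, None)      # strip the marker from the output
    short = short_empty_elements if flag is None else (flag == "true")
    attrs = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in attrib.items())
    out = "<%s%s" % (elem.tag, attrs)
    if elem.text or len(elem) or not short:
        out += ">" + escape(elem.text or "")
        out += "".join(serialize(child, short_empty_elements) for child in elem)
        out += "</%s>" % elem.tag
    else:
        out += " />"
    return out + escape(elem.tail or "")


root = ET.Element("root")
ET.SubElement(root, "a")                      # unmarked: uses the global default
ET.SubElement(root, "b", {MARKER: "false"})   # marked: force <b></b>

result = serialize(root)
print(result)
```

Because the marker is an ordinary XML attribute, this variant works with plain Element objects, at the cost of having to keep the marker out of any output produced by other code paths.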


You might want to consider looking at the lxml implementation of the ElementTree API to see if it's easier to patch.


Meanwhile, considering that they've added short_empty_elements to the library, the maintainers might be interested in accepting your patch upstream.


 