
Python ElementTree: ElementTree vs root Element

I'm a bit confused by some of the design decisions in the Python ElementTree API - they seem kind of arbitrary, so I'd like some clarification to see if these decisions have some logic behind them, or if they're just more or less ad hoc.

So, generally there are two ways you might want to generate an ElementTree. One is from some kind of source stream, like a file or another I/O stream. This is done with the module-level parse() function, or with the parse() method of an ElementTree instance.

Another way is to load the XML directly from a string object. This can be done via the fromstring() function.

Okay, great. Now, I would think these functions would be basically identical in terms of what they return, since the only difference between them is the source of input (one takes a file name or stream object, the other takes a plain string). Except that, for some reason, the parse() function returns an ElementTree object, while the fromstring() function returns an Element object. The difference is that the Element object is the root element of an XML tree, whereas the ElementTree object is a sort of "wrapper" around the root element that provides some extra features. You can always get the root element from an ElementTree object by calling getroot().
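To make the difference in return types concrete, here is a minimal sketch using the standard-library xml.etree.ElementTree module. The sample XML string and the variable names are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
xml_text = "<root><child>text</child></root>"

# parse() takes a file name or file-like object and returns an ElementTree.
tree = ET.parse(io.StringIO(xml_text))

# fromstring() takes a string and returns the root Element directly.
root_from_string = ET.fromstring(xml_text)

print(type(tree).__name__)              # ElementTree
print(type(root_from_string).__name__)  # Element

# getroot() unwraps the ElementTree to reach the same kind of object.
root_from_tree = tree.getroot()
print(root_from_tree.tag == root_from_string.tag)  # True
```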

Still, I'm confused why we have this distinction. Why does fromstring() return a root element directly, but parse() returns an ElementTree object? Is there some logic behind this distinction?

A good answer comes from this old discussion:

Just for the record: Fredrik [the creator of ElementTree] doesn't actually consider it a design "quirk". He argues that it's designed for different use cases. While parse() parses a file, which normally contains a complete document (represented in ET as an ElementTree object), fromstring() and especially the 'literal wrapper' XML() are made for parsing strings, which (most?) often only contain XML fragments. With a fragment, you normally want to continue doing things like inserting it into another tree, so you need the top-level element in almost all cases.
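The fragment use case described above might look like the following sketch. The element names and text are invented for the example. The point is only that XML() and fromstring() hand back an Element that can be appended to another tree right away.

```python
import xml.etree.ElementTree as ET

# Hypothetical target document that already exists in memory.
document = ET.Element("document")

# Parse two fragments from strings; each call returns an Element.
section = ET.XML("<section><title>Intro</title></section>")
paragraph = ET.fromstring("<para>First paragraph.</para>")

# Because we already hold Elements, no getroot() call is needed
# before inserting them into the other tree.
section.append(paragraph)
document.append(section)

print(ET.tostring(document, encoding="unicode"))
```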

And:

Why isn't et.parse the only way to do this? Why have XML or fromstring at all?

Well, use cases. XML() is an alias for fromstring(), because it's convenient (and well readable) to write

  section = XML('<section><title>A to Z</title></section>')
  section.append(paragraphs)

for XML literals in source code. fromstring() is there because when you want to parse a fragment from a string that you got from whatever source, it's easy to express that with exactly that function, as in

  el = fromstring(some_string) 

If you want to parse a document from a file or file-like object, use parse(). Three use cases, three functions. The fourth use case of parsing a document from a string does not have its own function, because it is trivial to write

  tree = parse(BytesIO(some_byte_string)) 
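As a self-contained version of that fourth use case, the sketch below parses a complete document held in a byte string. The document content is a made-up example. Wrapping the result of fromstring() in ElementTree() is shown as an alternative; it is not part of the quoted answer.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical complete document held in memory as bytes.
some_byte_string = b"<?xml version='1.0' encoding='utf-8'?><doc><item>1</item></doc>"

# Treat the bytes as a file-like object so that parse() can read them.
tree = ET.parse(io.BytesIO(some_byte_string))
print(tree.getroot().tag)  # doc

# Alternative: parse to an Element first, then wrap it in an ElementTree.
tree2 = ET.ElementTree(ET.fromstring(some_byte_string))
print(tree2.getroot().find("item").text)  # 1
```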

I'm thinking the same as remram in the comments: parse() takes a file location or a file object and preserves that information so that it can provide additional utility, which is really helpful. If parse() did not return an ElementTree object, you would have to keep track of the sources yourself in order to feed them back manually into the helper functions that ElementTree objects have by default. Strings, by definition, do not have the same kind of information attached to them, so you can't create the same utilities for them (otherwise there might well be an ET.parsefromstring() method that returned an ElementTree object).

I suspect this is also the logic behind the method being named parse() instead of ET.fromfile(): I would expect the same object type to be returned from fromfile() and fromstring(), but I can't say I would expect the same from parse() (it's been a long time since I started using ET, so there's no way to verify that, but that's my feeling).

On the subject remram raised of placing utility methods on Elements: as I understand the documentation, Elements are extremely uniform in their implementation. People talk about "root elements," but the Element at the root of the tree is identical to every other Element in terms of its class attributes and methods. As far as I know, Elements don't even know who their parent is, which likely supports this uniformity. Otherwise there would need to be more code to implement the "root" Element (which doesn't have a parent) or to re-parent subelements. It seems to me that the simplicity of the Element class works greatly in its favor. So it seems better to leave Elements largely agnostic of anything above them (their parent, the file they came from), so that there can't be any snags such as four Elements in the same tree having different output files.
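The uniformity described above can be checked with a small sketch against the standard-library module. The sample XML is invented for the example. Since standard-library Elements carry no parent reference, a parent lookup has to be built separately, here with a dictionary comprehension.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-level tree used only for this illustration.
root = ET.fromstring("<a><b><c/></b></a>")
child = root.find("b")
grandchild = child.find("c")

# The root and its descendants are instances of the same class.
print(type(root) is type(child))  # True

# Standard-library Elements have no parent accessor
# (third-party lxml offers one, the stdlib does not).
print(hasattr(child, "getparent"))  # False

# Build a child -> parent mapping by walking the tree once.
parent_map = {c: p for p in root.iter() for c in p}
print(parent_map[grandchild].tag)  # b
```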

When it comes to using the module in code, it seems to me that the script has to recognize the input as a file at some point, one way or another (otherwise it would be trying to pass the file to fromstring()). So there shouldn't be a situation where the output of parse() is unexpected, with an ElementTree mistaken for an Element and processed as such (unless, of course, parse() was used without the programmer checking what it returns, which just seems like a poor habit to me).
