简体   繁体   中英

Python xml minidom. generate <text>Some text</text> element

I have the following code.

from xml.dom.minidom import Document

doc = Document()

root = doc.createElement('root')
doc.appendChild(root)
main = doc.createElement('Text')
root.appendChild(main)

text = doc.createTextNode('Some text here')
main.appendChild(text)

print doc.toprettyxml(indent='\t')

The result is:

<?xml version="1.0" ?>
<root>
    <Text>
        Some text here
    </Text>
</root>

This is all fine and dandy, but what if I want the output to look like this?

<?xml version="1.0" ?>
<root>
    <Text>Some text here</Text>
</root>

Can this easily be done?

Orjanp...

Can this easily be done?

It depends what exact rule you want, but generally no, you get little control over pretty-printing. If you want a specific format you'll usually have to write your own walker.

The DOM Level 3 LS parameter format-pretty-print in pxdom comes pretty close to your example. Its rule is that if an element contains only a single TextNode, no extra whitespace will be added. However it (currently) uses two spaces for an indent rather than four.

>>> doc= pxdom.parseString('<a><b>c</b></a>')
>>> doc.domConfig.setParameter('format-pretty-print', True)
>>> print doc.pxdomContent
<?xml version="1.0" encoding="utf-16"?>
<a>
  <b>c</b>
</a>

(Adjust encoding and output format for whatever type of serialisation you're doing.)

If that's the rule you want, and you can get away with it, you might also be able to monkey-patch minidom's Element.writexml, eg.:

>>> from xml.dom import minidom
>>> def newwritexml(self, writer, indent= '', addindent= '', newl= ''):
...     if len(self.childNodes)==1 and self.firstChild.nodeType==3:
...         writer.write(indent)
...         self.oldwritexml(writer) # cancel extra whitespace
...         writer.write(newl)
...     else:
...         self.oldwritexml(writer, indent, addindent, newl)
... 
>>> minidom.Element.oldwritexml= minidom.Element.writexml
>>> minidom.Element.writexml= newwritexml

All the usual caveats about the badness of monkey-patching apply.

I was looking for exactly the same thing, and I came across this post. (the indenting provided in xml.dom.minidom broke another tool that I was using to manipulate the XML, and I needed it to be indented) I tried the accepted solution with a slightly more complex example and this was the result:

In [1]: import pxdom

In [2]: xml = '<a><b>fda</b><c><b>fdsa</b></c></a>'

In [3]: doc = pxdom.parseString(xml)

In [4]: doc.domConfig.setParameter('format-pretty-print', True)

In [5]: print doc.pxdomContent
<?xml version="1.0" encoding="utf-16"?>
<a>
  <b>fda</b><c>
    <b>fdsa</b>
  </c>
</a>

The pretty printed XML isn't formatted correctly, and I'm not too happy about monkey patching (ie I barely know what it means, and understand it's bad), so I looked for another solution.

I'm writing the output to file, so I can use the xmlindent program for Ubuntu ($sudo aptitude install xmlindent). So I just write the unformatted to the file, then call the xmlindent from within the python program:

from subprocess import Popen, PIPE
Popen(["xmlindent", "-i", "2", "-w", "-f", "-nbe", file_name], 
      stderr=PIPE, 
      stdout=PIPE).communicate()

The -w switch causes the file to be overwritten, but annoyingly leaves a named eg "myfile.xml~" which you'll probably want to remove. The other switches are there to get the formatting right (for me).

PS xmlindent is a stream formatter, ie you can use it as follows:

cat myfile.xml | xmlindent > myfile_indented.xml

So you might be able to use it in a python script without writing to a file if you needed to.

This could be done with toxml(), using regular expressions to tidy things up.

>>> from xml.dom.minidom import Document
>>> import re
>>> doc = Document()
>>> root = doc.createElement('root')
>>> _ = doc.appendChild(root)
>>> main = doc.createElement('Text')
>>> _ = root.appendChild(main)
>>> text = doc.createTextNode('Some text here')
>>> _ = main.appendChild(text)
>>> out = doc.toxml()
>>> niceOut = re.sub(r'><', r'>\n<', re.sub(r'(<\/.*?>)', r'\1\n', out))
>>> print niceOut
<?xml version="1.0" ?>
<root>
<Text>Some text here</Text>
</root>

This solution worked for me without monkey patching or ceasing to use minidom:

from xml.dom.ext import PrettyPrint
from StringIO import StringIO

def toprettyxml_fixed (node, encoding='utf-8'):
    tmpStream = StringIO()
    PrettyPrint(node, stream=tmpStream, encoding=encoding)
    return tmpStream.getvalue()

http://ronrothman.com/public/leftbraned/xml-dom-minidom-toprettyxml-and-silly-whitespace/#best-solution

Easiest way to do this is to use prettyxml, and remove the \\n and \\t inside the tags. That way you keep your indent as you required in your example.

xml_output = doc.toprettyxml() nojunkintags = re.sub('>(\\n|\\t)</', '', xml_output) print nojunkintags

The pyxml package offers a simple solution to this by using the xml.dom.ext.PrettyPrint() function. It can also print to a file descriptor.

But the pyxml package is no longer maintained.

Oerjan Pettersen

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM