
Pretty printing XML in Python

What is the best way (or are the various ways) to pretty print XML in Python?

import xml.dom.minidom

dom = xml.dom.minidom.parse(xml_fname) # or xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = dom.toprettyxml()
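toprettyxml() also accepts indent and newl arguments. A minimal sketch, assuming the XML is already in memory as a string; the sample document is made up for illustration:

```python
import xml.dom.minidom

# Hypothetical sample input; any well-formed XML string works.
xml_string = "<a><b>text</b><c/></a>"

dom = xml.dom.minidom.parseString(xml_string)
# indent sets the per-level indentation, newl the line separator.
pretty = dom.toprettyxml(indent="  ", newl="\n")
print(pretty)
```

The first line of the result is the XML declaration that minidom always emits.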

lxml is recent, updated, and includes a pretty print function

import lxml.etree as etree

x = etree.parse("filename")
print(etree.tostring(x, pretty_print=True, encoding="unicode"))

Check out the lxml tutorial: http://lxml.de/tutorial.html

Another solution is to borrow this indent function, for use with the ElementTree library that has been built in to Python since 2.5. Here's what that would look like:

from xml.etree import ElementTree

def indent(elem, level=0):
    # whitespace that places the next tag at this element's own depth
    i = "\n" + level*"  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for subelem in elem:
            indent(subelem, level+1)
        # the last child's tail precedes this element's closing tag
        if not subelem.tail or not subelem.tail.strip():
            subelem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
    return elem

root = ElementTree.parse('/tmp/xmlfile').getroot()
indent(root)
ElementTree.dump(root)

Here's my (hacky?) solution to get around the ugly text node problem.

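import re

# doc is an xml.dom.minidom Document
uglyXml = doc.toprettyxml(indent='  ')

# pull text nodes that toprettyxml() put on their own lines back inline
text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
prettyXml = text_re.sub(r'>\g<1></', uglyXml)

print(prettyXml)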

The above code will produce:

<?xml version="1.0" ?>
<issues>
  <issue>
    <id>1</id>
    <title>Add Visual Studio 2005 and 2008 solution files</title>
    <details>We need Visual Studio 2005/2008 project files for Windows.</details>
  </issue>
</issues>

Instead of this:

<?xml version="1.0" ?>
<issues>
  <issue>
    <id>
      1
    </id>
    <title>
      Add Visual Studio 2005 and 2008 solution files
    </title>
    <details>
      We need Visual Studio 2005/2008 project files for Windows.
    </details>
  </issue>
</issues>

Disclaimer: There are probably some limitations.
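To see what the substitution does in isolation, here is a small sketch that runs the same regular expression on a hand-written fragment. The fragment is made up; it imitates the text-on-its-own-line layout shown above.

```python
import re

# Hypothetical fragment in the "ugly" layout: text on a line of its own.
ugly = "<id>\n      1\n    </id>"

text_re = re.compile(r'>\n\s+([^<>\s].*?)\n\s+</', re.DOTALL)
# Keep only the captured text between the opening and closing tag.
pretty = text_re.sub(r'>\g<1></', ugly)
print(pretty)
```

Because the pattern only looks at raw characters, it is best treated as a cosmetic fix-up rather than a general XML transformation.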

You have a few options.

ElementTree.indent()

Simple to use, pretty output.

But requires Python 3.9+

import xml.etree.ElementTree as ET

element = ET.XML("<html><body>text</body></html>")
ET.indent(element)
print(ET.tostring(element, encoding='unicode'))
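ET.indent() also takes a space argument for the indentation unit. A minimal sketch, assuming Python 3.9 or later; the four-space unit is just an illustrative choice:

```python
import xml.etree.ElementTree as ET

element = ET.XML("<html><body>text</body></html>")
# space is the whitespace inserted per nesting level (default: two spaces)
ET.indent(element, space="    ")
pretty = ET.tostring(element, encoding="unicode")
print(pretty)
```

ET.indent() modifies the tree in place, so the same element can be written to a file afterwards.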

BeautifulSoup.prettify()

BeautifulSoup may be the simplest solution for Python versions before 3.9.

from bs4 import BeautifulSoup

bs = BeautifulSoup(open(xml_file), 'xml')
pretty_xml = bs.prettify()
print(pretty_xml)

Output:

<?xml version="1.0" encoding="utf-8"?>
<issues>
 <issue>
  <id>
   1
  </id>
  <title>
   Add Visual Studio 2005 and 2008 solution files
  </title>
 </issue>
</issues>

This is my go-to answer. The default arguments work as is, but text contents are spread out on separate lines as if they were nested elements.

lxml

Prettier output, but it needs arguments.

from lxml import etree

x = etree.parse(FILE_NAME)
pretty_xml = etree.tostring(x, pretty_print=True, encoding=str)

Produces:

<issues>
  <issue>
    <id>1</id>
    <title>Add Visual Studio 2005 and 2008 solution files</title>
    <details>We need Visual Studio 2005/2008 project files for Windows.</details>
  </issue>
</issues>

This works for me with no issues.


xml.dom.minidom

No external dependencies, but it requires post-processing.

import os
import xml.dom.minidom as md

dom = md.parse(FILE_NAME)     
# To parse string instead use: dom = md.parseString(xml_string)
pretty_xml = dom.toprettyxml()
# remove the weird newline issue:
pretty_xml = os.linesep.join([s for s in pretty_xml.splitlines()
                              if s.strip()])

The output is the same as above, but it's more code.
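The blank lines that the post-processing step removes come from whitespace-only text nodes in input that was already indented. A small sketch of the effect, using a made-up input string:

```python
import xml.dom.minidom as md

# Hypothetical input that already contains indentation whitespace.
already_indented = "<a>\n  <b>text</b>\n</a>"

raw = md.parseString(already_indented).toprettyxml(indent="  ")
# Each whitespace-only text node is written out as an extra, empty-looking line.
blank_lines = [line for line in raw.splitlines() if not line.strip()]

# Dropping those lines leaves the expected layout.
cleaned = "\n".join(line for line in raw.splitlines() if line.strip())
print(cleaned)
```

Input with no whitespace between tags does not need this clean-up.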

As others pointed out, lxml has a pretty printer built in.

Be aware though that by default it changes CDATA sections to normal text, which can have nasty results.

Here's a Python function that preserves the input file and only changes the indentation (notice the strip_cdata=False). Furthermore it makes sure the output uses UTF-8 as encoding instead of the default ASCII (notice the encoding='utf-8'):

from lxml import etree

def prettyPrintXml(xmlFilePathToPrettyPrint):
    assert xmlFilePathToPrettyPrint is not None
    parser = etree.XMLParser(resolve_entities=False, strip_cdata=False)
    document = etree.parse(xmlFilePathToPrettyPrint, parser)
    document.write(xmlFilePathToPrettyPrint, pretty_print=True, encoding='utf-8')

Example usage:

prettyPrintXml('some_folder/some_file.xml')

As of Python 3.9, ElementTree has an indent() function for pretty-printing XML trees.

See https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.indent

Sample usage:

import xml.etree.ElementTree as ET

element = ET.XML("<html><body>text</body></html>")
ET.indent(element)
print(ET.tostring(element, encoding='unicode'))

The upside is that it does not require any additional libraries. For more information, see https://bugs.python.org/issue14465 and https://github.com/python/cpython/pull/15200

If you have xmllint, you can spawn a subprocess and use it. xmllint --format <file> pretty-prints its input XML to standard output.

Note that this method uses a program external to Python, which makes it sort of a hack.

import subprocess

def pretty_print_xml(xml):
    # xml must be bytes; xmllint reads the document from /dev/stdin
    proc = subprocess.Popen(
        ['xmllint', '--format', '/dev/stdin'],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    (output, error_output) = proc.communicate(xml)
    return output

print(pretty_print_xml(data))
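A variation on the same idea, as a hedged sketch: it checks whether xmllint is on the PATH before calling it and passes "-" so that xmllint reads standard input. The minidom fallback is an assumption of this sketch, not part of the approach above.

```python
import shutil
import subprocess
import xml.dom.minidom

def pretty_print_xml(xml_bytes):
    """Format XML bytes with xmllint if it is installed, else with minidom."""
    if shutil.which("xmllint"):
        # "-" tells xmllint to read the document from standard input.
        result = subprocess.run(
            ["xmllint", "--format", "-"],
            input=xml_bytes,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=True,  # raise CalledProcessError on malformed input
        )
        return result.stdout
    # Fallback added for this sketch only.
    return xml.dom.minidom.parseString(xml_bytes).toprettyxml(
        indent="  ", encoding="utf-8"
    )

formatted = pretty_print_xml(b"<a><b>text</b></a>")
print(formatted.decode("utf-8"))
```

Passing the arguments as a list avoids going through a shell, which matters when file names or content come from outside.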

I tried to edit ade's answer above, but Stack Overflow wouldn't let me edit after I had initially provided feedback anonymously. This is a less buggy version of the function to pretty-print an ElementTree.

def indent(elem, level=0, more_sibs=False):
    i = "\n"
    if level:
        i += (level-1) * '  '
    num_kids = len(elem)
    if num_kids:
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
            if level:
                elem.text += '  '
        count = 0
        for kid in elem:
            indent(kid, level+1, count < num_kids - 1)
            count += 1
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
            if more_sibs:
                elem.tail += '  '
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
            if more_sibs:
                elem.tail += '  '

If you're using a DOM implementation, each has its own form of pretty-printing built in:

# minidom
#
document.toprettyxml()

# 4DOM
#
xml.dom.ext.PrettyPrint(document, stream)

# pxdom (or other DOM Level 3 LS-compliant imp)
#
serializer.domConfig.setParameter('format-pretty-print', True)
serializer.writeToString(document)

If you're using something else without its own pretty-printer, or those pretty-printers don't quite do it the way you want, you'd probably have to write or subclass your own serialiser.
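As a rough illustration of what writing your own serialiser can look like, here is a minimal sketch for ElementTree elements. The function name serialize and its parameters are hypothetical, and it deliberately ignores namespaces, comments and processing instructions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def serialize(elem, level=0, unit="  "):
    """Return elem as indented XML text (hypothetical helper, simplified)."""
    pad = unit * level
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in elem.attrib.items())
    text = (elem.text or "").strip()
    if len(elem) == 0:
        # Leaf element: keep its text on the same line as the tags.
        if text:
            return f"{pad}<{elem.tag}{attrs}>{escape(text)}</{elem.tag}>\n"
        return f"{pad}<{elem.tag}{attrs}/>\n"
    out = f"{pad}<{elem.tag}{attrs}>\n"
    if text:
        out += f"{pad}{unit}{escape(text)}\n"
    for child in elem:
        out += serialize(child, level + 1, unit)
        tail = (child.tail or "").strip()
        if tail:
            out += f"{pad}{unit}{escape(tail)}\n"
    out += f"{pad}</{elem.tag}>\n"
    return out

root = ET.fromstring('<a x="1"><b>t &amp; u</b><c/></a>')
result = serialize(root)
print(result)
```

Owning the serialiser gives full control over the layout, at the cost of handling escaping and edge cases yourself.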

from yattag import indent

pretty_string = indent(ugly_string)

It won't add spaces or newlines inside text nodes, unless you ask for it with:

indent(mystring, indent_text = True)

You can specify what the indentation unit should be and what the newline should look like.

pretty_xml_string = indent(
    ugly_xml_string,
    indentation = '    ',
    newline = '\r\n'
)

The documentation is on the http://www.yattag.org homepage.

I had some problems with minidom's pretty print. I'd get a UnicodeError whenever I tried pretty-printing a document with characters outside the given encoding, e.g. if I had a β in a document and I tried doc.toprettyxml(encoding='latin-1'). Here's my workaround for it:

def toprettyxml(doc, encoding):
    """Return a pretty-printed XML document in a given encoding."""
    unistr = doc.toprettyxml().replace(u'<?xml version="1.0" ?>',
                          u'<?xml version="1.0" encoding="%s"?>' % encoding)
    return unistr.encode(encoding, 'xmlcharrefreplace')
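The key piece of the workaround is the 'xmlcharrefreplace' error handler of str.encode(). A tiny sketch of what it does on its own, with a made-up sample string:

```python
# "\u00e9" is e-acute (representable in latin-1), "\u03b2" is Greek beta (not).
text = "\u00e9 \u03b2"

# Characters the codec cannot represent become numeric character references.
data = text.encode("latin-1", "xmlcharrefreplace")
print(data)
```

An XML parser reading the result turns the character reference back into the original character.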

I wrote a solution to walk through an existing ElementTree and use text/tail to indent it as one typically expects.

def prettify(element, indent='  '):
    queue = [(0, element)]  # (level, element)
    while queue:
        level, element = queue.pop(0)
        children = [(level + 1, child) for child in list(element)]
        if children:
            element.text = '\n' + indent * (level+1)  # for child open
        if queue:
            element.tail = '\n' + indent * queue[0][0]  # for sibling open
        else:
            element.tail = '\n' + indent * (level-1)  # for parent close
        queue[0:0] = children  # prepend so children come before siblings

Here's a Python 3 solution that gets rid of the ugly newline issue (tons of whitespace), and it only uses standard libraries, unlike most other implementations.

import xml.etree.ElementTree as ET
import xml.dom.minidom
import os

def pretty_print_xml_given_root(root, output_xml):
    """
    Useful for when you are editing xml data on the fly
    """
    xml_string = xml.dom.minidom.parseString(ET.tostring(root)).toprettyxml()
    xml_string = os.linesep.join([s for s in xml_string.splitlines() if s.strip()]) # remove the weird newline issue
    with open(output_xml, "w") as file_out:
        file_out.write(xml_string)

def pretty_print_xml_given_file(input_xml, output_xml):
    """
    Useful for when you want to reformat an already existing xml file
    """
    tree = ET.parse(input_xml)
    root = tree.getroot()
    pretty_print_xml_given_root(root, output_xml)

I found how to fix the common newline issue here.

You can use the popular external library xmltodict; with unparse and pretty=True you will get the best result:

import xmltodict

pretty_xml = xmltodict.unparse(
    xmltodict.parse(my_xml), full_document=False, pretty=True)

full_document=False suppresses the <?xml version="1.0" encoding="UTF-8"?> declaration at the top.

XML pretty print for Python looks pretty good for this task. (Appropriately named, too.)

An alternative is to use pyXML, which has a PrettyPrint function.

Take a look at the vkbeautify module.

It is a Python version of my very popular JavaScript/Node.js plugin with the same name. It can pretty-print/minify XML, JSON and CSS text. Input and output can be string/file in any combination. It is very compact and doesn't have any dependencies.

Examples:

import vkbeautify as vkb

vkb.xml(text)                       
vkb.xml(text, 'path/to/dest/file')  
vkb.xml('path/to/src/file')        
vkb.xml('path/to/src/file', 'path/to/dest/file') 

You can try this variation...

Install BeautifulSoup and the backend lxml (parser) libraries:

user$ pip3 install lxml bs4

Process your XML document:

from bs4 import BeautifulSoup

with open('/path/to/file.xml', 'r') as doc:
    # parse the whole document at once; parsing line by line would
    # treat every line as a separate document
    print(BeautifulSoup(doc, 'lxml-xml').prettify())

If you don't want to have to reparse, an alternative is the xmlpp.py library with the get_pprint() function. It worked nicely and smoothly for my use cases, without having to reparse to an lxml ElementTree object.

I had this problem and solved it like this:

def write_xml_file (self, file, xml_root_element, xml_declaration=False, pretty_print=False, encoding='unicode', indent='\t'):
    pretty_printed_xml = etree.tostring(xml_root_element, xml_declaration=xml_declaration, pretty_print=pretty_print, encoding=encoding)
    if pretty_print: pretty_printed_xml = pretty_printed_xml.replace('  ', indent)
    file.write(pretty_printed_xml)

In my code this method is called like this:

try:
    with open(file_path, 'w') as file:
        file.write('<?xml version="1.0" encoding="utf-8" ?>')

        # create some xml content using etree ...

        xml_parser = XMLParser()
        xml_parser.write_xml_file(file, xml_root, xml_declaration=False, pretty_print=True, encoding='unicode', indent='\t')

except IOError:
    print("Error while writing in log file!")

This works only because etree uses two spaces to indent by default, which I find doesn't emphasize the indentation enough and is therefore not pretty. I couldn't find any setting for etree or parameter for any function to change the standard etree indent. I like how easy it is to use etree, but this was really annoying me.
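One caveat with a plain replace('  ', indent) is that it also touches double spaces inside text content. A hedged sketch of a narrower variant that rewrites only leading indentation; the helper name reindent and its parameters are made up for illustration:

```python
import re

def reindent(pretty_xml, old="  ", new="\t"):
    """Swap the indentation unit at the start of each line (hypothetical helper)."""
    def swap(match):
        # Number of leading indentation units on this line.
        depth = len(match.group(0)) // len(old)
        return new * depth
    pattern = r"^(?:%s)+" % re.escape(old)
    return re.sub(pattern, swap, pretty_xml, flags=re.MULTILINE)

sample = "<a>\n  <b>x  y</b>\n    <c/>\n</a>"
converted = reindent(sample)
print(converted)
```

Text that itself spans several lines would still have its leading spaces rewritten, so this remains a heuristic.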

For converting an entire XML document to a pretty XML document (e.g. assuming you've extracted [unzipped] a LibreOffice Writer .odt or .ods file, and you want to convert the ugly "content.xml" file to a pretty one for automated git version control and git difftool-ing of .odt/.ods files, such as I'm implementing here):

import xml.dom.minidom

file = open("./content.xml", 'r')
xml_string = file.read()
file.close()

parsed_xml = xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = parsed_xml.toprettyxml()

file = open("./content_new.xml", 'w')
file.write(pretty_xml_as_string)
file.close()

References:
- Thanks to Ben Noland's answer on this page, which got me most of the way there.

import re

from lxml import etree
import xml.dom.minidom as mmd

xml_root = etree.parse(xml_file_path, etree.XMLParser())

def print_xml(xml_root):
    plain_xml = etree.tostring(xml_root).decode('utf-8')
    # drop whitespace between tags only; removing all whitespace would
    # also glue tag names to their attributes
    ugly_xml = re.sub(r'>\s+<', '><', plain_xml)
    good_xml = mmd.parseString(ugly_xml)
    print(good_xml.toprettyxml(indent='    '))

It works well for XML containing Chinese text!

If for some reason you can't get your hands on any of the Python modules that other users mentioned, I suggest the following solution for Python 2.7:

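import subprocess

def makePretty(filepath):
  # pass the arguments as a list so the path is never interpreted by a shell
  prettyXML = subprocess.check_output(["xmllint", "--format", filepath])
  # check_output returns bytes on Python 3, so write in binary mode
  with open(filepath, "wb") as outfile:
    outfile.write(prettyXML)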

As far as I know, this solution will work on Unix-based systems that have the xmllint package installed.

I found this question while looking for "how to pretty print html".

Using some of the ideas in this thread, I adapted the XML solutions to work for XML or HTML:

from xml.dom.minidom import parseString as string_to_dom

def prettify(string, html=True):
    dom = string_to_dom(string)
    ugly = dom.toprettyxml(indent="  ")
    split = list(filter(lambda x: len(x.strip()), ugly.split('\n')))
    if html:
        split = split[1:]
    pretty = '\n'.join(split)
    return pretty

def pretty_print(html):
    print(prettify(html))

When used, this is what it looks like:

html = """\
<div class="foo" id="bar"><p>'IDK!'</p><br/><div class='baz'><div>
<span>Hi</span></div></div><p id='blarg'>Try for 2</p>
<div class='baz'>Oh No!</div></div>
"""

pretty_print(html)

Which returns:

<div class="foo" id="bar">
  <p>'IDK!'</p>
  <br/>
  <div class="baz">
    <div>
      <span>Hi</span>
    </div>
  </div>
  <p id="blarg">Try for 2</p>
  <div class="baz">Oh No!</div>
</div>
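Since this approach goes through an XML parser, it only accepts markup that is well-formed XML; HTML with unclosed tags such as a bare <br> is rejected. A small sketch of that constraint, with a hypothetical helper name:

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

def is_well_formed(markup):
    """Return True if minidom can parse the markup (hypothetical helper)."""
    try:
        parseString(markup)
        return True
    except ExpatError:
        return False

closed = is_well_formed("<div><br/></div>")   # self-closed tag
unclosed = is_well_formed("<div><br></div>")  # typical HTML, not valid XML
print(closed, unclosed)
```

For HTML that is not well-formed, an HTML-aware parser is the safer choice.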

Use etree.indent and etree.tostring

import lxml.etree as etree

root = etree.fromstring('<html><head></head><body><h1>Welcome</h1></body></html>')
etree.indent(root, space="  ")
xml_string = etree.tostring(root, pretty_print=True).decode()
print(xml_string)

Output:

<html>
  <head/>
  <body>
    <h1>Welcome</h1>
  </body>
</html>

Removing namespaces and prefixes

import lxml.etree as etree


def dump_xml(element):
    for item in element.iter():
        item.tag = etree.QName(item).localname

    etree.cleanup_namespaces(element)
    etree.indent(element, space="  ")
    result = etree.tostring(element, pretty_print=True).decode()
    return result


root = etree.fromstring('<cs:document xmlns:cs="http://blabla.com"><name>hello world</name></cs:document>')
xml_string = dump_xml(root)
print(xml_string)

Output:

<document>
  <name>hello world</name>
</document>
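For comparison, a rough standard-library analogue of the same idea, assuming Python 3.9 or later for ET.indent(). The helper strip_namespaces is hypothetical and only rewrites element tags, not namespaced attributes.

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Remove the '{uri}' prefix from every element tag (hypothetical helper)."""
    for el in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
    return root

root = ET.fromstring(
    '<cs:document xmlns:cs="http://blabla.com"><name>hello world</name></cs:document>'
)
strip_namespaces(root)
ET.indent(root, space="  ")
xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)
```

ElementTree stores qualified names as '{uri}local', which is why a simple string split is enough here.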

I found an easy way to nicely print an XML file:

import xml.etree.ElementTree as ET

xmlTree = ET.parse('your XML file')
xmlRoot = xmlTree.getroot()
ET.indent(xmlRoot)  # Python 3.9+; tostring() alone keeps the file's existing whitespace
xmlDoc = ET.tostring(xmlRoot, encoding="unicode")

print(xmlDoc)

Output:

<root>
  <child>
    <subchild>.....</subchild>
  </child>
  <child>
    <subchild>.....</subchild>
  </child>
  ...
  ...
  ...
  <child>
    <subchild>.....</subchild>
  </child>
</root>

I solved this with some lines of code: opening the file, going through it and adding indentation, then saving it again. I was working with small XML files, and did not want to add dependencies or more libraries for the user to install. Anyway, here is what I ended up with:

    f = open(file_name,'r')
    xml = f.read()
    f.close()

    #Removing old indentation
    raw_xml = ''
    for line in xml.splitlines():
        raw_xml += line.strip()

    xml = raw_xml

    new_xml = ''
    indent = '    '
    deepness = 0

    for i in range((len(xml))):

        new_xml += xml[i]   
        if(i<len(xml)-3):

            simpleSplit = xml[i:(i+2)] == '><'
            advancSplit = xml[i:(i+3)] == '></'
            end = xml[i:(i+2)] in ('/>', '?>')
            start = xml[i] == '<' and xml[i+1] != '/'
            close = xml[i:(i+2)] == '</'

            if(advancSplit):
                # the next tag closes the current element: one level shallower
                new_xml += '\n' + indent*(deepness-1)
            elif(simpleSplit):
                new_xml += '\n' + indent*deepness
            if(start):
                deepness += 1
            if(end or close):
                deepness += -1

    f = open(file_name,'w')
    f.write(new_xml)
    f.close()

It works for me; perhaps someone will have some use for it :)
