
Writing an XML header with LXML

I'm currently writing a script to convert a bunch of XML files from various encodings to a unified UTF-8.

I first try determining the encoding using LXML:

def get_source_encoding(self):
    tree = etree.parse(self.inputfile)
    encoding = tree.docinfo.encoding
    self.inputfile.seek(0)
    return (encoding or '').lower()

If that's blank, I try getting it from chardet:

def guess_source_encoding(self):
    chunk = self.inputfile.read(1024 * 10)
    self.inputfile.seek(0)
    # chardet.detect() returns a dict; the guess is under 'encoding' and may be None
    return (chardet.detect(chunk)['encoding'] or '').lower()

I then use codecs to convert the encoding of the file:

def convert_encoding(self, source_encoding, input_filename, output_filename):
    chunk_size = 16 * 1024

    with codecs.open(input_filename, "rb", source_encoding) as source:
        with codecs.open(output_filename, "wb", "utf-8") as destination:
            while True:
                chunk = source.read(chunk_size)

                if not chunk:
                    break

                destination.write(chunk)
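
As a minimal sketch of why the header still needs rewriting after this step: transcoding the bytes does not touch the XML declaration, so the file ends up UTF-8 encoded while still claiming its old encoding. The helper name transcode, the sample document, and ISO-8859-1 as the source encoding are assumptions for illustration, and the built-in open() with an encoding argument is used here in place of codecs.open.

```python
import os
import tempfile


def transcode(input_filename, output_filename, source_encoding, chunk_size=16 * 1024):
    """Copy a text file, decoding from source_encoding and re-encoding as UTF-8."""
    # newline="" disables newline translation so line endings pass through unchanged.
    with open(input_filename, "r", encoding=source_encoding, newline="") as source:
        with open(output_filename, "w", encoding="utf-8", newline="") as destination:
            # read() returns "" at end of file, which stops the iteration.
            for chunk in iter(lambda: source.read(chunk_size), ""):
                destination.write(chunk)


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.xml")
dst = os.path.join(workdir, "out.xml")

# Hypothetical sample document, stored as ISO-8859-1 bytes.
with open(src, "wb") as f:
    f.write('<?xml version="1.0" encoding="ISO-8859-1"?>\n<name>caf\u00e9</name>'.encode("iso-8859-1"))

transcode(src, dst, "iso-8859-1")

with open(dst, "rb") as f:
    result = f.read()

# The payload is now UTF-8, but the declaration still names the old encoding.
print(result)
```

A parser that trusts the declaration would now misread the file, which is what the header rewrite is meant to prevent.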

Finally, I'm attempting to rewrite the XML header. If the XML header was originally

<?xml version="1.0"?>

or

<?xml version="1.0" encoding="windows-1255"?>

I'd like to transform it to

<?xml version="1.0" encoding="UTF-8"?>

My current code doesn't seem to work:

def edit_header(self, input_filename):
    output_filename = tempfile.mktemp(suffix=".xml")

    with open(input_filename, "rb") as source:
        parser = etree.XMLParser(encoding="UTF-8")
        tree = etree.parse(source, parser)

        with open(output_filename, "wb") as destination:
            tree.write(destination, encoding="UTF-8")

The file I'm currently testing has a header that doesn't specify the encoding. How can I make it output the header properly with the encoding specified?

Try:

tree.write(destination, xml_declaration=True, encoding='UTF-8')

From the API docs:

xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None).

Sample from ipython:

In [15]:  etree.ElementTree(etree.XML('<hi/>')).write(sys.stdout, xml_declaration=True, encoding='UTF-8')
<?xml version='1.0' encoding='UTF-8'?>
<hi/>
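
The three settings described in the docs can be compared side by side. This is a small sketch, assuming Python 3 and an installed lxml; the helper name serialize and the choice of ISO-8859-1 as the non-UTF-8 example are mine, not part of the lxml API.

```python
from io import BytesIO

from lxml import etree


def serialize(xml_declaration, encoding):
    """Return the bytes that ElementTree.write() produces for <hi/>."""
    buffer = BytesIO()
    tree = etree.ElementTree(etree.XML("<hi/>"))
    tree.write(buffer, xml_declaration=xml_declaration, encoding=encoding)
    return buffer.getvalue()


always = serialize(True, "UTF-8")                # declaration forced on
never = serialize(False, "ISO-8859-1")           # declaration forced off
default_utf8 = serialize(None, "UTF-8")          # default: omitted for UTF-8
default_latin1 = serialize(None, "ISO-8859-1")   # default: added for other encodings

for label, data in [
    ("True / UTF-8", always),
    ("False / ISO-8859-1", never),
    ("None / UTF-8", default_utf8),
    ("None / ISO-8859-1", default_latin1),
]:
    # Report whether each output begins with an XML declaration.
    print(label, data.startswith(b"<?xml"))
```

This is why the question's code produced no header: with encoding="UTF-8" and the default xml_declaration=None, the declaration is left out.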

On reflection, I think you're trying way too hard. lxml automatically detects the encoding and correctly parses the file according to that encoding.

So all you really have to do (at least in Python 2.7) is:

def convert_encoding(self, source_encoding, input_filename, output_filename):
    # lxml reads the encoding from the file itself, so source_encoding is not needed here
    tree = etree.parse(input_filename)
    # binary mode: write() emits encoded bytes (plain 'w' only works on Python 2)
    with open(output_filename, 'wb') as destination:
        tree.write(destination, encoding='utf-8', xml_declaration=True)
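
For a quick end-to-end check of this approach, here is a sketch assuming Python 3 and an installed lxml. The function name convert_to_utf8, the temporary file names, and the ISO-8859-1 sample document are assumptions made for illustration; ISO-8859-1 stands in for whatever encoding the real files declare.

```python
import os
import tempfile

from lxml import etree


def convert_to_utf8(input_filename, output_filename):
    """Parse a file in its declared encoding and rewrite it as UTF-8 with a header."""
    tree = etree.parse(input_filename)
    source_encoding = tree.docinfo.encoding  # encoding lxml detected while parsing
    tree.write(output_filename, encoding="UTF-8", xml_declaration=True)
    return source_encoding


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "latin1.xml")
dst = os.path.join(workdir, "utf8.xml")

# Hypothetical input: a document that declares and uses ISO-8859-1.
with open(src, "wb") as f:
    f.write('<?xml version="1.0" encoding="ISO-8859-1"?>\n<name>caf\u00e9</name>'.encode("iso-8859-1"))

detected = convert_to_utf8(src, dst)

with open(dst, "rb") as f:
    converted = f.read()

print(detected)
print(converted.decode("utf-8"))
```

Note that this relies on the input declaring its encoding correctly; a file with a missing or wrong declaration would still need the detection step from the question.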
