Writing an XML header with LXML
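I'm currently writing a script to convert a bunch of XML files from various encodings to a unified UTF-8.
I first try to determine the encoding using LXML: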
def get_source_encoding(self):
    tree = etree.parse(self.inputfile)
    encoding = tree.docinfo.encoding
    self.inputfile.seek(0)
    return (encoding or '').lower()
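As a rough, standalone illustration of what `docinfo.encoding` reports, the sketch below parses an in-memory document instead of `self.inputfile`. The helper name `declared_encoding` and the ISO-8859-1 sample are assumptions made for the example, not part of the original script.

```python
import io

from lxml import etree


def declared_encoding(xml_bytes):
    """Return the lower-cased encoding lxml reports for the document, or ''."""
    tree = etree.parse(io.BytesIO(xml_bytes))
    return (tree.docinfo.encoding or '').lower()


# Hypothetical sample: a Latin-1 document that declares its own encoding.
sample = b'<?xml version="1.0" encoding="ISO-8859-1"?><greeting>caf\xe9</greeting>'
print(declared_encoding(sample))
```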
If that comes back blank, I try to get it from chardet:
def guess_source_encoding(self):
    chunk = self.inputfile.read(1024 * 10)
    self.inputfile.seek(0)
    # chardet.detect() returns a dict; its 'encoding' entry may be None
    return (chardet.detect(chunk)['encoding'] or '').lower()
Then I use codecs to convert the file's encoding:
def convert_encoding(self, source_encoding, input_filename, output_filename):
    chunk_size = 16 * 1024
    with codecs.open(input_filename, "rb", source_encoding) as source:
        with codecs.open(output_filename, "wb", "utf-8") as destination:
            while True:
                chunk = source.read(chunk_size)
                if not chunk:
                    break
                destination.write(chunk)
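For reference, here is a self-contained sketch of the same chunked re-encoding step using `io.open` from the standard library instead of `codecs.open`. The function name `transcode`, the temporary file names, and the ISO-8859-1 sample text are assumptions for illustration. Note that this step only re-encodes the characters; it does not touch the XML declaration, which is why the header still needs separate handling.

```python
import io
import os
import tempfile


def transcode(input_filename, output_filename, source_encoding, chunk_size=16 * 1024):
    """Copy a text file, decoding from source_encoding and re-encoding as UTF-8."""
    # newline="" keeps the original line endings untouched in both directions.
    with io.open(input_filename, "r", encoding=source_encoding, newline="") as source:
        with io.open(output_filename, "w", encoding="utf-8", newline="") as destination:
            while True:
                chunk = source.read(chunk_size)
                if not chunk:
                    break
                destination.write(chunk)


# Small demonstration with a hypothetical Latin-1 input file.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.txt")
dst = os.path.join(workdir, "out.txt")
with open(src, "wb") as f:
    f.write(u"caf\u00e9".encode("iso-8859-1"))

transcode(src, dst, "iso-8859-1")

with open(dst, "rb") as f:
    result = f.read()
```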
Finally, I'm trying to rewrite the XML header. If the XML header was originally
<?xml version="1.0"?>
or
<?xml version="1.0" encoding="windows-1255"?>
I want to turn it into
<?xml version="1.0" encoding="UTF-8"?>
My current code doesn't seem to work:
def edit_header(self, input_filename):
    output_filename = tempfile.mktemp(suffix=".xml")
    with open(input_filename, "rb") as source:
        parser = etree.XMLParser(encoding="UTF-8")
        tree = etree.parse(source, parser)
        with open(output_filename, "wb") as destination:
            tree.write(destination, encoding="UTF-8")
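The file I'm testing with has a header that specifies no encoding. How can I make it output the header correctly, with the encoding specified?
Try: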
tree.write(destination, xml_declaration=True, encoding='UTF-8')
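From the API documentation:
xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None).
Example from ipython: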
In [15]: etree.ElementTree(etree.XML('<hi/>')).write(sys.stdout, xml_declaration=True, encoding='UTF-8')
<?xml version='1.0' encoding='UTF-8'?>
<hi/>
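The session above writes straight to `sys.stdout`, which relies on Python 2, where the stream accepts byte strings. A minimal sketch of the same call that should also run on Python 3, assuming an in-memory `io.BytesIO` buffer as the target (the buffer is an addition for illustration):

```python
import io

from lxml import etree

tree = etree.ElementTree(etree.XML('<hi/>'))

# lxml serializes to bytes when an encoding is given, so write to a binary target.
buffer = io.BytesIO()
tree.write(buffer, xml_declaration=True, encoding='UTF-8')

output = buffer.getvalue()
print(output.decode('utf-8'))
```

Because the documented default of `None` skips the declaration for US-ASCII and UTF-8 output, passing `xml_declaration=True` explicitly is what forces the header to appear here.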
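On reflection, I think you're trying too hard. lxml automatically detects the encoding and correctly parses the file based on that encoding.
So all you really need to do (at least in Python 2.7) is: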
def convert_encoding(self, source_encoding, input_filename, output_filename):
    tree = etree.parse(input_filename)
    # open in binary mode: lxml writes encoded bytes, which also works on Python 3
    with open(output_filename, 'wb') as destination:
        tree.write(destination, encoding='utf-8', xml_declaration=True)
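To see the whole round trip in one place, here is a hedged end-to-end sketch: it writes a small sample file that declares a non-UTF-8 encoding, lets lxml parse it, and writes it back out as UTF-8 with a declaration. The function name `convert_to_utf8`, the file names, and the ISO-8859-1 sample are assumptions chosen for the example; the output file is opened in binary mode so the same code should run on Python 2.7 and Python 3.

```python
import os
import tempfile

from lxml import etree


def convert_to_utf8(input_filename, output_filename):
    """Parse with lxml's own encoding detection, then re-serialize as UTF-8."""
    tree = etree.parse(input_filename)
    with open(output_filename, 'wb') as destination:
        tree.write(destination, encoding='UTF-8', xml_declaration=True)


# Hypothetical sample input: Latin-1 bytes with a matching declaration.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'latin1.xml')
dst = os.path.join(workdir, 'utf8.xml')
with open(src, 'wb') as f:
    f.write(b'<?xml version="1.0" encoding="ISO-8859-1"?><greeting>caf\xe9</greeting>')

convert_to_utf8(src, dst)

with open(dst, 'rb') as f:
    converted = f.read()
print(converted.decode('utf-8'))
```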