Unicode: Python/lxml file output not as expected (print vs. write)

Question

I'm parsing an XML file using the code below:

from lxml import etree

file_name = input('Enter the file name, including .xml extension: ')
print('Parsing ' + file_name)

parser = etree.XMLParser()
tree = etree.parse(file_name, parser)
root = tree.getroot()

# Map a prefix to the document's namespace for the XPath lookups below
nsmap = {'xmlns': 'urn:tva:metadata:2010'}

with open(file_name + '.log', 'w', encoding='utf-8') as f:
    for info in root.xpath('//xmlns:ProgramInformation', namespaces=nsmap):
        crid = info.get('programId')
        titlex = info.find('.//xmlns:Title', namespaces=nsmap)
        title = titlex.text if titlex is not None else 'Missing'
        synopsis1x = info.find('.//xmlns:Synopsis[1]', namespaces=nsmap)
        synopsis1 = synopsis1x.text if synopsis1x is not None else 'Missing'
        # Strip line breaks so each record stays on one line
        synopsis1 = synopsis1.replace('\r', '').replace('\n', '')
        f.write('{}|{}|{}\n'.format(crid, title, synopsis1))

Let's take an example title of 'Přešité bydlení'. If I print the title whilst parsing the file, it comes out as expected. When I write it out, however, it displays as 'PÅ™eÅ¡itÃ© bydlenÃ'.

I understand that this is to do with encoding (I was able to change the print command to use UTF-8 and 'corrupt' the output), but I couldn't get the written output to appear as I wanted. I had a look at the codecs library, but wasn't successful. Adding 'encoding = "utf-8"' to the XML Parser line didn't make any difference.

How can I configure the written output to be human readable?

Answer 1

I've had all sorts of trouble with this before, but the solution is rather simple. The documentation has a chapter on how to read and write Unicode to a file. This Python talk is also very enlightening for understanding the issue. Unicode can be a pain, though it gets a lot easier if you start using Python 3.

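import codecs

# Open for reading and writing; codecs encodes and decodes UTF-8 for us
f = codecs.open('test', encoding='utf-8', mode='w+')
f.write(u'\u4500 blah blah blah\n')
f.seek(0)
# Show the first character read back (the print() form also runs on Python 3)
print(repr(f.readline()[:1]))
f.close()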
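
On Python 3, which the question's code appears to use (it calls input() and passes encoding= to open()), the built-in open() can do the same job without codecs. A minimal sketch of a write-then-read round trip; the roundtrip helper and the temporary file path are illustrative names, not part of the original code:

```python
import os
import tempfile


def roundtrip(text, encoding='utf-8'):
    """Write text to a temporary file and read it back with the same encoding."""
    # Hypothetical scratch location, used only for this demonstration
    path = os.path.join(tempfile.mkdtemp(), 'test.log')
    with open(path, 'w', encoding=encoding) as f:
        f.write(text + '\n')
    with open(path, 'r', encoding=encoding) as f:
        return f.readline().rstrip('\n')


# Print a boolean rather than the text itself, so the check does not
# depend on the console's own encoding
print(roundtrip('Přešité bydlení') == 'Přešité bydlení')
```

If the round trip succeeds, the file on disk is valid UTF-8, and any garbling seen afterwards is more likely to come from whatever program is displaying the file.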

Answer 2

Your code looks OK, so I reckon your input is duff. Assuming you're viewing your output file with a UTF-8 viewer or shell, I suspect that the encoding declared in the <?xml line doesn't match the actual encoding of the file.

This would explain why printing works but writing to a file does not. If your shell/IDE is set to "ISO-8859-2" and your input XML is also "ISO-8859-2", then printing is just pushing out the raw encoding.
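
The assumption about the viewer is worth testing first. The garbled string in the question matches what the title's UTF-8 bytes look like when decoded as Windows-1252. A small sketch of that effect; the choice of cp1252 as the wrong codec is an assumption based on that match, not something confirmed in the thread:

```python
title = 'Přešité bydlení'

# The bytes that open(..., encoding='utf-8') would write for this title
utf8_bytes = title.encode('utf-8')

# Assumption: the program displaying the file decodes it as Windows-1252
garbled = utf8_bytes.decode('cp1252')
print(ascii(garbled))  # ascii() avoids depending on the console encoding

# Reversing the wrong decode recovers the original text
repaired = garbled.encode('cp1252').decode('utf-8')
print(repaired == title)
```

If the file shows this pattern, the written bytes are probably fine and the viewer's encoding setting is the thing to change; if the viewer really is using UTF-8, the declared-versus-actual mismatch in the input becomes the more likely cause.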


Answer 1
2, accepted, 2014-04-03 14:19:15

Answer 2
0, 2014-04-03 21:27:48
