Python codec error during file write with UTF-8 string
I'm working on a Python 3 Tkinter app (OS is Windows 10) whose overall functionality includes the following:
1. Reading a number of text files which may contain data in ascii, cp1252, utf-8, or any other encoding.
2. Showing the contents of any of those files in a "preview window" (a Tkinter Label widget).
3. Writing the file contents to a single output file (opened for appending each time).
For #1: I've made the file read encoding-agnostic simply by opening and reading the files in binary mode. To convert the data to a string I use a loop which runs through a list of 'likely' encodings and tries each of them in turn (with errors='strict') until it hits one that doesn't throw an exception. This is working.
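The decoding loop described for #1 might look roughly like the sketch below. The candidate list, its order, and the name decode_with_fallback are assumptions made for illustration; the real application's list is not shown in the question.

```python
import codecs

# Hypothetical candidate list. Order matters: strict codecs come first,
# permissive single-byte codecs such as cp1252 come last.
CANDIDATE_ENCODINGS = ['ascii', 'utf-8', 'cp1252']


def decode_with_fallback(data, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first codec that decodes data strictly."""
    for name in encodings:
        try:
            return codecs.decode(data, name, 'strict'), name
        except UnicodeDecodeError:
            # This codec rejected the bytes; try the next candidate.
            continue
    raise ValueError('none of the candidate encodings could decode the data')


text, used = decode_with_fallback(b'\xe0\xbf\xf6')
print(used)
```

Note that a heuristic like this only finds an encoding that does not fail, which is not necessarily the encoding the file was written in.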
For #2: Once I've got the decoded string I just call the set() method for the Tkinter Label's textvariable. This is also working.
For #3: I'm opening an output file in the usual way and calling the write() method to write the decoded string. This works when the string was decoded as ascii or cp1252, but when it's decoded as utf-8 it throws an exception:
'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
I've searched around and found fairly similar questions but nothing that seems to address this specific problem. Some further complications that restrict the solutions that will work for me:
A. I can sidestep the problem just by leaving the read-in data as bytes and opening/writing the output file as binary, but this renders some of the input file contents unreadable.
B. Although this app is mainly intended for Python 3, I'm trying to make it cross-compatible with Python 2 -- we have some slow/late adopters who will be using it. (BTW, when I run the app on Python 2 it also throws exceptions but does so for both the cp1252 data and the utf-8 data.)
For the sake of illustrating the issue, I've created this stripped-down test script. (My real application is a much larger project, and it's also proprietary to my company -- so it's not getting posted publicly!)
import tkinter as tk
import codecs
#Root window
root = tk.Tk()
#Widgets
ctrlViewFile1 = tk.StringVar()
ctrlViewFile2 = tk.StringVar()
ctrlViewFile3 = tk.StringVar()
lblViewFile1 = tk.Label(root, relief=tk.SUNKEN,
                        justify=tk.LEFT, anchor=tk.NW,
                        width=10, height=3,
                        textvariable=ctrlViewFile1)
lblViewFile2 = tk.Label(root, relief=tk.SUNKEN,
                        justify=tk.LEFT, anchor=tk.NW,
                        width=10, height=3,
                        textvariable=ctrlViewFile2)
lblViewFile3 = tk.Label(root, relief=tk.SUNKEN,
                        justify=tk.LEFT, anchor=tk.NW,
                        width=10, height=3,
                        textvariable=ctrlViewFile3)
#Layout
lblViewFile1.grid(row=0,column=0,padx=5,pady=5)
lblViewFile2.grid(row=1,column=0,padx=5,pady=5)
lblViewFile3.grid(row=2,column=0,padx=5,pady=5)
#Bytes read from "files" (ascii Az5, cp1252 European letters/punctuation, utf-8 Mandarin characters)
inBytes1 = b'\x41\x7a\x35'
inBytes2 = b'\xe0\xbf\xf6'
inBytes3 = b'\xef\xbb\xbf\xe6\x9c\xa8\xe5\x85\xb0\xe8\xbe\x9e'
#Decode
outString1 = codecs.decode(inBytes1,'ascii','strict')
outString2 = codecs.decode(inBytes2,'cp1252','strict')
outString3 = codecs.decode(inBytes3,'utf_8','strict')
#Assign stringvars
ctrlViewFile1.set(outString1)
ctrlViewFile2.set(outString2)
ctrlViewFile3.set(outString3)
#Write output files
try:
    with open('out1.txt','w') as outFile:
        outFile.write(outString1)
except Exception as e:
    print(inBytes1)
    print(str(e))
try:
    with open('out2.txt','w') as outFile:
        outFile.write(outString2)
except Exception as e:
    print(inBytes2)
    print(str(e))
try:
    with open('out3.txt','w') as outFile:
        outFile.write(outString3)
except Exception as e:
    print(inBytes3)
    print(str(e))
#Start GUI
tk.mainloop()
I understand you want two things: to write arbitrary Unicode text to a file, and to stay compatible with both Python 2 and Python 3.
Using open('out1.txt','w') violates both: without an explicit encoding, the file is written with the platform's default encoding, which (as the error in the question shows) cannot represent all of Unicode; and the open function differs considerably between Python versions. In Python 3, it is the io.open function, which offers a lot of flexibility, such as specifying a text encoding. In Python 2, the returned file handle processes 8-bit strings rather than Unicode strings (text).
You can avoid all this with io.open('out1.txt', 'w', encoding='utf8'): UTF-8 can encode every Unicode code point, and the io module was backported to Python 2.7. This generally qualifies as Py2/3 compatible, since support for versions <= 2.6 ended quite some time ago.
Side note: You mention a simple heuristic for detecting the input codec. If there's really no way to obtain this information, you should consider using chardet.
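To make the io.open suggestion concrete for the "append to a single output file" requirement, here is a minimal sketch. The helper name append_text and the temporary output path are hypothetical and exist only for this demonstration; the byte strings are the ones from the question's test script (without the BOM on the third).

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def append_text(path, text):
    """Append already-decoded text to path, always encoding it as UTF-8."""
    # io.open takes an encoding argument on both Python 2.7 and Python 3.
    with io.open(path, 'a', encoding='utf-8') as out_file:
        out_file.write(text)


# Hypothetical output location used only for this demonstration.
out_path = os.path.join(tempfile.mkdtemp(), 'combined.txt')

append_text(out_path, b'\x41\x7a\x35'.decode('ascii') + u'\n')
append_text(out_path, b'\xe0\xbf\xf6'.decode('cp1252') + u'\n')
append_text(out_path, b'\xe6\x9c\xa8\xe5\x85\xb0\xe8\xbe\x9e'.decode('utf-8') + u'\n')

# Read the file back with the same encoding to check the round trip.
with io.open(out_path, 'r', encoding='utf-8') as in_file:
    lines = in_file.read().splitlines()
```

On Python 2, the file object returned by io.open accepts only unicode objects in write(), so the text must be decoded before it is written, as it is here.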
Be explicit. You've opened for write using a default encoding. Whatever it is, it doesn't support all Unicode code points. Open the file with UTF-8 encoding, which does support all Unicode code points:
import io
with io.open('out3.txt','w',encoding='utf8') as outFile:
    outFile.write(outString3)
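As a rough illustration of why the default encoding fails, the sketch below encodes the question's third string by hand. It assumes cp1252 as the default, which is common on Western-locale Windows installations; the actual default comes from the locale and may differ.

```python
raw = b'\xef\xbb\xbf\xe6\x9c\xa8\xe5\x85\xb0\xe8\xbe\x9e'
text = raw.decode('utf-8')

# Assumption: cp1252 stands in for the platform default encoding.
try:
    text.encode('cp1252')
    cp1252_ok = True
except UnicodeEncodeError as exc:
    cp1252_ok = False
    print(exc)

# UTF-8 can encode every code point, so this round trip succeeds.
utf8_bytes = text.encode('utf-8')

# The leading b'\xef\xbb\xbf' is a UTF-8 byte order mark. Decoding with
# 'utf-8-sig' drops it instead of keeping it as the character U+FEFF.
without_bom = raw.decode('utf-8-sig')
```

The decoded string has four characters (the BOM plus three Chinese characters), which is consistent with the "position 0-3" in the question's error message. Whether to strip the BOM with 'utf-8-sig' is a separate choice that depends on what the output file's consumers expect.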