Python codec error during file write with UTF-8 string

I'm working on a Python 3 Tkinter app (OS is Windows 10) whose overall functionality includes the following details:

  1. Reading a number of text files which may contain data in ascii, cp1252, utf-8, or any other encoding.

  2. Showing the contents of any of those files in a "preview window" (Tkinter Label widget).

  3. Writing the file contents to a single output file (opening to append each time).

For #1: I've made the file read encoding-agnostic simply by opening and reading the files in binary mode. To convert the data to a string I use a loop which runs through a list of 'likely' encodings and tries each of them in turn (with errors='strict') until it hits one that doesn't throw an exception. This is working.
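
A minimal sketch of the kind of fallback loop described above. The candidate list and the function name are hypothetical (the real application's list isn't shown). Order matters: permissive single-byte codecs such as cp1252 accept almost any byte sequence, so they belong at the end of the list.

```python
# Hypothetical candidate list; the real application's list is not shown.
CANDIDATE_ENCODINGS = ['ascii', 'utf_8', 'cp1252']


def decode_with_fallback(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding_name) for the first codec that decodes strictly."""
    for name in encodings:
        try:
            return raw.decode(name, 'strict'), name
        except UnicodeDecodeError:
            # This codec rejected the bytes; try the next candidate.
            continue
    raise ValueError('none of the candidate encodings could decode the data')


if __name__ == '__main__':
    for raw in (b'\x41\x7a\x35', b'\xe0\xbf\xf6'):
        text, name = decode_with_fallback(raw)
        print(name, repr(text))
```

Note that a successful strict decode only shows that the bytes are valid in that codec, not that the codec is the one the file was written with.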

For #2: Once I've got the decoded string I just call the set() method for the Tkinter Label's textvariable. This is also working.

For #3: I'm opening an output file in the usual way and calling the write() method to write the decoded string. This works when the string was decoded as ascii or cp1252, but when it's decoded as utf-8 it throws an exception:

'charmap' codec can't encode characters in position 0-3: character maps to <undefined>

I've searched around and found fairly similar questions but nothing that seems to address this specific problem. Some further complications that restrict the solutions that will work for me:

A. I can sidestep the problem just by leaving the read-in data as bytes and opening/writing the output file as binary, but this renders some of the input file contents unreadable.

B. Although this app is mainly intended for Python 3, I'm trying to make it cross-compatible with Python 2 -- we have some slow/late adopters who will be using it. (BTW, when I run the app on Python 2 it also throws exceptions but does so for both the cp1252 data and the utf-8 data.)

For the sake of illustrating the issue, I've created this stripped-down test script. (My real application is a much larger project, and it's also proprietary to my company -- so it's not getting posted publicly!)

import tkinter as tk
import codecs

#Root window
root = tk.Tk()

#Widgets
ctrlViewFile1 = tk.StringVar()
ctrlViewFile2 = tk.StringVar()
ctrlViewFile3 = tk.StringVar()
lblViewFile1 = tk.Label(root, relief=tk.SUNKEN,
                        justify=tk.LEFT, anchor=tk.NW,
                        width=10, height=3,
                        textvariable=ctrlViewFile1)
lblViewFile2 = tk.Label(root, relief=tk.SUNKEN,
                        justify=tk.LEFT, anchor=tk.NW,
                        width=10, height=3,
                        textvariable=ctrlViewFile2)
lblViewFile3  = tk.Label(root, relief=tk.SUNKEN,
                         justify=tk.LEFT, anchor=tk.NW,
                         width=10, height=3,
                         textvariable=ctrlViewFile3)

#Layout
lblViewFile1.grid(row=0,column=0,padx=5,pady=5)
lblViewFile2.grid(row=1,column=0,padx=5,pady=5)
lblViewFile3.grid(row=2,column=0,padx=5,pady=5)

#Bytes read from "files" (ascii Az5, cp1252 European letters/punctuation, utf-8 Mandarin characters)
inBytes1 = b'\x41\x7a\x35'
inBytes2 = b'\xe0\xbf\xf6'
inBytes3 = b'\xef\xbb\xbf\xe6\x9c\xa8\xe5\x85\xb0\xe8\xbe\x9e'

#Decode
outString1 = codecs.decode(inBytes1,'ascii','strict')
outString2 = codecs.decode(inBytes2,'cp1252','strict')
outString3 = codecs.decode(inBytes3,'utf_8','strict')

#Assign stringvars
ctrlViewFile1.set(outString1)
ctrlViewFile2.set(outString2)
ctrlViewFile3.set(outString3)

#Write output files
try:
    with open('out1.txt','w') as outFile:
        outFile.write(outString1)
except Exception as e:
    print(inBytes1)
    print(str(e))

try:
    with open('out2.txt','w') as outFile:
        outFile.write(outString2)
except Exception as e:
    print(inBytes2)
    print(str(e))

try:
    with open('out3.txt','w') as outFile:
        outFile.write(outString3)
except Exception as e:
    print(inBytes3)
    print(str(e))

#Start GUI
tk.mainloop()

I understand you want two things:

  • a way to write arbitrary Unicode characters to a file, and
  • Python 2/3 compatibility.

Using open('out1.txt','w') violates both:

  • The output text stream is opened with a default encoding, which happens to be CP-1252 on your platform (apparently Windows). This codec supports only a subset of Unicode, e.g. it lacks all emoji.
  • The open function differs considerably between Python versions. In Python 3, it is the io.open function, which offers a lot of flexibility, such as specifying a text encoding. In Python 2, the returned file handle processes 8-bit strings rather than Unicode strings (text).
  • There's also a portability issue of which you might not be aware: the default encoding for IO is platform dependent, i.e. people running your code might see a different default depending on OS and localisation.
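
To make the points above concrete, here is a small sketch that prints the platform's default text encoding and checks which codecs can represent the sample strings from the question. The helper name can_encode is made up for this illustration.

```python
import locale


def can_encode(text, encoding):
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The encoding that open() falls back to for text files when none is given.
# The value depends on the OS and its regional settings.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

sample = u'\u6728\u5170\u8f9e'  # the three Chinese characters from inBytes3
print(can_encode(sample, 'cp1252'))           # False
print(can_encode(sample, 'utf_8'))            # True
print(can_encode(u'\xe0\xbf\xf6', 'cp1252'))  # True
```

The cp1252 text from the question round-trips because those characters exist in the default codec; the Chinese text does not, which is where the 'charmap' error comes from.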

You can avoid all this with io.open('out1.txt', 'w', encoding='utf8'):

  • Use an encoding that supports all characters needed. Using the detected input encoding should work, unless processing introduces characters outside the supported range. Using one of the UTF codecs will always work, with UTF-8 being the most widely used for text files. Note that some Windows apps (like Notepad) tend not to understand UTF-8.
  • The io module was backported to Python 2.7. This generally qualifies as Py2/3 compatible, since support for versions <= 2.6 ended quite some time ago.
  • Be explicit about the encoding used whenever opening text files. There might be scenarios where the platform-dependent default encoding makes sense, but usually you want control.
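
Since the question appends to a single output file, the advice above could be wrapped roughly like this. It is only a sketch: the function name append_text and the file name are placeholders, and on Python 2 the text argument has to be a unicode object (which is what decoding bytes gives you).

```python
import io


def append_text(path, text, encoding='utf-8'):
    """Append already-decoded text to path using an explicit encoding."""
    # io.open takes the same arguments on Python 2.7 and Python 3;
    # on Python 3 the built-in open() is the same function.
    with io.open(path, 'a', encoding=encoding) as out_file:
        out_file.write(text)


if __name__ == '__main__':
    # 'combined_output.txt' is a placeholder file name.
    append_text('combined_output.txt', u'Az5')
    append_text('combined_output.txt', u'\u6728\u5170\u8f9e')
```

Every string, whatever encoding it was decoded from, ends up in the output file as UTF-8, so the file stays consistent even when the inputs are mixed.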

Side note: You mention a simple heuristic for detecting the input codec. If there's really no way to obtain this information, you should consider using chardet.
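
A rough sketch of how chardet could slot in. chardet is a third-party package (pip install chardet); chardet.detect() takes bytes and returns a dict whose 'encoding' entry may be None. The wrapper function, the fallback codec, and the use of errors='replace' are assumptions made for this example, not something the question specifies.

```python
try:
    import chardet  # third-party package, not part of the standard library
except ImportError:
    chardet = None


def guess_and_decode(raw, fallback='utf_8'):
    """Decode raw bytes with chardet's best guess, else with the fallback."""
    guess = chardet.detect(raw)['encoding'] if chardet else None
    try:
        return raw.decode(guess or fallback, 'replace')
    except LookupError:
        # chardet named an encoding this Python build has no codec for.
        return raw.decode(fallback, 'replace')


if __name__ == '__main__':
    print(guess_and_decode(b'\x41\x7a\x35'))
```

Detection is statistical, so very short inputs like the three-byte samples in the question can be guessed wrongly; it works better on whole files.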

Be explicit. You've opened for write using a default encoding. Whatever it is, it doesn't support all Unicode code points. Open the file with UTF-8 encoding, which does support all Unicode code points:

import io
# outString3 is the decoded text from the question's script
with io.open('out3.txt','w',encoding='utf8') as outFile:
    outFile.write(outString3)
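
One more detail about the sample data: inBytes3 starts with EF BB BF, the UTF-8 byte order mark. The utf_8 codec keeps it as the character U+FEFF, which is why the error in the question reports positions 0-3 (four characters) for three visible ones. If the BOM shouldn't end up in the middle of an appended output file, decoding with utf_8_sig strips it. A short sketch using the question's bytes:

```python
import codecs

in_bytes = b'\xef\xbb\xbf\xe6\x9c\xa8\xe5\x85\xb0\xe8\xbe\x9e'

# 'utf_8' keeps the BOM as U+FEFF; 'utf_8_sig' removes it if present.
with_bom = codecs.decode(in_bytes, 'utf_8', 'strict')
without_bom = codecs.decode(in_bytes, 'utf_8_sig', 'strict')

print(len(with_bom))     # 4
print(len(without_bom))  # 3
print(with_bom[0] == u'\ufeff')  # True
```

utf_8_sig also decodes data that has no BOM, so it can replace utf_8 in a list of candidate encodings.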
