Reading a binary file in read mode Python 3 - passes on Windows, fails on Linux
I am executing this piece of code against Python on Windows:
'3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]'
and Python on Linux:
'3.6.6 (default, Mar 29 2019, 00:03:27) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]'
The code writes some bytes into a file using wb mode and then reads it back in r mode, as plain text. I understand that I should be reading it as bytes (rb), but I am curious: why does it break on Linux while passing on Windows?
import os
import tempfile

temp_dir = tempfile.mkdtemp()
temp_file = os.path.join(temp_dir, 'write_file')
expected_bytes = bytearray([123, 3, 255, 0, 100])
with open(temp_file, 'wb') as fh:
    fh.write(expected_bytes)

with open(temp_file, 'r', newline='') as fh:
    actual = fh.read()
Exception raised on Linux:
Traceback (most recent call last):
  File "<input>", line 11, in <module>
  File "/home/.../lib64/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte
Getting the default system encoding (with sys.getdefaultencoding()) shows 'utf-8' on both machines.
When a file is opened in text mode, that is with 'rt' (both 'r' and 't' are the defaults), everything you read from it is transparently decoded on the fly and returned as str objects, as explained in Text I/O.
You can specify the encoding explicitly when opening the file, for example:
f = open("myfile.txt", "r", encoding="utf-8")
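As a minimal sketch of why pinning the encoding matters, the snippet below writes a short string as UTF-8 and reads it back twice: once with the matching encoding and once with a mismatched one. The file name demo.txt and the sample text are assumptions made for illustration only.

```python
import os
import tempfile

# Hypothetical sample text containing one non-ASCII character (e with acute accent).
text = "caf\u00e9"

with tempfile.TemporaryDirectory() as temp_dir:
    path = os.path.join(temp_dir, "demo.txt")  # hypothetical file name

    # Write with an explicit encoding so the bytes on disk are known.
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)

    # Reading with the same explicit encoding gives the text back on any platform.
    with open(path, "r", encoding="utf-8") as fh:
        pinned = fh.read()

    # Reading with a different encoding does not fail here, but yields mojibake.
    with open(path, "r", encoding="cp1252") as fh:
        mismatched = fh.read()

print(repr(pinned))
print(repr(mismatched))
```

With the encoding stated in every open() call, the result no longer depends on the locale of the machine running the code.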
As explained in the documentation for open:
The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
(Note that sys.getdefaultencoding() is unrelated: it returns the name of the current default string encoding used by the Unicode implementation.)
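To compare the two settings on a given machine, something like the following sketch can be run on both systems. The helper name describe_encodings is made up for this example, and os.devnull is used only as a file that is always available to open.

```python
import locale
import os
import sys


def describe_encodings():
    """Collect the encoding-related settings discussed above (hypothetical helper)."""
    # Encoding used by the Unicode implementation; 'utf-8' on Python 3.
    sys_default = sys.getdefaultencoding()
    # Locale-derived encoding that open() falls back to in text mode.
    locale_preferred = locale.getpreferredencoding(False)
    # Encoding a text-mode file object actually uses when none is given.
    with open(os.devnull, "r") as fh:
        file_encoding = fh.encoding
    return {
        "sys.getdefaultencoding()": sys_default,
        "locale.getpreferredencoding(False)": locale_preferred,
        "open(...).encoding": file_encoding,
    }


info = describe_encodings()
for name, value in info.items():
    print(f"{name}: {value}")
```

The last two values are the ones that decide how a file opened with 'r' is decoded; the first one does not affect open().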
As you stated in the comments, on your systems locale.getpreferredencoding() gives 'cp1252' on Windows and 'UTF-8' on Linux.
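CP-1252 is a single-byte encoding in which almost every byte value corresponds to a character (Python's cp1252 codec leaves only 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined). So every byte in your file, including 0xFF, can be turned into some character, and the read succeeds on Windows even though the resulting string is meaningless.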
UTF-8, though, is a variable-width encoding in which not every sequence of bytes is valid. That's why reading your file failed on your Linux system: the byte 0xff cannot start a valid UTF-8 sequence, so it could not be decoded.
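The difference can be reproduced on any platform without touching a file, by decoding the same five bytes from the question with each codec directly. This is only an illustrative sketch; the variable names are chosen for the example.

```python
# The same byte values the question writes to the file.
data = bytes([123, 3, 255, 0, 100])

# cp1252 maps each of these bytes to a character, so decoding succeeds.
as_cp1252 = data.decode("cp1252")

# UTF-8 rejects 0xFF, so decoding raises UnicodeDecodeError.
utf8_error = None
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_error = exc

print(repr(as_cp1252))
print(utf8_error)
```

The error reports position 2, which is the index of the byte 255 in the data, matching the traceback seen on Linux.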
If you have written the file out as bytes, you should read it in as bytes.
f = open("myfile.txt", "rb")
If you read it in as text (using "r" or "rt"), an attempt will be made to decode it into Unicode. The encoding used by default is platform-dependent. But you clearly don't want it decoded at all.
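Applied to the code from the question, a binary round trip might look like the sketch below. It keeps the question's variable names but uses tempfile.TemporaryDirectory instead of mkdtemp so the directory is cleaned up automatically; that substitution is a choice made for this example.

```python
import os
import tempfile

expected_bytes = bytearray([123, 3, 255, 0, 100])

with tempfile.TemporaryDirectory() as temp_dir:
    temp_file = os.path.join(temp_dir, 'write_file')

    # Write the raw bytes.
    with open(temp_file, 'wb') as fh:
        fh.write(expected_bytes)

    # Read them back in binary mode: no decoding happens, so no locale is involved.
    with open(temp_file, 'rb') as fh:
        actual = fh.read()

print(actual == bytes(expected_bytes))
```

Because 'rb' returns a bytes object, the comparison behaves the same on Windows and Linux.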