
Python: read a non-ASCII text file

I am trying to load a text file that contains some German letters with

content=open("file.txt","r").read() 

which results in this error message:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)

If I modify the file to contain only ASCII characters, everything works as expected.

Apparently, using

content=open("file.txt","rb").read() 

or

content=open("file.txt","r",encoding="utf-8").read()

both do the job.

Why is it possible to read with "binary" mode and get the same result as with UTF-8 encoding?

In Python 3, opening a file in 'r' mode without specifying an encoding uses a platform-dependent default encoding, which in this case is ASCII. Opening it in 'rb' mode reads the file as bytes and makes no attempt to interpret it as a string of characters.
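To check which default applies on a given machine, a minimal sketch like the following may help. It assumes the documented behaviour that text-mode open() falls back to the locale's preferred encoding when no encoding is passed; the variable name default_encoding is only illustrative.

```python
import locale

# Encoding that text-mode open() falls back to when no `encoding`
# argument is given (assumption: standard CPython behaviour).
default_encoding = locale.getpreferredencoding(False)

print(default_encoding)
```

If this prints an ASCII variant such as ANSI_X3.4-1968 or US-ASCII, the error in the question is the expected outcome for any file containing non-ASCII bytes.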

ASCII is limited to characters in the range [0, 128). If you try to decode a byte outside that range, you get that error.
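The error can be reproduced without any file at all. This is a small sketch that uses the letter ä as a stand-in for the German letters in the question; the names raw and message are illustrative.

```python
# UTF-8 encodes "ä" as two bytes, the first of which is 0xc3.
raw = "ä".encode("utf-8")

try:
    raw.decode("ascii")          # 0xc3 is outside ASCII's range [0, 128)
except UnicodeDecodeError as exc:
    message = str(exc)

print(raw)
print(message)
```

The message names the same offending byte, 0xc3, as the traceback in the question.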

When you read the file in as bytes, you're "widening" the acceptable range of values to [0, 256). So your 0xc3 byte (Ã, if it is read as a single Latin-1 character) is now read in without error. But although it seems to work, it's still not "correct": what you get back is a bytes object, not a decoded string.
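The two reads do not actually return the same thing, which a short experiment can show. The sketch below writes a throwaway file in a temporary directory (the sample text and the names as_bytes and as_text are assumptions for illustration) and reads it back in both modes.

```python
import os
import tempfile

text = "Grüße aus Köln"  # sample text with three non-ASCII letters

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: raw bytes, no decoding attempted.
    with open(path, "rb") as f:
        as_bytes = f.read()

    # Text mode with an explicit encoding: decoded str.
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()

print(type(as_bytes).__name__, len(as_bytes))
print(type(as_text).__name__, len(as_text))
print(as_bytes.decode("utf-8") == as_text)
```

The bytes object is longer than the string because each of ü, ß and ö occupies two bytes in UTF-8; decoding the bytes explicitly recovers the same text that the encoding="utf-8" read produced.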

If your strings are indeed UTF-8 encoded, then it is possible that one will contain a multibyte character, that is, a character whose byte representation spans multiple bytes.

This is the case where the difference between reading a file as a byte string and properly decoding it becomes quite apparent.

A character like this: č

will be read in as two bytes but, properly decoded, will be one character:

data = 'č'.encode('utf-8')          # named `data` to avoid shadowing the built-in `bytes`

print(len(data))                    # 2
print(len(data.decode('utf-8')))    # 1
