
Python read non-ascii text file

I am trying to load a text file that contains some German letters, using

content=open("file.txt","r").read() 

which results in this error message

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)

If I modify the file to contain only ASCII characters, everything works as expected.

Apparently, using

content=open("file.txt","rb").read() 

or

content=open("file.txt","r",encoding="utf-8").read()

both do the job.

Why is it possible to read with "binary" mode and get the same result as with utf-8 encoding?

In Python 3, opening a file in 'r' mode without specifying an encoding uses a platform-dependent default (the locale's preferred encoding), which in this case is ASCII. Opening it in 'rb' mode reads the file as bytes and makes no attempt to interpret it as a string of characters.
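To make the difference concrete, here is a minimal sketch that checks the default encoding and then reads the same file both ways. The sample text "Grüße" and the temporary file location are assumptions made up for this illustration; they are not from the question.

```python
import locale
import os
import tempfile

# Encoding that open() typically falls back to when none is given
# (platform and locale dependent).
default_encoding = locale.getpreferredencoding(False)

# Hypothetical sample file, written as UTF-8 for this sketch.
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Grüße")

with open(path, "rb") as f:
    raw = f.read()    # bytes object, no decoding attempted

with open(path, "r", encoding="utf-8") as f:
    text = f.read()   # str object, decoded from UTF-8

print(default_encoding)
print(type(raw).__name__, len(raw))    # bytes 7
print(type(text).__name__, len(text))  # str 5
print(raw.decode("utf-8") == text)     # True
```

So the two reads do not give the same result: one is the raw encoded bytes, the other is the decoded text.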

ASCII only covers byte values in the range [0, 128). If you try to decode a byte outside that range, you get that error.
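The error can be reproduced without a file at all. This is a small sketch, assuming the German letter in question is something like "ä" (the question does not say which letters the file contains):

```python
# "ä" encoded as UTF-8 is two bytes, both outside the ASCII range.
data = "ä".encode("utf-8")   # b'\xc3\xa4'

try:
    data.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)   # same kind of message as in the question

print(failed)    # True
```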

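When you read the file as bytes, nothing is decoded, so every byte value in [0, 256) is accepted and your 0xc3 byte is read in without error. But although this seems to work, the result is still not "correct": you get a bytes object rather than a str, and in UTF-8 the byte 0xc3 is not a character on its own but the first byte of a two-byte sequence (for example, ä is encoded as 0xc3 0xa4).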
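As a rough illustration of why the raw bytes are not "correct" text, the sketch below treats each byte as its own character (which is what decoding as Latin-1 does) and compares that with a proper UTF-8 decode. The letter "ä" is again an assumed example.

```python
raw = "ä".encode("utf-8")

print(raw)               # b'\xc3\xa4'
print(raw[0] == 0xC3)    # True: 0xc3 is only the first of two bytes

# One byte per character: produces two unrelated characters (mojibake).
wrong = raw.decode("latin-1")

# Proper decoding: the two bytes form a single character.
right = raw.decode("utf-8")

print(len(wrong), len(right))   # 2 1
```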

If your file is indeed UTF-8 encoded, then it may contain multibyte characters, that is, characters whose encoded form spans more than one byte.

This is where the difference between reading a file as a byte string and properly decoding it becomes apparent.

A character like č is read in as two bytes but, properly decoded, is one character:

# Use a name other than "bytes" so the built-in type is not shadowed.
data = 'č'.encode('utf-8')         # b'\xc4\x8d'

print(len(data))                   # 2
print(len(data.decode('utf-8')))   # 1
