
Python UnicodeDecodeError on Mac, but not on PC?

I've got a script that basically aggregates students' code files into one file for plagiarism detection. It walks through a tree of files, copying all file contents into one file.

I've run the script on the exact same files on my Mac and my PC. On my PC, it works fine. On my Mac, it encounters 27 UnicodeDecodeErrors (probably 0.1% of all files I'm testing).

What could cause a UnicodeDecodeError on a Mac, but not on a PC?

If relevant, the code is:

originalFile = open(originalFilename, "r")
newFile = open(newFilename, "a")
newFile.write(originalFile.read())

Figure out what encoding was used when the file was saved. A safe first bet is to load the file as 'utf-8'. If that succeeds, it is likely to be the correct encoding, because text saved in another encoding rarely happens to be valid utf-8.

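# Try utf-8. Decoding happens on read(), so that is where a failure shows up.
# If this fails, all bets are off.
with open(originalFilename, "r", encoding="utf-8") as originalFile:
    contents = originalFile.read()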

Now, if students are sending you these files, it's likely they just use the default encoding on their system. It is not possible to reliably guess the encoding. If they were using an 8-bit codec, like one of the ISO-8859 character sets, it will be almost impossible to guess which one was used. What to do then depends on what kind of files you're processing.
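One way to act on this is to try utf-8 first and fall back to an 8-bit codec only when decoding fails. The sketch below is an illustration under stated assumptions, not a definitive fix: the helper name read_text_with_fallback and the choice of cp1252 as the second candidate are hypothetical, and the last-resort latin-1 decode never fails but may produce mojibake.

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used) for the first candidate that decodes cleanly.

    The candidate list is an assumption; adjust it to the encodings your
    students' systems are likely to use.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this cannot raise,
    # but the result may be mojibake.
    return raw.decode("latin-1"), "latin-1"
```

The order matters: utf-8 is strict and rejects most non-utf-8 input, so it goes first, while the permissive 8-bit codecs go last.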

It is incorrect to read Python source files using open(originalFilename, "r") on Python 3. By default, open() uses locale.getpreferredencoding(False). A Python source file may use a different character encoding; in the best case the mismatch raises UnicodeDecodeError, but usually you just get mojibake silently.

To read Python source while taking the encoding declaration (# -*- coding: ...) into account, use tokenize.open(filename). If it fails, the input is not valid Python 3 source code.
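As a rough sketch of that approach, the helper below wraps tokenize.open() and returns None when the encoding cannot be determined or the bytes do not match it. The function name read_python_source and the return-None convention are assumptions made for illustration.

```python
import tokenize


def read_python_source(path):
    """Read a .py file using its coding declaration or BOM (utf-8 if neither)."""
    try:
        with tokenize.open(path) as f:
            return f.read()
    except (SyntaxError, UnicodeDecodeError):
        # SyntaxError: unknown or missing-but-needed encoding declaration.
        # UnicodeDecodeError: later bytes do not match the detected encoding.
        return None
```

A file that declares latin-1 and contains the byte 0xE9 is read correctly, while the same byte in a file with no declaration makes the helper return None.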

What could cause a UnicodeDecodeError on a Mac, but not on a PC?

locale.getpreferredencoding(False) is likely to be utf-8 on a Mac. utf-8 does not accept an arbitrary sequence of bytes as valid utf-8 encoded text. A PC is likely to use an 8-bit character encoding, which corrupts the input and silently produces mojibake instead of raising an error about the mismatched character encoding.
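The difference can be reproduced without any files. This small demonstration assumes cp1252 as the PC's 8-bit encoding, which is common on Western Windows installations but is an assumption here; the variable names are illustrative.

```python
import locale

# The default that open() falls back to when no encoding is given.
print(locale.getpreferredencoding(False))

data = b"caf\xe9"  # "café" as saved by an 8-bit codec such as cp1252

# A strict utf-8 decode rejects the lone 0xE9 byte.
try:
    data.decode("utf-8")
    utf8_result = "decoded"
except UnicodeDecodeError:
    utf8_result = "UnicodeDecodeError"

# The opposite mismatch does not raise: utf-8 bytes read as cp1252
# decode "successfully" into the wrong characters.
mojibake = "café".encode("utf-8").decode("cp1252")
print(utf8_result, repr(mojibake))
```

So the same files can fail loudly under a utf-8 default and pass quietly, with corrupted text, under an 8-bit default.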

To read a text file, you should know its character encoding. If you don't know it, either read the file as a sequence of bytes ('rb' mode) or try to guess the encoding using the chardet Python module (it would only be a guess, but it might be good enough depending on your task).
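For the aggregation task in the question, the byte-level option could look like the sketch below. Nothing is decoded, so no UnicodeDecodeError can occur on either platform. The function name aggregate_bytes, the sorted traversal order, and the newline separator between files are assumptions, not part of the original script.

```python
import os


def aggregate_bytes(root_dir, out_path):
    """Append the raw bytes of every file under root_dir to out_path."""
    out_abs = os.path.abspath(out_path)
    with open(out_path, "ab") as out:
        for dirpath, dirnames, filenames in os.walk(root_dir):
            dirnames.sort()  # make the traversal order deterministic
            for name in sorted(filenames):
                src = os.path.join(dirpath, name)
                if os.path.abspath(src) == out_abs:
                    continue  # never copy the output file into itself
                with open(src, "rb") as f:
                    out.write(f.read())
                out.write(b"\n")  # separator between files (an assumption)
```

Whether raw bytes are acceptable depends on what the plagiarism detector expects as input; if it needs text in one encoding, the files still have to be decoded at some point.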

I had the exact same problem. Some characters in the file caused a UnicodeDecodeError during readlines(). This only happened on my MacBook, not on a PC.

I solved the problem by simply skipping these characters:

with open(file_to_extract, errors='ignore') as f:
    reader = f.readlines()
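Note that errors='ignore' drops the undecodable bytes without a trace. A short comparison with errors='replace', which leaves a visible U+FFFD marker instead, is sketched below; the sample bytes are made up for illustration.

```python
data = b"caf\xe9 au lait"  # 0xE9 is not valid utf-8 in this position

ignored = data.decode("utf-8", errors="ignore")    # bad byte silently dropped
replaced = data.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
```

The same errors argument works with open(), so either policy can be applied while reading the students' files.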
