How to read ® character from Windows-1252 file and write to UTF-8 file

I have an input file in Windows-1252 encoding that contains the '®' character. I need to write this character to a UTF-8 file. Also assume I must use Python 2.7. Seems easy enough, but I keep getting UnicodeDecodeErrors.

I originally opened the input file using codecs.open() with UTF-8 encoding, which worked fine for all of the ASCII characters until it reached the ® symbol, where it failed with the error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 2867043: invalid start byte

I knew that I would have to properly decode it as cp1252 to fix this problem, so I opened it in the proper encoding and then encoded the data as UTF-8 prior to writing. But that produced a new error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 22: ordinal not in range(128)

Here is a minimal example that reproduces the error:

import codecs

with codecs.open('in.txt', mode='rb', encoding='cp1252') as inf:
    with codecs.open('out.txt', mode='wb', encoding='utf-8') as of:
        for line in inf:
            of.write(line.encode('utf-8'))

Here are the contents of in.txt:

Sample file

Here is my sample file® yay.

I thought perhaps I could just open it in 'rb' mode with no encoding specified and specifically handle the decoding and encoding of each line like so:

of.write(line.decode('cp1252').encode('utf-8'))

But that also didn't work, giving the same error as when I just opened it as UTF-8.

How do I read data from a Windows-1252 file, properly decode it then encode it as UTF-8 and write it to a UTF-8 file? The above method has always worked for me in the past until I encountered the ® character.

Your file is not in Windows-1252 if 0xC2 should represent the ® character; in Windows-1252, 0xC2 is Â, and ® is 0xAE.
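To make the byte values concrete, here is a small check you can run yourself. It is only a sketch: the names REGISTERED, cp1252_bytes, utf8_bytes and decodes_as_utf8 are made up for illustration and do not come from the question.

```python
# -*- coding: utf-8 -*-
# Illustrative names only; none of these appear in the original question.
REGISTERED = u'\u00ae'  # the (R) registered-sign character

# One byte in Windows-1252, two bytes in UTF-8.
cp1252_bytes = REGISTERED.encode('cp1252')
utf8_bytes = REGISTERED.encode('utf-8')


def decodes_as_utf8(data):
    """Return True if the byte string is valid UTF-8, False otherwise."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
```

The lone byte 0xAE is not valid UTF-8, which matches the "invalid start byte" error in the question, and 0xC2 only appears once the text has been encoded as UTF-8.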

However, you should just use

of.write(line)

since the file object returned by codecs.open() already encodes what you write as UTF-8; encoding properly is the whole reason you're using codecs in the first place.
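Put together, the whole conversion could look something like the sketch below. The function name transcode_cp1252_to_utf8 and its parameters are hypothetical, and it assumes the input really is Windows-1252.

```python
# -*- coding: utf-8 -*-
import codecs


def transcode_cp1252_to_utf8(src_path, dst_path):
    """Copy a Windows-1252 text file to a UTF-8 text file (hypothetical helper)."""
    # The reader decodes cp1252 bytes into unicode strings.
    with codecs.open(src_path, mode='rb', encoding='cp1252') as inf:
        # The writer encodes unicode strings to UTF-8 bytes on write.
        with codecs.open(dst_path, mode='wb', encoding='utf-8') as of:
            for line in inf:
                # line is already unicode; do not call .encode() here.
                of.write(line)
```

Calling transcode_cp1252_to_utf8('in.txt', 'out.txt') should then write ® as the two bytes 0xC2 0xAE. Both files are opened in binary mode, so line endings are passed through unchanged.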
