
How to fix or remove malformed utf-8 characters in Python3

I have several text files that contain characters which Python 3 has trouble handling. The most troublesome seem to be "closing" quotation marks.

I have tried reading the files with:

with open(filename, 'r', errors='backslashreplace') as file:
    text = file.read()
with open(filename, 'w', errors='backslashreplace') as file:
    file.write(text)

and when I open the file in Notepad++ to view the characters, I see xE2 x80 highlighted to indicate non-text characters, followed by \x9d in normal text.

I see that this corresponds to the \xE2\x80\x9D byte sequence. In the Python REPL I can manually create a bytes object like this, decode it as UTF-8, and when printed it appears as the character I expect. I am not sure why the character is not understood correctly when reading the file.
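The REPL experiment described above can be sketched roughly as follows. The second decode is an assumption: it supposes the platform default encoding is cp1252, as is common on Windows, which would be consistent with the \x9d text seen in Notepad++.

```python
# The three bytes that make up a right double quotation mark in UTF-8.
raw = b"\xe2\x80\x9d"

# Decoded as UTF-8, the bytes form a single character, U+201D.
as_utf8 = raw.decode("utf-8")

# Decoded as cp1252 (assumed here to be the platform default), 0xE2 and 0x80
# become two separate characters and 0x9D has no mapping, so backslashreplace
# turns it into the literal text \x9d.
as_cp1252 = raw.decode("cp1252", errors="backslashreplace")

print(repr(as_utf8))
print(repr(as_cp1252))
```

If the default encoding on the machine is something other than cp1252, the second result will differ, but the general effect (one character split into several) is the same.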

When reading the file with errors set to ignore, rather than backslashreplace, I still get the xE2 x80 characters, and I have not figured out how to perform string operations to remove them.

Ultimately, my goal is to replace all of these strange quotes with normal quotes. There are several ways I can imagine accomplishing this, but they all require me to somehow address (or remove) the xE2 x80 characters, or to correctly read the 3-byte \xE2\x80\x9D sequence.

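Specifying the encoding explicitly should fix the issue. Without an encoding argument, open() falls back to the platform's default encoding, which may not be UTF-8 (on Windows it is commonly cp1252), so the three bytes of the closing quote are not decoded as a single character. You can specify the encoding like this: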

with open(filename, 'r', encoding='utf8', errors='backslashreplace') as file:
    text = file.read()
with open(filename, 'w', encoding='utf8', errors='backslashreplace') as file:
    file.write(text)
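Once the text decodes correctly, the stated goal of replacing curly quotes with normal quotes could be handled with str.translate. This is only a sketch: the names QUOTE_MAP and straighten_quotes are hypothetical, and the set of characters mapped here is an assumption about which quotes appear in the files.

```python
# Map typographic quotes to their plain ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(text):
    """Return text with curly quotes replaced by plain ASCII quotes."""
    return text.translate(QUOTE_MAP)


sample = "\u201cHello,\u201d she said. \u2018It\u2019s fine.\u2019"
print(straighten_quotes(sample))
```

Calling straighten_quotes(text) between the read and the write above would apply the replacement to the whole file.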

To create a copy of the file that omits any bytes which cannot be decoded as UTF-8:

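def sanitize_file(original_filename, sanitized_filename):
    """Copy a file, silently dropping bytes that are not valid UTF-8."""
    # errors='ignore' discards undecodable bytes instead of raising an error.
    with open(original_filename, 'r', encoding='utf8', errors='ignore') as original_file:
        with open(sanitized_filename, 'w', encoding='utf8') as sanitized_file:
            sanitized_file.write(original_file.read())

# The prefix only works as written when filename has no directory component.
sanitize_file(filename, 'sanitized_' + filename)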
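Because errors='ignore' drops data silently, it may be worth checking what would be removed first. The following is a minimal sketch, not part of the original answer: find_invalid_utf8 is a hypothetical helper that takes raw bytes (for example, the result of reading a file opened in 'rb' mode) and reports each undecodable run.

```python
def find_invalid_utf8(data):
    """Return (offset, bad_bytes) pairs for each undecodable run in data."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises at the first invalid byte sequence.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice being decoded.
            start = pos + exc.start
            end = pos + exc.end
            problems.append((start, data[start:end]))
            pos = end
    return problems


# A lone 0x9D byte is not valid UTF-8 on its own.
sample = b"good \x9d text"
print(find_invalid_utf8(sample))
```

An empty list means the data is already valid UTF-8 and nothing would be lost by sanitizing it.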


 