简体   繁体   中英

How to replace invalid unicode characters in a string in Python?

As far as I know it is the concept of python to have only valid characters in a string, but in my case the OS will deliver strings with invalid encodings in path names I have to deal with. So I end up with strings that contain characters that are non-unicode.

In order to correct these problems I need to display these strings somehow. Unfortunately I can not print them because they contain non-unicode characters. Is there an elegant way to replace these characters somehow to at least get some idea of the content of the string?

My idea would be to process these strings character by character and check if the character stored is actually valid unicode. In case of an invalid character I would like to use a certain unicode symbol. But how can I do this? Using codecs seems not to be suitable for that purpose: I already have a string, returned by the operating system, and not a byte array. Converting a string to byte array seems to involve decoding which will fail in my case of course. So it seems that I'm stuck.

Do you have an tips for me how to be able to create such a replacement string?

If you have a bytestring (undecoded data), use the 'replace' error handler. For example, if your data is (mostly) UTF-8 encoded, then you could use:

decoded_unicode = bytestring.decode('utf-8', 'replace')

and U+FFFD REPLACEMENT CHARACTER characters will be inserted for any bytes that can't be decoded.

If you wanted to use a different replacement character, it is easy enough to replace these afterwards:

decoded_unicode = decoded_unicode.replace(u'\ufffd', '#')

Demo:

>>> bytestring = 'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'
>>> bytestring.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 5: invalid start byte
>>> bytestring.decode('utf8', 'replace')
u'F\xf8\xf6\ufffdB\xe5r'
>>> print bytestring.decode('utf8', 'replace')
Føö�Bår

Thanks to you for your comments. This way I was able to implement a better solution:

    try:
        s2 = codecs.encode(s, "utf-8")
        return (True, s, None)
    except Exception as e:
        ret = codecs.decode(codecs.encode(s, "utf-8", "replace"), "utf-8")
        return (False, ret, e)

Please share any improvements on that solution. Thank you!

You have not given an example. Therefore, I have considered one example to answer your question.

x='This is a cat which looks good 😊'
print x
x.replace('😊','')

The output is:

This is a cat which looks good 😊
'This is a cat which looks good '

The right way to do it (at least in python2) is to use unicodedata.normalize:

unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore')

decode('utf-8', 'ignore') will just raise exception.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM