
utf-8 unicode error python

new_text = text.decode('utf-8').replace('\u00a0', ' ').replace('\u00ad', ' ').replace('Â', ' ').replace('    ',' ').replace('   ', ' ').replace('  ', ' ').replace('\u20b9',' ').replace('\ufffd',' ').replace('\u037e',' ').replace('\u2022',' ').replace('\u200b',' ').replace('0xc3',' ')

This is the error produced by the code:

new_text = text.decode('utf-8').replace('\u00a0', ' ').replace('\u00ad', ' ').replace('Â', ' ').replace('    ',
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
127.0.0.1 - - [29/Aug/2017 15:22:00] "GET / HTTP/1.1" 500 -

I have tried decoding ascii from unicode.

You are calling .replace on a unicode object but giving str arguments to it. The arguments are converted to unicode using the default ASCII encoding, which will fail for bytes not in range(128).
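The implicit conversion described above can be reproduced explicitly. This is only a sketch: it assumes the source file was saved as UTF-8, so that the 'Â' argument in the question is the two bytes 0xc3 0x82, which would match the "byte 0xc3 in position 0" in the traceback. The names encoded_arg, error_message and failing_position are made up for the illustration.

```python
# -*- coding: utf-8 -*-
# Assumption: this is what the str literal 'Â' contains in a UTF-8 source file.
encoded_arg = b'\xc3\x82'

try:
    # Python 2 performs this decode implicitly when a str argument
    # is passed to a unicode method such as unicode.replace().
    encoded_arg.decode('ascii')
    error_message = None
    failing_position = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
    failing_position = exc.start  # index of the first undecodable byte

print(error_message)
```

The plain ASCII arguments in the chain, such as '\u00a0' in a str literal, decode without error, which is why only a non-ASCII argument triggers the exception.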

To avoid this problem, do not mix str and unicode. Either pass unicode arguments to unicode methods:

new_text = text.decode('utf-8').replace(u'\u00a0', u' ').replace(u'\u00ad', u' ')...

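or do the replacements on the str object, assuming text is a UTF-8 encoded str. In that case the arguments must be the encoded byte sequences, because '\u00a0' inside a plain str literal is just the six characters backslash, u, 0, 0, a, 0:

new_text = text.replace('\xc2\xa0', ' ').replace('\xc2\xad', ' ')...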
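Putting the first option together, the whole chain from the question could be written roughly as follows. This is a sketch, not a drop-in replacement: clean_text and UNWANTED are hypothetical names, the character list is copied from the question (with 'Â' written as u'\u00c2'), and the regular expression is a substitute for the chained replacements of multiple spaces, so results may differ slightly from the original chain.

```python
# -*- coding: utf-8 -*-
import re

# Characters to turn into spaces, taken from the question's replace chain.
UNWANTED = (u'\u00a0', u'\u00ad', u'\u00c2', u'\u20b9',
            u'\ufffd', u'\u037e', u'\u2022', u'\u200b')


def clean_text(raw):
    """Hypothetical helper: decode UTF-8 bytes, then clean the unicode text."""
    # Decode once, up front; already-decoded text passes through unchanged.
    text = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    for ch in UNWANTED:
        # Both the receiver and the arguments are unicode, so no implicit
        # ASCII conversion takes place.
        text = text.replace(ch, u' ')
    # Collapse any run of two or more spaces into a single space.
    return re.sub(u' {2,}', u' ', text)


print(clean_text(b'price\xc2\xa0\xc2\xa0list'))
```

Because every literal carries the u prefix, the same code should behave the same way on Python 2 and Python 3.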

The last piece of your chained replaces seems to be the problem.

text.replace('0xc3', ' ')

This tries to replace 0xc3 with a space. Note that '0xc3' is the four-character string 0, x, c, 3; the single byte 0xc3 is written '\xc3' in Python. In your code snippet it effectively reads

text.decode('utf-8').replace('0xc3', ' ')

which means that you first decode the bytes to a unicode string in Python and only then try to replace a byte-level pattern. It should work if you do the replacement on the bytes before decoding:

text.replace('0xc3', ' ').decode('utf-8')
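A sketch of replacing on the bytes first and decoding afterwards is shown below. The sample value raw is invented for the illustration. It replaces a whole encoded sequence (0xc2 0xa0, the UTF-8 form of U+00A0) rather than a single byte, because removing one byte of a multi-byte sequence leaves invalid UTF-8; the second half of the example checks that case.

```python
# -*- coding: utf-8 -*-
# Invented sample: "café", a no-break space, "au", a no-break space, "lait".
raw = b'caf\xc3\xa9\xc2\xa0au\xc2\xa0lait'

# Replace the complete two-byte sequence, then decode the result.
cleaned = raw.replace(b'\xc2\xa0', b' ').decode('utf-8')

# Replacing only the lead byte 0xc3 leaves an orphaned continuation byte,
# so the subsequent decode is expected to fail.
try:
    raw.replace(b'\xc3', b' ').decode('utf-8')
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False

print(repr(cleaned), lone_byte_ok)
```

If individual characters need to be removed, doing so after decoding, with unicode arguments, is usually the less error-prone route.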
