new_text = text.decode('utf-8').replace('\u00a0', ' ').replace('\u00ad', ' ').replace('Â', ' ').replace(' ',' ').replace(' ', ' ').replace(' ', ' ').replace('\u20b9',' ').replace('\ufffd',' ').replace('\u037e',' ').replace('\u2022',' ').replace('\u200b',' ').replace('0xc3',' ')
This is the error produced by the code:
new_text = text.decode('utf-8').replace('\u00a0', ' ').replace('\u00ad', ' ').replace('Â', ' ').replace(' ',
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
127.0.0.1 - - [29/Aug/2017 15:22:00] "GET / HTTP/1.1" 500 -
I have tried decoding ascii from unicode.
You are calling .replace on a unicode object but giving it str arguments. The arguments are converted to unicode using the default ASCII encoding, which fails for bytes outside range(128).
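The implicit conversion can be reproduced explicitly. The sketch below assumes the offending argument is the 'Â' literal from the question, stored as UTF-8 in the source file (its first byte is 0xc3, which matches the traceback). The names `non_ascii_arg` and `message` are illustrative only.

```python
# -*- coding: utf-8 -*-
# Assumption: the str argument 'Â' is stored as UTF-8, i.e. the bytes C3 82.
non_ascii_arg = u'\u00c2'.encode('utf-8')

# Python 2 performs this ASCII decode implicitly when a str is passed
# to a unicode method; doing it by hand raises the same error.
try:
    non_ascii_arg.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The printed message names byte 0xc3 at position 0, the same complaint as in the traceback above.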
To avoid this problem, do not mix str and unicode. Either pass unicode arguments to unicode methods:
new_text = text.decode('utf-8').replace(u'\u00a0', u' ').replace(u'\u00ad', u' ')...
or do the replacements on the str object, assuming text is a str:
new_text = text.replace('\xc2\xa0', ' ').replace('\xc2\xad', ' ')...
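As a rough sketch of the first option, the chained calls can be collapsed into a single translation table applied after decoding. The character list is taken from the question's code; `UNWANTED`, `TO_SPACE` and `clean` are hypothetical names, and the input is assumed to be UTF-8 encoded bytes.

```python
# -*- coding: utf-8 -*-
# Characters the question replaces with a space, written as unicode literals.
UNWANTED = u'\u00a0\u00ad\u00c2\u20b9\ufffd\u037e\u2022\u200b'

# Map each code point to a plain space; the values are unicode, never str.
TO_SPACE = dict((ord(ch), u' ') for ch in UNWANTED)


def clean(raw_bytes):
    """Decode UTF-8 bytes, then replace unwanted characters in one pass."""
    return raw_bytes.decode('utf-8').translate(TO_SPACE)


sample = u'a\u00a0b\u2022c'.encode('utf-8')
print(repr(clean(sample)))
```

Because every key and value in the table is unicode, no implicit ASCII conversion takes place.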
The last piece of your chained replaces seems to be the problem.
text.replace('0xc3', ' ')
This tries to replace 0xc3 with a space. Note that '0xc3' as written is the literal four-character text, while the single byte would be spelled '\xc3'. In your code snippet it effectively reads
text.decode('utf-8').replace('0xc3', ' ')
which means that you first decode the bytes into a unicode string in Python and only then try to replace raw bytes in it. It should work if you replace the bytes before decoding:
text.replace('0xc3', ' ').decode('utf-8')
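One caveat when replacing bytes before decoding is that whole UTF-8 sequences have to be replaced, not a lone lead byte. The following is a minimal sketch with a made-up sample string; the variable names are hypothetical.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: "café", a no-break space, then "au lait", as UTF-8.
raw = u'caf\u00e9\u00a0au lait'.encode('utf-8')

# Replacing the complete two-byte sequence for U+00A0 keeps the data valid.
cleaned = raw.replace(b'\xc2\xa0', b' ').decode('utf-8')

# Replacing only the lead byte 0xC3 orphans the continuation byte of the
# accented character, so the later decode fails.
try:
    raw.replace(b'\xc3', b' ').decode('utf-8')
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False

print(repr(cleaned), lone_byte_ok)
```

For that reason it is usually simpler to decode first and do the replacements on the unicode object.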