Replacing unencodable characters

Question

I'm currently working on something where I need to pull some XML from a website and work with it.

Everything works fine, but if I try to print the XML (or the text after parsing it) and it contains a character that can't be encoded, I get this error:

return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2665' in position 1161: character maps to <undefined>

Now I want to locate these characters and replace them with a "?", for example.

How do I do this?

Is there a better way to handle these errors?

Answer 1

It would be easier to help if you posted the code that produced the error. In general, though, you can encode the string to UTF-8 bytes and then decode it again:

data = '\u2665'
data = data.encode('utf8')
print(data)  # b'\xe2\x99\xa5'
data_d = data.decode('utf8')
print(data_d)  # ♥ (still raises UnicodeEncodeError if the console cannot encode it)
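
To get the "?" replacement the question asks for, one option is to encode with errors="replace" against the console's codec rather than UTF-8. This is only a minimal sketch: it assumes the console uses cp1252 (the 'charmap' error suggests a Windows code page, but yours may differ), and to_printable is a hypothetical helper name, not something from the original code.

```python
def to_printable(text, encoding="cp1252"):
    """Return text with characters the codec cannot encode replaced by '?'."""
    # errors="replace" substitutes '?' for every unencodable character
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="replace").decode(encoding)


print(to_printable("I \u2665 XML"))  # I ? XML
```

Characters the codec does support (for example "é" in cp1252) pass through unchanged, so only the problematic ones are lost.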

You can also add this line at the top of your script to declare the encoding of the source file (this does not change the encoding used by the console):

# -*- coding: utf-8 -*-

and then check which encoding stdout actually uses:

import sys
print(sys.stdout.encoding)
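
Once you know the stdout encoding, you can use it to make printing safe. The sketch below assumes you are happy to lose the unencodable characters; safe_print is a hypothetical helper name, and the UTF-8 fallback is an assumption for streams that do not report an encoding.

```python
import sys


def safe_print(text, stream=None):
    """Print text, replacing characters the stream's codec cannot encode."""
    stream = stream or sys.stdout
    # Fall back to UTF-8 if the stream does not report an encoding.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the stream's codec; unencodable characters become '?'.
    safe = text.encode(encoding, errors="replace").decode(encoding)
    print(safe, file=stream)
    return safe


safe_print("I \u2665 XML")
```

On Python 3.7 or later, calling sys.stdout.reconfigure(errors="replace") once at startup should have a similar effect for every print call, and setting the PYTHONIOENCODING environment variable (for example to utf-8) is another option if the terminal can display the characters.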

solution1
0 2020-10-06 16:52:14