Python: Encoding issues?

Question

In my HTML file, the word "Schilderung" looks normal and doesn't seem to have an (encoding?) problem. But when I copy the word, I get "Schilde rung", and when I check its length in Python, I get 13 (instead of the expected 11).

What's the problem here, and how can I handle this?

Thanks a lot for any help!

EDIT: At the moment, I use the following:

output.write(text.decode("utf-8"))

This correctly handles all umlauts and other special characters, but the above problem is still present. print(repr(text)) gives:

'Schilde\xc2\xadrung'

How can I solve this problem? Thanks a lot!

Answer 1

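There is a U+00AD SOFT HYPHEN before the "r" in the string. It is invisible in most renderings, and it is encoded in UTF-8 as the two bytes \xc2\xad:

>>> 'Schilde\xc2\xadrung'.decode('utf-8')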
u'Schilde\xadrung'
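
To find such invisible characters in other strings, something like the following may help. It is a minimal sketch written for Python 3 (the snippets in this answer are Python 2), and the helper name describe_non_ascii is made up for illustration. It also shows where the length of 13 comes from: the UTF-8 byte string has 13 bytes, the decoded text has 12 characters, and only 11 of them are visible letters.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


word = "Schilde\u00adrung"           # the word with the soft hyphen made explicit
print(len(word.encode("utf-8")))     # 13 bytes: the soft hyphen takes two bytes
print(len(word))                     # 12 characters after decoding
print(describe_non_ascii(word))      # [(7, 'U+00AD', 'SOFT HYPHEN')]
```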

To remove all non-ASCII characters (note that this also strips umlauts and other accented letters):

>>> s = u'Schilde\xadrung'
>>> s.encode('ascii', 'ignore').decode()
u'Schilderung'
>>> len(_)
11
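
Since the question mentions umlauts that should survive, it may be preferable to remove only the soft hyphen rather than every non-ASCII character. The following is a sketch under that assumption, written for Python 3; the function name strip_soft_hyphens and the sample word "Erklärung" are illustrative and do not come from the question. In Python 2 the equivalent call would be s.replace(u'\xad', u'').

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, invisible unless a line break occurs at it


def strip_soft_hyphens(text):
    """Remove soft hyphens only, leaving all other characters untouched."""
    return text.replace(SOFT_HYPHEN, "")


cleaned = strip_soft_hyphens("Schilde\u00adrung")
print(cleaned, len(cleaned))  # Schilderung 11

# Umlauts are preserved, unlike with encode('ascii', 'ignore'):
print(strip_soft_hyphens("Er\u00adkl\u00e4\u00adrung"))             # Erklärung
print("Erkl\u00e4rung".encode("ascii", "ignore").decode("ascii"))  # Erklrung
```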

Answer 2

It seems there is a non-ASCII character just before the "r":

>>> u'Schilderung'  # pasted from the page; an invisible U+00AD sits before the "r"
u'Schilde\xadrung'
