in my HTML file, the word "Schilderung" looks normally and it doesn't seem to have an (encoding?) problem. But when I copy the word, I get the following: "Schilde rung", and if I'd like to find out the length with python, I get 13 (instead of 12...).
What's the problem here, and how can I handle this?
Thanks a lot for any help!
EDIT: At the moment, I use the following: output.write(text.decode("utf-8"))
This handles correctly all umlaut and other special char, but the above problem is still present. print(repr(txt)) gives: Schilde\\xc2\\xadrung How can we solve this problem? Thanks a lot!
There is U+00AD SOFT HYPHEN before r
in the string:
>>> "Schilderung".decode('utf-8')
u'Schilde\xadrung'
To remove non-ascii characters:
>>> s = u'Schilde\xadrung'
>>> s.encode('ascii', 'ignore').decode()
u'Schilderung'
>>> len(_)
11
Seems like "r"
isn't ASCII:
>>> u'Schilderung'
u'Schilde\xadrung'
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.