Python: Encoding issues?

In my HTML file, the word "Schilderung" looks normal and doesn't seem to have an (encoding?) problem. But when I copy the word, I get "Schilde rung", and when I check its length in Python, I get 13 instead of 11.

What's the problem here, and how can I handle this?

Thanks a lot for any help!

EDIT: At the moment, I use the following: output.write(text.decode("utf-8")). This correctly handles all umlauts and other special characters, but the above problem is still present. print(repr(txt)) gives: Schilde\xc2\xadrung. How can we solve this problem? Thanks a lot!

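There is a U+00AD SOFT HYPHEN before the "r" in the string. It is normally invisible, and in UTF-8 it is encoded as two bytes (\xc2\xad), which is why the byte string has length 13: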

>>> "Schilde­rung".decode('utf-8')
u'Schilde\xadrung'
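
If you are not sure which invisible character is hiding in a string, you can list every non-ASCII character with its Unicode name. This is only a minimal sketch, written for Python 3 (the sessions in this thread are Python 2); the helper name describe_non_ascii is made up for illustration.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# The word from the question, with the soft hyphen written as an escape
word = "Schilde\u00adrung"

print(len(word))                  # 12: eleven letters plus the soft hyphen
print(describe_non_ascii(word))   # [(7, 'U+00AD', 'SOFT HYPHEN')]
```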

To remove all non-ASCII characters (note that this also drops umlauts and other special characters):

>>> s = u'Schilde\xadrung'
>>> s.encode('ascii', 'ignore').decode()
u'Schilderung'
>>> len(_)
11
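
Since the question mentions that umlauts must survive, it may be safer to remove only the soft hyphen instead of everything outside ASCII. Here is a rough sketch for Python 3, assuming the text has already been decoded to str; the function names strip_soft_hyphens and strip_format_chars are hypothetical helpers, not part of any library.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

def strip_soft_hyphens(text):
    """Remove only U+00AD SOFT HYPHEN, leaving umlauts and other letters alone."""
    return text.replace(SOFT_HYPHEN, "")

def strip_format_chars(text):
    """Remove every character in Unicode category Cf (invisible format
    characters such as the soft hyphen or the zero-width joiner)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

cleaned = strip_soft_hyphens("Schilde\u00adrung")
print(cleaned, len(cleaned))                        # Schilderung 11
print(strip_soft_hyphens("M\u00e4d\u00adchen"))     # Mädchen (umlaut kept)
```

The second helper is broader and is only worth using if other invisible characters turn up in the HTML as well.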

It seems the character before the "r" isn't ASCII:

>>> u'Schilde­rung'
u'Schilde\xadrung'
