
How to encode Unicode characters in Python?

I am trying to encode Unicode characters into a specific format that can be sent in a URL, using Python 2.

input = u"í"

required_output = "%CC%81"

import urllib

print urllib.quote('í')  # prints %C3%AD

Is there a way to encode the input as shown so that I get the required output?

>>> import unicodedata, urllib
>>> urllib.quote(unicodedata.normalize("NFD", u"í").encode('utf8'))
'i%CC%81'

You encoded U+00ED LATIN SMALL LETTER I WITH ACUTE, a single precomposed character, rather than the ASCII letter i followed by U+0301 COMBINING ACUTE ACCENT. It is the combining accent that encodes to CC 81 in UTF-8.

If your input data is precomposed, you have to decompose it to the NFD or NFKD normal form:
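To see the difference between the two forms, it can help to list the code points and UTF-8 bytes of each. The following is a minimal sketch, not part of the original answer; the helper name `describe` is made up for illustration, and the code is written to run on both Python 2 and Python 3.

```python
import unicodedata

composed = u"\u00ed"                                  # precomposed character
decomposed = unicodedata.normalize("NFD", composed)   # base letter + combining accent

def describe(text):
    """Return (code point, Unicode name, UTF-8 bytes as hex) for each character.

    Hypothetical helper, for illustration only.
    """
    rows = []
    for ch in text:
        # bytearray yields integers on both Python 2 and Python 3
        utf8_hex = " ".join("%02X" % b for b in bytearray(ch.encode("utf-8")))
        rows.append(("U+%04X" % ord(ch), unicodedata.name(ch), utf8_hex))
    return rows

for row in describe(composed) + describe(decomposed):
    print("%s  %s  %s" % row)
```

The composed string has a single entry whose bytes are C3 AD, while the decomposed string has two entries: the plain letter i (69) and the combining accent (CC 81).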

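import unicodedata, urllib
# Decompose to NFD, encode to UTF-8 bytes, then percent-encode the bytes.
normalized = unicodedata.normalize("NFD", input).encode('utf8')
print urllib.quote(normalized)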

See the Wikipedia Unicode article section on normal forms.

Generally speaking, for a URL you should really stick to the NFC normal form! An Internationalized Resource Identifier (IRI), which allows non-ASCII data, is converted to a URL by using the NFC normal form, so %C3%AD is the correct form, not i%CC%81.
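If you follow that advice, one option is to normalize to NFC before percent-encoding, so that composed and decomposed input produce the same URL text. This is a rough sketch under that assumption; `quote_nfc` is a hypothetical helper name, and the import fallback is only there so the snippet also runs on Python 3, where `quote` lives in `urllib.parse`.

```python
import unicodedata

try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

def quote_nfc(text):
    """Percent-encode a unicode string after normalizing it to NFC.

    Hypothetical helper, for illustration only.
    """
    normalized = unicodedata.normalize("NFC", text)
    # quote() is given UTF-8 bytes, as in the NFD example above
    return quote(normalized.encode("utf-8"))

print(quote_nfc(u"\u00ed"))     # composed input
print(quote_nfc(u"i\u0301"))    # decomposed input
```

Both calls print %C3%AD, because NFC recombines the letter and the combining accent into U+00ED before encoding.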


 