I am trying to encode Unicode characters into a specific format that can be percent-encoded and sent in a URL, using Python 2.
input = u"í"
required_output = "%CC%81"
import urllib
print urllib.quote('í')  # prints %C3%AD, not the required output
Is there a way to encode as shown so I can get the required output?
>>> import unicodedata, urllib
>>> urllib.quote(unicodedata.normalize("NFD", u"í").encode('utf8'))
'i%CC%81'
You encoded U+00ED LATIN SMALL LETTER I WITH ACUTE, a single precomposed character, rather than the ASCII letter i followed by U+0301 COMBINING ACUTE ACCENT. It is the combining accent that encodes to CC 81 in UTF-8.
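To make the difference between the two representations visible, a small sketch along these lines may help. The helper name `describe` is hypothetical (it is not part of the question), and the `\u` escapes are used only so the snippet does not depend on the source file encoding:

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]

composed = u"\u00ed"                                  # single precomposed character
decomposed = unicodedata.normalize("NFD", composed)   # base letter + combining accent

# One code point versus two:
print(describe(composed))
print(describe(decomposed))

# The UTF-8 bytes differ accordingly (C3 AD versus 69 CC 81):
print(repr(composed.encode("utf-8")))
print(repr(decomposed.encode("utf-8")))
```

Both strings render identically on screen, which is why the mismatch only shows up once the bytes are percent-encoded.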
If your input data is precomposed, you have to decompose it to the NFD or NFKD normal form before encoding:
normalized = unicodedata.normalize("NFD", input).encode('utf8')
print urllib.quote(normalized)
See the Wikipedia Unicode article section on normal forms.
Generally speaking, for a URL, you should really stick to the NFC normal form! An Internationalized Resource Identifier (IRI), which allows non-ASCII data, is converted to a URL by using the NFC normal form, so %C3%AD is the correct form, not i%CC%81.
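If you want to follow the NFC recommendation regardless of how the input arrives, one possible approach is to normalize before quoting. This is only a sketch: `quote_nfc` is a hypothetical helper name, and the import fallback is there so the snippet runs on Python 3 as well as Python 2:

```python
import unicodedata
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

def quote_nfc(text):
    """Normalize text to NFC, encode it as UTF-8, then percent-encode it."""
    return quote(unicodedata.normalize("NFC", text).encode("utf-8"))

print(quote_nfc(u"\u00ed"))     # precomposed input -> %C3%AD
print(quote_nfc(u"i\u0301"))    # decomposed input, recomposed by NFC -> %C3%AD
```

With this, composed and decomposed input produce the same URL-safe string, so the server sees one consistent form.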