Context: I am using Google's App Engine (in Python) to connect to Wikipedia's API. I then get a JSON file that I use for display on a webpage. It is working OK, but I am having issues with accented/non-Latin characters.
Actual issue: When I query "Nikola Tesla", his name in Cyrillic comes back as Python source-code escapes instead of UTF-8: \u041d\u0438\u043a\u043e\u043b\u0430 \u0422\u0435\u0441\u043b\u0430.
As a result, the escapes are not interpreted, and his name shows up on the webpage as \u041d\u0438\u043a\u043e\u043b\u0430 \u0422\u0435\u0441\u043b\u0430 instead of Никола Тесла.
How can I convert these Python source-code escapes into valid UTF-8: \\xD0\\x9D\\xD0\\xB8\\xD0\\xBA\\xD0\\xBE\\xD0\\xBB\\xD0\\xB0?
Other than me painstakingly googling individual characters, that is...
Thank you
There is a difference between u"\u041d" (with the u prefix) and "\u041d" (without the u prefix). In Python 2, the first is treated as a Unicode string containing the single letter Н. The second is treated as a plain byte string, so Python doesn't recognize \u041d as a Unicode escape and keeps it as six literal characters.
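To make the difference concrete, here is a minimal sketch. It assumes Python 3, where every str is already Unicode and the u prefix is optional, so the "no prefix" case from Python 2 is imitated with a doubled backslash. The variable names are only for illustration.

```python
# Assumes Python 3. A doubled backslash stands in for the Python 2
# byte string "\u041d" written without the u prefix.
escaped = "\\u041d"   # six characters: \ u 0 4 1 d
real = "\u041d"       # one character: Cyrillic capital letter En

print(len(escaped))     # 6
print(len(real))        # 1
print(escaped == real)  # False
```

The text coming back from the API looks like the first variable, while the webpage needs the second.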
If you have text without the prefix (and you can't add the prefix manually), you have to convert it to a proper Unicode string using decode('unicode-escape'):
"\u041d".decode('unicode-escape')
u"\u041d"
Once you have a proper Unicode string, you can encode it as UTF-8:
u"\u041d".encode('utf-8')
'\xd0\x9d'
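The snippets above are Python 2. As a rough sketch of the same two steps in Python 3 (an assumption on my part, since the question uses Python 2): str has no decode() method there, so the escaped text first has to be turned into bytes.

```python
# Assumes Python 3 and input that contains only ASCII characters
# (the backslash escapes themselves are ASCII).
escaped = "\\u041d"  # literal backslash-u text, six characters

# Step 1: interpret the \uXXXX escape to get a real Unicode string.
decoded = escaped.encode("ascii").decode("unicode_escape")

# Step 2: encode that Unicode string as UTF-8 bytes.
utf8_bytes = decoded.encode("utf-8")

print(len(decoded))  # 1
print(utf8_bytes)    # b'\xd0\x9d'
```

If the input can also contain real non-ASCII characters, encode("ascii") will raise an error, so this sketch only fits fully escaped text.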
I use string literals in the examples, but the same works with variables that hold strings:
my_string = "\u041d\u0438\u043a\u043e\u043b\u0430"
new_string = my_string.decode('unicode-escape')
new_string = new_string.encode('utf-8')
print new_string
Никола
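Since the text in the question comes from a JSON response, another option may be to let a JSON parser resolve the escapes instead of decoding them by hand. A minimal sketch, assuming Python 3; the payload and its "name" key are hypothetical and do not reflect the real structure of a Wikipedia API response:

```python
import json

# Hypothetical payload for illustration only; the real API response
# is shaped differently.
raw = '{"name": "\\u041d\\u0438\\u043a\\u043e\\u043b\\u0430 \\u0422\\u0435\\u0441\\u043b\\u0430"}'

data = json.loads(raw)             # the parser resolves the \uXXXX escapes
name = data["name"]                # real Unicode string
utf8_bytes = name.encode("utf-8")  # bytes suitable for a UTF-8 page

print(len(name))        # 12 characters
print(len(utf8_bytes))  # 23 bytes
```

If the escapes survive all the way to the webpage, it is worth checking whether the response is being parsed as JSON at all or just passed along as raw text.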