Converting Python source code to UTF-8

Context: I am using Google's App Engine (in Python) to connect to Wikipedia's API. I get back a JSON response that I use for display on a webpage. It works, but I am having issues with accented and non-Latin characters.

Actual issue: When I query "Nikola Tesla", his name in Cyrillic comes back as Python source code (escape sequences) instead of UTF-8: \u041d\u0438\u043a\u043e\u043b\u0430 \u0422\u0435\u0441\u043b\u0430.

As a result, the escapes are not interpreted, and his name on the webpage shows as \u041d\u0438\u043a\u043e\u043b\u0430 \u0422\u0435\u0441\u043b\u0430 instead of Никола Тесла.

How could I convert this Python source code into valid UTF-8: \\xD0\\x9D\\xD0\\xB8\\xD0\\xBA\\xD0\\xBE\\xD0\\xBB\\xD0\\xB0

Other than me painstakingly googling individual characters, that is...

Thank you

There is a difference between u"\u041d" (with the u prefix) and "\u041d" (without it) in Python 2. The first is a unicode string containing the single letter Н. The second is a plain byte string in which \u041d is just six literal characters, so Python does not recognize it as a Unicode letter.

If your text has no prefix (and you can't add the prefix by hand), you have to convert it to a proper unicode string using decode('unicode-escape'):

"\u041d".decode('unicode-escape') 

u"\u041d"

Once you have a correct unicode string, you can encode it as UTF-8:

u"\u041d".encode('utf-8')

'\xd0\x9d'
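The two calls above are Python 2. In Python 3, str has no decode method, so the same round trip has to pass through bytes first. A minimal sketch, assuming the escaped text contains only ASCII characters (the variable names are illustrative, not from the question):

```python
# Python 3 sketch of the decode/encode steps.
# Assumption: the escaped text is pure ASCII (only backslash escapes).
escaped = "\\u041d"  # six literal characters: \ u 0 4 1 d

# Turn the text into bytes, then interpret the \uXXXX escapes.
decoded = escaped.encode("ascii").decode("unicode_escape")

# Encode the resulting one-character string as UTF-8 bytes.
utf8_bytes = decoded.encode("utf-8")

print(decoded)
print(utf8_bytes)
```

If the input might already contain real non-ASCII characters, the encode("ascii") step will raise an error, which is safer than silently producing mangled text.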

-

I use string literals in these examples, but the same calls work on variables that hold strings:

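# Python 2: this literal holds backslash escapes, not Cyrillic letters
my_string = "\u041d\u0438\u043a\u043e\u043b\u0430"

# interpret the escapes to get a unicode string
new_string = my_string.decode('unicode-escape')

# encode the unicode string as UTF-8 bytes
new_string = new_string.encode('utf-8')

print new_string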

Никола
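Since the question mentions that the data arrives as JSON, it may be that the escapes are simply JSON's own \uXXXX notation, in which case parsing the response with the standard json module decodes them without any manual step. A hedged sketch in Python 3, where the payload and the key name "title" are made up for illustration and are not the actual Wikipedia API response:

```python
import json

# Hypothetical fragment of an API response; the "title" key is illustrative.
raw_json = (
    '{"title": "\\u041d\\u0438\\u043a\\u043e\\u043b\\u0430 '
    '\\u0422\\u0435\\u0441\\u043b\\u0430"}'
)

# json.loads interprets the \uXXXX escapes while parsing.
data = json.loads(raw_json)
name = data["title"]

# Encode to UTF-8 bytes only at the point of writing the response.
utf8_bytes = name.encode("utf-8")

print(name)
```

If the name still shows up as escapes after parsing, the text was probably escaped twice somewhere upstream, and the unicode-escape approach above is the fallback.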
