Converting Python source code to UTF-8

Question

Context: I am using Google's App Engine (in Python) to connect to Wikipedia's API. I then get a JSON file that I use for display on a webpage. It is working OK, but I am having issues with accented/non-Latin characters.

Actual issue: When I query "Nikola Tesla", his name in Cyrillic comes across as Python source code instead of UTF-8: \u041d\u0438\u043a\u043e\u043b\u0430 \u0422\u0435\u0441\u043b\u0430 .

As a result, the Python source code doesn't get read properly and his name on the webpage shows as \u041d\u0438\u043a\u043e\u043b\u0430 \u0422\u0435\u0441\u043b\u0430 instead of Никола Тесла.

How could I convert this Python source code into valid UTF-8: \\xD0\\x9D\\xD0\\xB8\\xD0\\xBA\\xD0\\xBE\\xD0\\xBB\\xD0\\xB0

Other than me painstakingly googling individual characters, that is...

Thank you

Answer 1

There is a difference between u"\u041d" (with the u prefix) and "\u041d" (without it). In Python 2, the first is a unicode string containing the single letter Н. The second is a plain byte string, so Python does not interpret \u041d as an escape and the text stays as six literal characters.

If you have text without the prefix (and you can't add the prefix manually), convert it to a proper unicode string using decode('unicode-escape'):
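To make the distinction concrete, here is a minimal sketch. It assumes Python 3, where every str is already Unicode and the u prefix is optional, so a raw string stands in for text that arrived with a literal backslash and "u" in it (the situation in the question). The variable names are illustrative only.

```python
# Assumes Python 3. Variable names are illustrative, not from the original answer.
real_letter = "\u041d"     # escape is interpreted: one character (Cyrillic Н)
escaped_text = r"\u041d"   # raw string: six characters (backslash, u, 0, 4, 1, d)

print(len(real_letter))             # 1
print(len(escaped_text))            # 6
print(real_letter == escaped_text)  # False
```

The second form is what ends up on the page when the escapes are never decoded.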

"\u041d".decode('unicode-escape') 

u"\u041d"

Once you have a correct unicode string, you can encode it as UTF-8:

u"\u041d".encode('utf-8')

'\xd0\x9d'
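The two calls above are written for Python 2. As a rough equivalent, assuming Python 3 (where str has no decode method), the escaped text can first be turned into bytes and then decoded. This is a sketch rather than part of the original answer, and the variable names are made up for illustration.

```python
# Assumes Python 3. The raw string mimics text containing a literal \u041d.
escaped = r"\u041d"

# encode('ascii') fails loudly if the text already contains non-ASCII characters,
# which unicode_escape would otherwise misread as Latin-1.
decoded = escaped.encode("ascii").decode("unicode_escape")
utf8_bytes = decoded.encode("utf-8")

print(decoded)     # Н
print(utf8_bytes)  # b'\xd0\x9d'
```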

-

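The examples use string literals, but the same calls work on variables that hold strings.

my_string = "\u041d\u0438\u043a\u043e\u043b\u0430"  # byte string with literal \uXXXX escapes (Python 2)

new_string = my_string.decode('unicode-escape')  # unicode string

new_string = new_string.encode('utf-8')  # UTF-8 encoded byte string

print new_string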

Никола
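Since the question says the data arrives as JSON, another option may be to let the JSON parser resolve the escapes, because \uXXXX sequences are part of JSON string syntax. The sketch below assumes Python 3 and uses a made-up payload; the key "name" is hypothetical and a real Wikipedia API response is structured differently.

```python
import json

# Hypothetical payload for illustration; not the real Wikipedia API response shape.
raw = '{"name": "\\u041d\\u0438\\u043a\\u043e\\u043b\\u0430 \\u0422\\u0435\\u0441\\u043b\\u0430"}'

data = json.loads(raw)       # the parser converts \uXXXX escapes to real characters
name = data["name"]

print(name)                  # Никола Тесла
print(name.encode("utf-8"))  # the same text as UTF-8 bytes
```

If the text has already been through a JSON parser and still shows backslashes, the escapes were probably doubled somewhere upstream, and the decode step from the answer would still be needed.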

solution1
0 ACCEPTED 2016-01-20 22:22:48
