Converting unicode string to utf-8

Question

Firstly, I am aware that there are tons of questions regarding en/de-coding of strings in Python 2.x, but I can't seem to find a solution to this problem.

I have a unicode string that contains the letter č, which is represented as \u00c4\u008d.

If I write this in the Python console:

>>> a = u"\u00c4\u008d"
>>> print a

I get two strange characters printed instead of č, probably because the string is actually supposed to be UTF-8. So I tried .decode("utf-8"), but that gives me the standard UnicodeEncodeError.

Do you know how I can make Python print that string as č in the console?

Thanks a lot

Answer 1

č is not represented by u'\u00c4\u008d'. Those two hex values are the bytes of its UTF-8 encoding, so they should be written in a byte string as '\xc4\x8d'. Example:

>>> s = '\xc4\x8d'
>>> s.decode('utf8')
u'\u010d'
>>> print(s.decode('utf8'))
č

Caveat: Your terminal must be configured with an encoding that supports the character for it to print correctly. If it is not, you will see a UnicodeEncodeError.
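To illustrate the caveat, here is a minimal sketch of a check you could run before printing. The helper name can_print is made up for this example, and it assumes sys.stdout.encoding reflects what your terminal accepts, which is not always the case (for example when output is piped). It is written to run on Python 2.7 and 3.3+.

```python
import sys

def can_print(text, encoding=None):
    """Return True if `text` can be encoded with `encoding`.

    Hypothetical helper: falls back to sys.stdout.encoding, then to
    'ascii' when stdout reports no encoding (e.g. when piped in Python 2).
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

c_caron = u"\u010d"  # the letter from the question
utf8_ok = can_print(c_caron, "utf-8")
ascii_ok = can_print(c_caron, "ascii")
```

Here utf8_ok is True and ascii_ok is False, which is the situation where print raises UnicodeEncodeError.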

If for some reason you have a mis-decoded Unicode string, you can take advantage of the fact that the first 256 code points of Unicode map one-to-one onto the latin1 encoding and fix it:

>>> s = u'\u00c4\u008d'
>>> s.encode('latin1')
'\xc4\x8d'
>>> s.encode('latin1').decode('utf8')
u'\u010d'
>>> print(s.encode('latin1').decode('utf8'))
č
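The same round trip can be wrapped in a small function. This is only a sketch: the name fix_mojibake is made up here, and it assumes the damage is exactly "UTF-8 bytes decoded as latin1". Anything that does not fit that pattern is returned unchanged rather than guessed at. It runs on Python 2.7 and 3.3+.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as latin1.

    Hypothetical helper: returns `text` unchanged when it does not
    look like this particular kind of damage.
    """
    try:
        # latin1 maps code points 0-255 straight back to bytes 0-255,
        # which recovers the original UTF-8 byte sequence.
        return text.encode("latin1").decode("utf8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

repaired = fix_mojibake(u"\u00c4\u008d")   # the string from the question
untouched = fix_mojibake(u"\u010d")        # already correct, left alone
```

Note that plain ASCII text passes through unchanged as well, because ASCII bytes are valid UTF-8 and decode to themselves.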

That said, if you have a mis-decoded Unicode string, the real fix is upstream: show the file you have or the code that read it, and solve the problem where the wrong decoding happens.

Answer 2

After fighting with Python for over an hour, I decided to look for a solution in another language. This is how my goal can be achieved in C#:

var s = "\u00c4\u008d";
var newS = Encoding.UTF8.GetString(Encoding.Default.GetBytes(s));
File.WriteAllText(@"D:\tmp\test.txt", newS, Encoding.UTF8);

Finally! The file now contains č .

I therefore got inspired by this approach in C# and managed to come up with the following (seemingly) equivalent solution in Python:

>>> s = u"\u00c4\u008d"
>>> arr = bytearray(map(ord, s))
>>> print arr.decode("utf-8")
č

I'm not sure how good this solution is but it seems to work in my case.
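For what it's worth, mapping ord over the string produces the same bytes as encoding with latin1, as far as I can tell, so the two approaches should be interchangeable for this input. The sketch below compares them and also shows the limit: any code point above 255 makes bytearray raise ValueError. The helper name ord_bytes_fails is made up for the demonstration. It runs on Python 2.7 and 3.3+.

```python
s = u"\u00c4\u008d"

# Build the byte sequence the way Answer 2 does, and via latin1.
via_ord = bytes(bytearray(map(ord, s)))
via_latin1 = s.encode("latin1")

same_bytes = via_ord == via_latin1      # both are the two bytes C4 8D
decoded = via_ord.decode("utf-8")       # the single character U+010D

def ord_bytes_fails(text):
    """Hypothetical check: True if some code point does not fit in a byte."""
    try:
        bytearray(map(ord, text))
        return False
    except ValueError:
        return True

fails_on_wide = ord_bytes_fails(u"\u010d")  # 0x10D is greater than 255
```

So the approach works as long as every character in the string is below U+0100, which is the case for text that was mis-decoded one byte at a time.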

2 answers

solution1
3 2018-04-24 16:29:38

solution2
0 ACCEPTED 2018-04-24 15:37:02
