Convert a Unicode string to UTF-8

Question

Firstly, I am aware that there are tons of questions regarding en/de-coding of strings in Python 2.x, but I can't seem to find a solution to this problem.

I have a unicode string that contains the letter č, which is represented as \u00c4\u008d

If I write the following in the Python console

>>> a = u"\u00c4\u008d"
>>> print a

I get two strange characters printed out instead of č, probably because the actual encoding of that string is supposed to be UTF-8. Therefore I try to use .decode("utf-8"), but for this I get the standard UnicodeEncodeError.

Do you know how I can make Python print that string as č in the console?

Thanks a lot

Answer 1

č is not represented by u'\u00c4\u008d'. Those two hex values are the UTF-8-encoded bytes of the character, so they should be written in a byte string as '\xc4\x8d'. Example:

>>> s = '\xc4\x8d'
>>> s.decode('utf8')
u'\u010d'
>>> print(s.decode('utf8'))
č

Caveat: Your terminal must be configured with an encoding that supports the character for it to print correctly, or you will see a UnicodeEncodeError.
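
To make the caveat concrete, here is a minimal sketch of checking whether a given encoding can represent the character before printing it. It is written in Python 3 syntax (an assumption, since the question is about Python 2), and can_print is a hypothetical helper name, not a standard function.

```python
import sys

def can_print(text, encoding=None):
    """Return True if `text` can be encoded with `encoding`.

    Hypothetical helper: when no encoding is given it uses the one
    reported by sys.stdout, falling back to ASCII if that is missing.
    """
    enc = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    try:
        text.encode(enc)
        return True
    except UnicodeEncodeError:
        return False

c = b"\xc4\x8d".decode("utf-8")    # the character U+010D

print(can_print(c, "utf-8"))       # True
print(can_print(c, "cp1250"))      # True: Central European code page
print(can_print(c, "cp1252"))      # False: Western European code page
print(can_print(c, "ascii"))       # False

# A fallback that never raises: show an escape instead of the character.
print(c.encode("ascii", "backslashreplace").decode("ascii"))  # \u010d
```

If can_print(c) returns False for your own terminal, the fix is on the terminal or environment side rather than in the string itself.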

If for some reason you have a mis-decoded Unicode string, you can take advantage of the fact that the first 256 code points of Unicode correspond to the latin1 encoding and fix it:

>>> s = u'\u00c4\u008d'
>>> s.encode('latin1')
'\xc4\x8d'
>>> s.encode('latin1').decode('utf8')
u'\u010d'
>>> print(s.encode('latin1').decode('utf8'))
č
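
The same round trip can be wrapped in a small function. This is only a sketch in Python 3 syntax (an assumption, since the session above is Python 2), and fix_mojibake is a hypothetical name. It returns the input unchanged when the round trip is not possible, which is a design choice of this sketch rather than anything the codecs require.

```python
def fix_mojibake(text, wrong="latin-1", right="utf-8"):
    """Undo a decode that used the wrong codec.

    Re-encodes `text` with the codec that was wrongly applied, then
    decodes the resulting bytes with the codec that should have been used.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in `wrong`, or the bytes are not valid `right`:
        # the string was probably not mis-decoded in this way.
        return text

broken = "\u00c4\u008d"             # UTF-8 bytes C4 8D read as latin-1
fixed = fix_mojibake(broken)

print(ascii(fixed))                 # '\u010d'
print(fix_mojibake("plain ascii"))  # plain ascii
print(ascii(fix_mojibake("\u010d")))  # '\u010d' (already correct, left alone)
```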

If you do have a mis-decoded Unicode string, though, you should show the file you have or the code that read it, and solve the problem there.
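
As an illustration of solving it at the source, the sketch below writes the two UTF-8 bytes to a temporary file and reads them back with two different codecs. The file name and the use of a temporary directory are assumptions made for the example; io.open with an encoding argument exists in both Python 2.6+ and Python 3.

```python
import io
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")   # hypothetical file name

try:
    # The file on disk holds the UTF-8 encoding of the character.
    with io.open(path, "wb") as f:
        f.write(b"\xc4\x8d")

    # Reading with the wrong codec produces the string from the question.
    with io.open(path, encoding="latin-1") as f:
        wrong = f.read()

    # Reading with the right codec produces the intended character.
    with io.open(path, encoding="utf-8") as f:
        right = f.read()
finally:
    shutil.rmtree(tmp_dir)

print(wrong == u"\u00c4\u008d")   # True
print(right == u"\u010d")         # True
```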

Answer 2

After fighting with Python for over an hour, I decided to look for a solution in another language. This is how my goal can be achieved in C#:

var s = "\u00c4\u008d";
var newS = Encoding.UTF8.GetString(Encoding.Default.GetBytes(s));
File.WriteAllText(@"D:\tmp\test.txt", newS, Encoding.UTF8);

Finally! The file now contains č.

Inspired by this approach in C#, I managed to come up with the following (seemingly) equivalent solution in Python:

>>> s = u"\u00c4\u008d"
>>> arr = bytearray(map(ord, s))
>>> print arr.decode("utf-8")
č

I'm not sure how good this solution is, but it seems to work in my case.
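
For what it's worth, mapping each character to its code point and packing the results into a bytearray produces the same bytes as encoding with latin-1, as long as every code point is below 256. A quick check, sketched in Python 3 syntax (an assumption, since the session above is Python 2):

```python
s = u"\u00c4\u008d"

via_ord = bytes(bytearray(map(ord, s)))   # the approach from this answer
via_latin1 = s.encode("latin-1")          # the codec-based equivalent

print(via_ord == via_latin1)              # True
print(via_ord.decode("utf-8") == u"\u010d")  # True

# The bytearray approach fails for any code point above 255.
try:
    bytearray(map(ord, u"\u010d"))
    out_of_range = False
except ValueError:
    out_of_range = True

print(out_of_range)                       # True
```

So the approach works for strings like the one in the question, but it will raise on text that contains characters outside the first 256 code points.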

Answer 1
3 2018-04-24 16:29:38

Answer 2
0 Accepted 2018-04-24 15:37:02