Converting a unicode string to UTF-8

First, I know there are many questions about encoding and decoding strings in Python 2.x, but I can't find a solution to this particular problem.

I have a unicode string containing the letter č, which is represented as \u00c4\u008d.

If I type this in the Python console:

>>> a = u"\u00c4\u008d"
>>> print a

two strange characters are printed instead of č, probably because the string's actual encoding is supposed to be UTF-8. So I tried .decode("utf-8"), but that gives me the standard UnicodeEncodeError.

Do you know how I can make Python print that string as č in the console?

Thanks a lot.

č is not represented by u'\u00c4\u008d'. Those two hex values are the bytes of its UTF-8 encoding, so they should be written in a byte string as '\xc4\x8d'. Example:

>>> s = '\xc4\x8d'
>>> s.decode('utf8')
u'\u010d'
>>> print(s.decode('utf8'))
č

Caveat: Your terminal must be configured with an encoding that supports the character for it to print correctly; otherwise you will see a UnicodeEncodeError.
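To make the caveat concrete, here is a small sketch that checks whether a character can be encoded for a given output encoding before printing it. It is written for Python 3 (the thread itself uses Python 2), and the helper name can_print is made up for illustration.

```python
import sys

def can_print(text, encoding=None):
    """Return True if `text` can be encoded with `encoding`.

    When no encoding is given, fall back to the one reported by
    sys.stdout, and to ASCII if stdout does not report one.
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

char = "\u010d"  # the letter from the question (c with caron)
for enc in ("utf-8", "latin-1", "ascii"):
    # Only UTF-8 of these three can represent U+010D.
    print(enc, can_print(char, enc))
print("stdout:", can_print(char))
```

If the last line prints False, the terminal's encoding is the thing to fix, not the string.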

If for some reason you have a mis-decoded Unicode string, you can take advantage of the fact that the first 256 Unicode code points correspond one-to-one to the latin1 encoding and fix it:

>>> s = u'\u00c4\u008d'
>>> s.encode('latin1')
'\xc4\x8d'
>>> s.encode('latin1').decode('utf8')
u'\u010d'
>>> print(s.encode('latin1').decode('utf8'))
č
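The same round trip can be wrapped in a small helper. This is a Python 3 sketch (in Python 3 the u prefix is optional and str is already Unicode); the name fix_mojibake and the choice to return the input unchanged when the repair does not apply are my own assumptions, not something from the question.

```python
def fix_mojibake(text):
    """Undo a UTF-8 byte sequence that was mis-decoded as latin1.

    Returns the input unchanged if it contains code points above
    U+00FF or if the resulting bytes are not valid UTF-8.
    """
    try:
        # latin1 maps code points 0-255 straight back to byte values.
        raw = text.encode("latin-1")
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

mis_decoded = "\u00c4\u008d"
fixed = fix_mojibake(mis_decoded)
# ascii() keeps the output printable on any terminal encoding.
print(ascii(fixed))
```

Note that this only helps when the wrong decoding step really was latin1; other wrong codecs need their own inverse.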

That said, if you have a mis-decoded Unicode string, you should show the file you have or the code that read it, and solve the problem at its source.

After fighting with Python for over an hour, I decided to look for a solution in another language. This is how my goal can be achieved in C#:

var s = "\u00c4\u008d";
var newS = Encoding.UTF8.GetString(Encoding.Default.GetBytes(s));
File.WriteAllText(@"D:\tmp\test.txt", newS, Encoding.UTF8);

Finally! The file now contains č.

Inspired by this C# approach, I came up with the following (seemingly) equivalent solution in Python:

>>> s = u"\u00c4\u008d"
>>> arr = bytearray(map(ord, s))
>>> print arr.decode("utf-8")
č

I'm not sure how good this solution is, but it seems to work in my case.
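For what it's worth, mapping ord over the string produces the same bytes as encoding it as latin1, as long as every code point is below 256. A Python 3 sketch of that comparison (bytes stands in for the Python 2 bytearray here, and the variable names are only illustrative):

```python
s = "\u00c4\u008d"

# Each code point becomes one byte with the same numeric value.
via_ord = bytes(map(ord, s))
via_latin1 = s.encode("latin-1")
same_bytes = via_ord == via_latin1

decoded = via_ord.decode("utf-8")
print(same_bytes, ascii(decoded))

# Code points above 255 do not fit in a byte, so this approach fails.
try:
    bytes(map(ord, "\u010d"))
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
print(out_of_range_ok)
```

So the approach works for the same inputs as the latin1 round trip, and fails (with a ValueError rather than a UnicodeEncodeError) for the same inputs too.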
