Converting unicode string to utf-8
Firstly, I am aware that there are tons of questions regarding encoding and decoding of strings in Python 2.x, but I can't seem to find a solution to this problem.
I have a unicode string that contains the letter č, which is represented as \u00c4\u008d.
If I write this in the Python console:
>>> a = u"\u00c4\u008d"
>>> print a
I get two strange characters printed out instead of č, probably because the actual encoding of that string is supposed to be UTF-8. Therefore I tried to use .decode("utf-8"), but for this I get the standard UnicodeEncodeError.
Do you know how I can make Python print that string as č in the console?
Thanks a lot
č is not represented by u'\u00c4\u008d'. Those two hex values are the UTF-8-encoded bytes of the character, so they should be written in a byte string as '\xc4\x8d'. Example:
>>> s = '\xc4\x8d'
>>> s.decode('utf8')
u'\u010d'
>>> print(s.decode('utf8'))
č
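For context on the error in the question: in Python 2, calling .decode("utf-8") on a unicode object first encodes it implicitly to bytes with the default ASCII codec, and that hidden step is what raises UnicodeEncodeError. The sketch below reproduces that step explicitly. It is written in Python 3 syntax (an assumption, since this thread uses Python 2), where str has no .decode method, so the ASCII encode is spelled out by hand; the variable names are illustrative only.

```python
# Python 3 sketch of the hidden step Python 2 performs in unicode.decode():
# the text is first encoded with the ASCII codec, which cannot represent
# code points above 0x7F.
text = u"\u00c4\u008d"

try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("implicit ASCII step fails:", exc.reason)
```

This is why decoding has to start from a byte string, as in the example above, rather than from a unicode string.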
Caveat: Your terminal must be configured with an encoding that supports the character for it to print correctly, or you will see a UnicodeEncodeError.
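One way to check the caveat before printing is to test whether the console's encoding can represent the character at all. This is a minimal sketch in Python 3 syntax (an assumption, since the thread uses Python 2); the helper name can_print is hypothetical and not part of any library.

```python
import sys

def can_print(text, encoding=None):
    """Return True if `text` can be encoded with `encoding` (default: the console's)."""
    # sys.stdout.encoding may be None when output is piped; fall back to ASCII then.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(can_print(u"\u010d"))            # depends on how the console is configured
print(can_print(u"\u010d", "utf-8"))   # True: UTF-8 covers all of Unicode
print(can_print(u"\u010d", "cp1252"))  # False: cp1252 has no č
```

If the first call reports False, the fix is in the terminal or environment configuration rather than in the string itself.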
If for some reason you have a mis-decoded Unicode string, you can take advantage of the fact that the first 256 code points of Unicode correspond to the latin1 encoding and fix it:
>>> s = u'\u00c4\u008d'
>>> s.encode('latin1')
'\xc4\x8d'
>>> s.encode('latin1').decode('utf8')
u'\u010d'
>>> print(s.encode('latin1').decode('utf8'))
č
If you have a mis-decoded Unicode string, though, you should show the file you have or the code that read it, and solve the problem there.
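To illustrate what "solving the problem there" can look like, here is a small sketch of how such a string typically arises: UTF-8 bytes get decoded with the wrong codec. It uses Python 3 syntax (an assumption, since the thread uses Python 2), and the variable raw stands in for whatever bytes the real file or response contains.

```python
# UTF-8 bytes for č, as they might sit in a file or a network response.
raw = b"\xc4\x8d"

# Wrong codec: latin1 maps each byte to its own character,
# which yields the two-character string from the question.
wrong = raw.decode("latin1")   # wrong == u"\u00c4\u008d"

# Correct codec: the two bytes together form a single character.
right = raw.decode("utf-8")    # right == u"\u010d"

print(len(wrong), len(right))
```

Decoding with the right codec at the point where the bytes are read makes the latin1 round trip unnecessary.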
After fighting with Python for over an hour, I decided to look for a solution in another language. This is how my goal can be achieved in C#:
var s = "\u00c4\u008d";
var newS = Encoding.UTF8.GetString(Encoding.Default.GetBytes(s));
File.WriteAllText(@"D:\tmp\test.txt", newS, Encoding.UTF8);
Finally! The file now contains č.
Inspired by this approach in C#, I managed to come up with the following (seemingly) equivalent solution in Python:
>>> s = u"\u00c4\u008d"
>>> arr = bytearray(map(ord, s))
>>> print arr.decode("utf-8")
č
I'm not sure how good this solution is, but it seems to work in my case.
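As far as I can tell, mapping ord over the string produces the same bytes as encoding it with latin1, so the two approaches in this thread should be interchangeable. The sketch below checks that, along with the shared limitation that every code point must be below 256. It is written in Python 3 syntax (an assumption, since the thread uses Python 2), and the variable names are illustrative.

```python
s = u"\u00c4\u008d"

via_ord = bytes(bytearray(map(ord, s)))   # the bytearray approach from this answer
via_latin1 = s.encode("latin1")           # the latin1 approach from the other answer

print(via_ord == via_latin1)                    # True
print(via_latin1.decode("utf-8") == u"\u010d")  # True

# Both approaches break down once a code point is 256 or higher:
# bytearray() rejects it with ValueError (latin1 raises UnicodeEncodeError).
try:
    bytearray(map(ord, u"\u010d"))
    out_of_range = False
except ValueError:
    out_of_range = True
print(out_of_range)                             # True
```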