Converting a string to unicode type in Python

Question

I'm trying this code:

s = "سلام"
'{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))

but this error occurs:

 '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16)) 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd3 in position 0: ordinal not in range(128)

I tried '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16)) but nothing changed.

What should I do?

Answer 1

Since you're using Python 2, s = "سلام" is a byte string (in whatever encoding your terminal uses, presumably UTF-8):

>>> s = "سلام"
>>> s
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
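The error in the question follows from this. When encode is called on a Python 2 byte string, the interpreter first decodes it implicitly with the ASCII codec, and that hidden step is what fails. The sketch below reproduces the same failure explicitly; it is written so that it also runs on Python 3, and it assumes the UTF-8 bytes shown above. The question's traceback reports byte 0xd3 rather than 0xd8, which hints that the asker's terminal may have used a different encoding.

```python
# UTF-8 bytes of the Arabic word from the question.
raw = b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'

message = None
try:
    # Python 2 performs this decode implicitly when you call
    # .encode() on a byte string; here it is done by hand.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```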

You cannot encode byte strings (they are already "encoded"). You're looking for unicode ("real") strings, which in Python 2 must be prefixed with u:

>>> s = u"سلام"
>>> s
u'\u0633\u0644\u0627\u0645'
>>> '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))
'1101100010110011110110011000010011011000101001111101100110000101'
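The 'hex' codec used above exists only in Python 2. As a rough sketch of the same conversion that should run on both Python 2 and Python 3, binascii.hexlify can stand in for .encode('hex'). The helper name text_to_bits is made up for this illustration and is not part of the original answer.

```python
import binascii

def text_to_bits(text, encoding='utf-8'):
    """Return the encoded bytes of `text` as a string of 0s and 1s."""
    raw = text.encode(encoding)                          # unicode text -> bytes
    hex_digits = binascii.hexlify(raw).decode('ascii')   # bytes -> hex digits
    return '{:b}'.format(int(hex_digits, 16))            # hex -> binary digits

# Escapes are used so the file needs no source-encoding declaration.
bits = text_to_bits(u'\u0633\u0644\u0627\u0645')
print(bits)
```

Note that going through int drops leading zero bits, so a string whose first byte is below 0x80 yields fewer than 8 bits per byte (text_to_bits(u'A') gives '1000001').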

If you're getting a byte string from a function such as raw_input, then your string is already encoded; just skip the encode part:

'{:b}'.format(int(s.encode('hex'), 16))

or (if you're going to do anything else with it) convert it to unicode:

s = s.decode('utf8')

This assumes that your input is UTF-8 encoded. If that might not be the case, check sys.stdin.encoding first.
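One way to apply that check is sketched below. The helper decode_input and its fallback parameter are hypothetical names chosen for this example; sys.stdin.encoding can be None (for instance when input is piped), so the sketch falls back to UTF-8 in that case, which is an assumption rather than a safe default for every system.

```python
import sys

def decode_input(raw, encoding=None, fallback='utf-8'):
    """Decode a byte string read from the terminal into unicode text.

    Uses `encoding` if given, otherwise stdin's reported encoding,
    otherwise `fallback` (an assumed default).
    """
    if encoding is None:
        encoding = getattr(sys.stdin, 'encoding', None) or fallback
    return raw.decode(encoding)

# Explicit encoding, so the result does not depend on the terminal.
text = decode_input(b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85', encoding='utf-8')
```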

i18n stuff is complicated; here are two articles that will help you further:

Answer 1
7, accepted, 2013-10-08 21:27:23
