Convert a UTF-8 String to a string in Python

If I have a unicode string such as:

s = u'c\r\x8f\x02\x00\x00\x02\u201d'

how can I convert this to just a regular string that isn't in unicode format? I.e. I want to extract:

f = '\x00\x00\x02\u201d'

and I do not want it in unicode format. I need to do this because I need to convert the unicode in s to an integer value, but if I try it with just s:

int((s[-4]+s[-3]+s[-2]+s[-1]).encode('hex'), 16)

Traceback (most recent call last):
  File "<pyshell#48>", line 1, in <module>
    int((s[-4]+s[-3]+s[-2]+s[-1]).encode('hex'), 16)
  File "C:\Python27\lib\encodings\hex_codec.py", line 24, in hex_encode
    output = binascii.b2a_hex(input)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 3: ordinal not in range(128)

yet if I do it with f:

int(f.encode('hex'), 16)
664608376369508L

And this is the correct integer value I want to extract from s. Is there a method where I can do this?

Normally, the device sends back something like: \x00\x00\x03\xcc which I can easily convert to 972

OK, so I think what's happening here is you're trying to read four bytes from a byte-oriented device, and decode that to an integer, interpreting the bytes as a 32-bit word in big-endian order.

To do this, use the struct module and byte strings:

>>> import struct
>>> struct.unpack('>i', b'\x00\x00\x03\xCC')[0]
972
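
To make the format string concrete, here is a minimal sketch of what "big-endian 32-bit word" means. The helper be32_manual and the sample bytes are only illustrations, not anything from the device's API; the struct format codes are the standard ones ('>' for big-endian, 'i' for signed and 'I' for unsigned 32-bit).

```python
import struct

def be32_manual(raw):
    """Combine bytes, most significant first, into an unsigned integer.

    Hypothetical helper, shown only to spell out what '>I' computes.
    """
    value = 0
    for byte in bytearray(raw):  # bytearray yields ints on Python 2 and 3
        value = (value << 8) | byte
    return value

raw = b'\x00\x00\x03\xcc'  # assumed four-byte reply from the device

signed = struct.unpack('>i', raw)[0]    # signed 32-bit, big-endian
unsigned = struct.unpack('>I', raw)[0]  # unsigned 32-bit, big-endian

print(signed, unsigned, be32_manual(raw))
```

The signed and unsigned codes only differ once the top bit is set, so which one fits depends on whether the device can report negative values.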

(I'm not sure why you were trying to reverse the string then hex-encode; that would put the bytes in the wrong order and give much too large output.)

I don't know how you're reading from the device, but at some point you've decoded the bytes into a text (Unicode) string. Judging from the U+201D character in there I would guess that the device originally gave you a byte 0x94 and you decoded it using code page 1252 or another similar Windows default ('ANSI') code page.

>>> struct.unpack('>i', b'\x00\x00\x02\x94')[0]
660
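
As a rough check of that guess, the sketch below decodes the assumed wire bytes with Python's cp1252 codec and compares the result with the tail of the string from the question. That the device sent 0x94, and that cp1252 was the code page involved, are both assumptions.

```python
s = u'c\r\x8f\x02\x00\x00\x02\u201d'  # the string from the question

raw = b'\x00\x00\x02\x94'    # assumed bytes the device actually sent
text = raw.decode('cp1252')  # the suspected accidental decode step

# Under cp1252, byte 0x94 is the right double quotation mark U+201D,
# so the decoded text matches the last four characters of s.
print(repr(text))
print(text == s[-4:])
```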

It may be possible to reverse the incorrect decoding step by encoding back to bytes using the same mapping, but this is dicey and depends on which encodings are involved (not all bytes are mapped to anything usable in all encodings). Better would be to look at where the input is coming from, find where that decode step is happening, and get rid of it so you keep hold of the raw bytes the device sent you.
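
A minimal sketch of that workaround, assuming the stray decode really was cp1252: re-encode the last four characters and unpack them. The function name last_word is made up for illustration. The same string also shows why this is fragile, since U+008F has no mapping in Python's cp1252 codec and encoding the whole string fails.

```python
import struct

s = u'c\r\x8f\x02\x00\x00\x02\u201d'  # the string from the question

def last_word(text, encoding='cp1252'):
    """Hypothetical helper: undo an assumed decode and read the final
    four bytes as a big-endian signed 32-bit integer."""
    raw = text[-4:].encode(encoding)
    return struct.unpack('>i', raw)[0]

value = last_word(s)  # works: the last four characters all map back to bytes

# Encoding the whole string does not work: u'\x8f' is undefined in
# Python's cp1252 table, so the round trip is not guaranteed in general.
try:
    s.encode('cp1252')
    whole_string_ok = True
except UnicodeEncodeError:
    whole_string_ok = False

print(value, whole_string_ok)
```

Keeping the raw bytes from the device avoids both the guesswork about the code page and the unmappable characters.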
