[英]Python: Convert utf-8 string to byte string
我有以下函數來解析字節序列中的utf-8字符串
注 - 'length_size'是表示utf-8字符串長度所需的字節數
def parse_utf8(self, bytes, length_size):
length = bytes2int(bytes[0:length_size])
value = ''.join(['%c' % b for b in bytes[length_size:length_size+length]])
return value
def bytes2int(raw_bytes, signed=False):
"""
Convert a string of bytes to an integer (assumes little-endian byte order)
"""
if len(raw_bytes) == 0:
return None
fmt = {1:'B', 2:'H', 4:'I', 8:'Q'}[len(raw_bytes)]
if signed:
fmt = fmt.lower()
return struct.unpack('<'+fmt, raw_bytes)[0]
我想反向編寫函數 - 即一個函數將采用utf-8編碼的字符串並將其表示作為字節字符串返回。
到目前為止,我有以下內容:
def create_utf8(self, utf8_string):
return utf8_string.encode('utf-8')
嘗試測試時遇到以下錯誤:
File "writer.py", line 229, in create_utf8
return utf8_string.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x98 in position 0: ordinal not in range(128)
如果可能的話,我想采用類似於parse_utf8示例的代碼結構。 我究竟做錯了什么?
謝謝您的幫助!
更新:測試驅動程序,現在正確
def random_utf8_seq(self, length):
# from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂ㥹ĆćČčĎďĐđĘęĚěĹ弾ŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"
utf8_seq = u""
for i in range(length):
utf8_seq += random.choice(test_charset)
return utf8_seq
我收到以下錯誤:
input_str = self.random_utf8_seq(200)
File "writer.py", line 226, in random_utf8_seq
print unicode(utf8_seq, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 0: invalid start byte
如果您想要utf-8 => bytestring轉換,那么您可以使用str.encode
,但首先需要在示例中正確標記源字符串的類型 - 使用u
作為unicode的前綴:
# coding: utf-8
import random
def random_utf8_seq(length):
# from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂ㥹ĆćČčĎďĐđĘęĚěĹ弾ŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"
utf8_seq = u''
for i in range(length):
utf8_seq += random.choice(test_charset)
print utf8_seq.encode('utf-8')
return utf8_seq.encode('utf-8')
print( type(random_utf8_seq(200)) )
- 輸出 -
õ3×sÔP{Ć.s(Ë°˙ě÷xÓ@bűV—û´ő¢uZÓČn˜0|_"Ðyø`êš·ÏÝhunÍÅ=ä?
óP{tlÇűpb¸7s´ňƒG—čøň\zčłŢXÂYqLĆúěă(ÿî ¥PyÐÔŇnל¦Ì˝+•ì›
ŻÛ°Ñ^ÝC÷ŢŐIñJĹţÒył"MťÆ‹ČČ4þ!»šåŮ@Öhň-
ÈLGĄ¢ß˛Đ¯.ªÆź˘Ř^ĽÛŹËaĂŕ¹#¢éüÜńlÊqš=VřU…‚–MŽÎÉèoÙŹŠ¨Ð
<type 'str'>
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.