Convert bytes to string in Python 3.6
I am trying to read and process a file. This works perfectly fine in Python 2.7, but I can't get it working in Python 3. In Python 2.7 it works without specifying any encoding, whereas in Python 3 I have tried every combination, with and without an encoding.
After digging deeper, I found that the content returned by read() differs between the two versions.
Code in Python 2.7 that works:
>>> f = open('resource.cgn', 'r')
>>> content = f.read()
>>> type(content)
<type 'str'>
>>> content[0:20]
'\x04#lwq \x7f`g \xa0\x03\xa3,ess to'
>>> content[0]
'\x04'
However, in Python 3:
>>> f = open('resource.cgn','r')
>>> content = f.read()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 10: ordinal not in range(128)
>>> f = open('resource.cgn','rb')
>>> content = f.read()
>>> type(content)
<class 'bytes'>
>>> content[0:20]
b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to'
>>> content[0]
4
>>> content.decode('utf8')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 10: invalid start byte
I would like to get the same output as in Python 2.7: content should be of type str, and content[0] should be the str '\x04' rather than the int 4.
Any pointers on how I can get this? I have tried various encodings without any success.
In 3.X, str is what unicode was in 2.X, and file objects opened in text mode decode when you read from them and encode when you write to them. What 2.X called str is now bytes in 3.X. There are only minor differences between 3.X's bytes and 2.X's str; both essentially hold 8-bit text.
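As a rough illustration of the text-mode versus binary-mode behaviour described above, the sketch below writes the 20 sample bytes from the question to a temporary file (a stand-in for resource.cgn, which is not available here) and reads them back both ways. The choice of latin-1 is an assumption: it maps every byte to the code point with the same value, so it never raises, but it only makes sense if you really want one character per byte.

```python
import os
import tempfile

# Stand-in for the first 20 bytes of resource.cgn (taken from the question).
raw = b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to'

# Write the sample bytes to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Binary mode: nothing is decoded, read() returns bytes.
    with open(path, 'rb') as f:
        as_bytes = f.read()

    # Text mode with an explicit single-byte encoding: read() returns str.
    # newline='' disables newline translation so every byte maps to one character.
    with open(path, 'r', encoding='latin-1', newline='') as f:
        as_text = f.read()
finally:
    os.remove(path)

print(type(as_bytes), type(as_text))
print(repr(as_text[0]))
```

With this approach content[0] is a one-character str, as in Python 2.7, at the cost of treating binary data as if it were text.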
Here's a simple trick to convert b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to' to str in 3.X:
>>> content = ''.join(chr(x) for x in b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to')
>>> content
'\x04#lwq \x7f`g \xa0\x03£,ess to'
>>> content[0]
'\x04'
Decoding the bytes string fails because it contains bytes that are not valid UTF-8; the same goes for ASCII.
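To make the failure concrete, here is a small sketch using the sample bytes from the question. It also shows that decoding with latin-1 gives the same result as the ''.join(chr(x) ...) trick, because latin-1 maps each byte value to the Unicode code point with the same number. Whether latin-1 is the right interpretation of the file is an assumption; for genuinely binary data it only serves as a lossless byte-to-character mapping.

```python
data = b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to'

# UTF-8 rejects the data: 0xa0 cannot start a UTF-8 sequence.
try:
    data.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print('utf-8 failed at position', exc.start)

# latin-1 accepts every byte value, so this never raises.
text = data.decode('latin-1')

# Same result as building the string one chr() at a time.
assert text == ''.join(chr(x) for x in data)

# The mapping is reversible, so the original bytes can be recovered.
round_trip = text.encode('latin-1')
```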
However, it's worth mentioning that in 3.X bytes is meant for processing binary data, while str is for Unicode strings only. It's therefore recommended to use bytes instead of str for binary strings in 3.X:
>>> content = b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to'
>>> hex(content[0])
'0x4'
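If you stay with bytes, the main surprise is that indexing returns an int. A short sketch of the usual workarounds, again using the sample bytes from the question (the variable names are only illustrative):

```python
content = b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to'

# Indexing a bytes object yields an int in Python 3.
first_int = content[0]

# Slicing yields a bytes object of length 1, the closest match to 2.X's '\x04'.
first_byte = content[0:1]

# Comparisons and searches work directly with bytes literals.
has_marker = content.startswith(b'\x04')

# Iterating yields ints; bytes([n]) turns a single int back into a one-byte object.
leading_values = list(content[:3])
rebuilt = bytes([first_int])
```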