
Convert bytes to string in Python 3.6

I am trying to read and process a file. This works perfectly fine in Python 2.7, but I can't get it working in Python 3. In Python 2.7 it works without providing any encoding, whereas in Python 3 I have tried every combination, with and without an encoding.

After digging deeper, I found that the content returned by read() differs between the two versions.

Code in Python 2.7 that works:

>>> f = open('resource.cgn', 'r')
>>> content = f.read()
>>> type(content)
<type 'str'>
>>> content[0:20]
'\x04#lwq \x7f`g \xa0\x03\xa3,ess to'
>>> content[0]
'\x04'

However, in Python 3:

>>> f = open('resource.cgn','r')
>>> content = f.read()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 10: ordinal not in range(128)
>>> f = open('resource.cgn','rb')
>>> content = f.read()
>>> type(content)
<class 'bytes'>
>>> content[0:20]
b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to'
>>> content[0]
4
>>> content.decode('utf8')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 10: invalid start byte

I would like to get the same output as in Python 2.7. The content should be of type str, and content[0] should be the str '\x04', not the int 4.

Any pointers on how I can get this? I have tried various encodings without success.

In 3.X, str is what 2.X called unicode, and file objects opened in text mode decode on read and encode on write. What 2.X called str is bytes in 3.X. The two are very similar, since both hold sequences of 8-bit values; the most visible difference is that indexing bytes in 3.X returns an int, whereas indexing a 2.X str returns a length-1 str.

Here's a simple trick to convert b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to' to str in 3.X:

>>> content = ''.join(chr(x) for x in b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to')
>>> content
'\x04#lwq \x7f`g \xa0\x03£,ess to'
>>> content[0]
'\x04'
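
The chr()-join above maps each byte value to the code point with the same number. As far as I know, that is exactly what the latin-1 (ISO-8859-1) codec does, so a plain decode should give the same result. A minimal sketch, assuming the bytes were read with open('resource.cgn', 'rb') as in the question (the variable names raw, text and joined are only illustrative):

```python
# Sample bytes copied from the question; in practice they would come from
# open('resource.cgn', 'rb').read().
raw = b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to'

# latin-1 maps every byte value 0-255 to the code point with the same
# number, so this decode cannot raise and loses no information.
text = raw.decode('latin-1')

# Same result as joining chr(x) for every byte.
joined = ''.join(chr(x) for x in raw)
assert text == joined

print(type(text))      # <class 'str'>
print(repr(text[0]))   # '\x04'

# The mapping is reversible, which matters if the data is written back out.
assert text.encode('latin-1') == raw
```

Note that this only reproduces the 2.X behaviour of having one character per byte; it does not mean the file is actually Latin-1 text.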

Decoding the bytes object as UTF-8 fails because it contains bytes that are not valid UTF-8 (0xa0 at position 10 cannot start a character). ASCII fails for the same reason: 0xa0 is outside the range 0-127.
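
To see where a decode goes wrong, the exception itself can be inspected. The sketch below uses the sample bytes from the question; the names bad_offset, bad_byte and lossy are illustrative. It also shows the errors='replace' handler, which avoids the exception but is lossy, so it is probably not what you want for binary data:

```python
raw = b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first offending byte.
    bad_offset = exc.start
    bad_byte = raw[bad_offset]
    print(bad_offset, hex(bad_byte), exc.reason)  # 10 0xa0 invalid start byte

# errors='replace' substitutes U+FFFD for each undecodable byte instead of
# raising. The original byte values (0xa0 and 0xa3 here) are lost.
lossy = raw.decode('utf-8', errors='replace')
print(lossy.count('\ufffd'))  # 2
```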

However, it's worth noting that in 3.X bytes is meant for binary data and str is for Unicode text only. For binary data, it is therefore recommended to use bytes rather than str:

>>> content = b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to'
>>> hex(content[0])
'0x4'
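
If the code is kept on bytes, the main habit to change from 2.X is indexing: a single index returns an int, while a slice returns bytes. A short sketch, again with the sample data from the question (first_int, first_byte and per_byte are illustrative names):

```python
content = b'\x04#lwq \x7f`g \xa0\x03\xa3,ess to'

first_int = content[0]      # indexing gives an int
first_byte = content[0:1]   # a one-element slice gives bytes

print(first_int)    # 4
print(first_byte)   # b'\x04'

# Comparisons and prefix checks work with bytes literals.
assert first_byte == b'\x04'
assert content.startswith(b'\x04#')

# Iterating yields ints; wrap each one with bytes([...]) to get
# length-1 bytes objects, similar to iterating a 2.X str.
per_byte = [bytes([b]) for b in content[:3]]
print(per_byte)     # [b'\x04', b'#', b'l']
```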


 