Why is the output of print different in Python 2 and Python 3 for the same string?
In Python 2:
$ python2 -c 'print "\x08\x04\x87\x18"' | hexdump -C
00000000 08 04 87 18 0a |.....|
00000005
In Python 3:
$ python3 -c 'print("\x08\x04\x87\x18")' | hexdump -C
00000000 08 04 c2 87 18 0a |......|
00000006
Why is there a "\xc2" byte here?
Edit:
I think when the string has a non-ASCII character, Python 3 will append the byte "\xc2" to the string (as @Ashraful Islam said).
So how can I avoid this in Python 3?
Consider the following snippet of code:
import sys
for i in range(128, 256):
    sys.stdout.write(chr(i))
Run this with Python 2 and look at the result with hexdump -C:
00000000 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f |................|
Et cetera. No surprises; 128 bytes from 0x80 to 0xff.
Do the same with Python 3:
00000000 c2 80 c2 81 c2 82 c2 83 c2 84 c2 85 c2 86 c2 87 |................|
...
00000070 c2 b8 c2 b9 c2 ba c2 bb c2 bc c2 bd c2 be c2 bf |................|
00000080 c3 80 c3 81 c3 82 c3 83 c3 84 c3 85 c3 86 c3 87 |................|
...
000000f0 c3 b8 c3 b9 c3 ba c3 bb c3 bc c3 bd c3 be c3 bf |................|
To summarize:
0x80 to 0xbf has 0xc2 prepended.
0xc0 to 0xff has bit 6 set to zero and has 0xc3 prepended.
So, what's going on here?
In Python 2, strings are byte strings and no conversion is done. Tell it to write something outside the 0-127 ASCII range and it says "okey-doke!" and just writes those bytes. Simple.
In Python 3, strings are Unicode. When non-ASCII characters are written, they must be encoded in some way. Here the encoding used for standard output is UTF-8, the usual default on most modern systems.
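To see the encoding step in isolation, here is a minimal Python 3 sketch. It encodes the string from the question explicitly instead of relying on whatever encoding stdout happens to use; the variable names are only illustrative, and the Latin-1 line is added for contrast because that codec maps code points 0 to 255 onto the same byte values.

```python
# The same four characters the question passes to print().
text = "\x08\x04\x87\x18"

# UTF-8 needs two bytes for U+0087, which is where the extra 0xc2 comes from.
utf8_bytes = text.encode("utf-8")

# Latin-1 maps each code point in 0..255 to the single byte with the same value.
latin1_bytes = text.encode("latin-1")

print(utf8_bytes.hex())    # 0804c28718
print(latin1_bytes.hex())  # 08048718
```

The first line of output matches the Python 3 hexdump in the question (minus the trailing newline), and the second matches the Python 2 one.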
So, how are these values encoded in UTF-8? Code points from 0x80 to 0x7ff are encoded as follows:
110vvvvv 10vvvvvv
Where the 11 v characters are the bits of the code point.
Thus:
0x80 hex
1000 0000 8-bit binary
000 1000 0000 11-bit binary
00010 000000 divide into vvvvv vvvvvv
11000010 10000000 resulting UTF-8 octets in binary
0xc2 0x80 resulting UTF-8 octets in hex
0xc0 hex
1100 0000 8-bit binary
000 1100 0000 11-bit binary
00011 000000 divide into vvvvv vvvvvv
11000011 10000000 resulting UTF-8 octets in binary
0xc3 0x80 resulting UTF-8 octets in hex
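The bit shuffling above can be sketched in a few lines of Python 3. The function name encode_two_byte is hypothetical and covers only the two-byte range described here (0x80 to 0x7ff); in real code you would simply call str.encode.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in 0x80..0x7ff using the 110vvvvv 10vvvvvv pattern."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points 0x80..0x7ff use the two-byte form")
    lead = 0b11000000 | (code_point >> 6)           # 110 + upper 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)  # 10 + lower 6 bits
    return bytes([lead, trail])


for cp in (0x80, 0x87, 0xC0, 0xFF):
    print(hex(cp), encode_two_byte(cp).hex())
```

For 0x80 and 0xc0 this reproduces the octets worked out by hand above, and for the whole two-byte range it agrees with Python's built-in UTF-8 encoder.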
So that's why you're getting a c2 before 87.
How to avoid all this in Python 3? Use the bytes type.
Python 2's default string type is byte strings. Byte strings are written "abc" while Unicode strings are written u"abc".
Python 3's default string type is Unicode strings. Byte strings are written as b"abc" while Unicode strings are written "abc" (u"abc" still works, too). Since Unicode has far more than 256 code points, printing them as bytes requires an encoding (UTF-8 in your case), which needs multiple bytes for every code point above 0x7f.
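A short Python 3 sketch of the difference between the literal forms may help; the variable names are illustrative only.

```python
byte_string = b"\x87"     # bytes: one raw byte with value 0x87
text_string = "\x87"      # str: one Unicode code point, U+0087
legacy_literal = u"\x87"  # the u prefix is still accepted and also gives str

kinds = (type(byte_string).__name__, type(text_string).__name__)
same_text = legacy_literal == text_string

# One byte, one character, but two bytes once the character is UTF-8 encoded.
lengths = (len(byte_string), len(text_string), len(text_string.encode("utf-8")))

# Indexing a bytes object yields an integer, not a one-character string.
first_byte = byte_string[0]

print(kinds, same_text, lengths, hex(first_byte))
```

The bytes object holds exactly the value you typed, while the str object holds a code point that still has to be encoded before it can leave the program.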
First use a byte string in Python 3 to get the same type as the Python 2 string. Then, because Python 3's print expects Unicode strings, use sys.stdout.buffer.write to write to the raw stdout interface, which expects byte strings.
python3 -c 'import sys; sys.stdout.buffer.write(b"\x08\x04\x87\x18")'
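In a script rather than a one-liner, the same idea might look like the sketch below. The helper write_raw is a hypothetical name; it defaults to sys.stdout.buffer, and the demonstration writes to an in-memory io.BytesIO stream so the result can be inspected.

```python
import io
import sys


def write_raw(data: bytes, stream=None) -> int:
    """Write bytes unchanged to a binary stream (default: the raw stdout buffer)."""
    if stream is None:
        stream = sys.stdout.buffer
    return stream.write(data)


# Capture the output in memory to confirm that no encoding step is applied.
sink = io.BytesIO()
write_raw(b"\x08\x04\x87\x18", sink)
captured = sink.getvalue()
```

Calling write_raw(b"\x08\x04\x87\x18") with no stream argument sends the four bytes to stdout, where hexdump -C should show them without the c2.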
Note that there are similar issues when writing to a file. To avoid any encoding translation, open the file in binary mode 'wb' and write byte strings.
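As a rough illustration of the file case, the following Python 3 sketch writes the same data once in binary mode and once in text mode with UTF-8, then reads both files back as raw bytes. The file names and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

payload = b"\x08\x04\x87\x18"

with tempfile.TemporaryDirectory() as tmp:
    bin_path = os.path.join(tmp, "raw.bin")
    txt_path = os.path.join(tmp, "text.txt")

    # Binary mode: the bytes go to disk untouched.
    with open(bin_path, "wb") as f:
        f.write(payload)

    # Text mode: the str is encoded on the way out, here explicitly as UTF-8.
    with open(txt_path, "w", encoding="utf-8", newline="") as f:
        f.write(payload.decode("latin-1"))

    with open(bin_path, "rb") as f:
        binary_result = f.read()
    with open(txt_path, "rb") as f:
        text_result = f.read()

print(binary_result.hex())  # 08048718
print(text_result.hex())    # 0804c28718
```

Only the binary-mode file contains the original four bytes; the text-mode file picks up the same c2 prefix seen on stdout.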