Why do Python 2 and Python 3 print different output for the same string?

Question

In Python 2:

$ python2 -c 'print "\x08\x04\x87\x18"' | hexdump -C
00000000  08 04 87 18 0a                                    |.....|
00000005

In Python 3:

$ python3 -c 'print("\x08\x04\x87\x18")' | hexdump -C
00000000  08 04 c2 87 18 0a                                 |......|
00000006

Why is there a "\xc2" byte here?

Edit:

I think that when the string contains a non-ASCII character, python3 will append the byte "\xc2" to the string (as @Ashraful Islam said).

So how can I avoid this in python3?

Answer 1

Consider the following snippet of code:

import sys
for i in range(128, 256):
    sys.stdout.write(chr(i))

Run this with Python 2 and look at the result with hexdump -C:

00000000  80 81 82 83 84 85 86 87  88 89 8a 8b 8c 8d 8e 8f  |................|

Et cetera. No surprises; 128 bytes from 0x80 to 0xff.

Do the same with Python 3:

00000000  c2 80 c2 81 c2 82 c2 83  c2 84 c2 85 c2 86 c2 87  |................|
...
00000070  c2 b8 c2 b9 c2 ba c2 bb  c2 bc c2 bd c2 be c2 bf  |................|
00000080  c3 80 c3 81 c3 82 c3 83  c3 84 c3 85 c3 86 c3 87  |................|
...
000000f0  c3 b8 c3 b9 c3 ba c3 bb  c3 bc c3 bd c3 be c3 bf  |................|

To summarize:

Everything from 0x80 to 0xbf has 0xc2 prepended.
Everything from 0xc0 to 0xff has bit 6 set to zero and has 0xc3 prepended.

So, what's going on here?

In Python 2, plain strings are byte strings and no conversion is done. Tell it to write something outside the 0-127 ASCII range, it says “okey-doke!” and just writes those bytes. Simple.

In Python 3, strings are Unicode. When non-ASCII characters are written, they must be encoded in some way. The encoding in use here is UTF-8, the usual default.
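To see the encoding step in isolation, here is a minimal sketch that wraps an in-memory byte buffer in a text stream, much as Python 3 wraps stdout. The helper name bytes_written_by_text_stream is made up for this illustration, and io.BytesIO stands in for the real output pipe.

```python
import io

def bytes_written_by_text_stream(text, encoding="utf-8"):
    """Return the bytes a text-mode stream emits for `text` (hypothetical helper)."""
    raw = io.BytesIO()  # stand-in for the underlying binary stdout
    # newline="" turns off newline translation so only the encoding step shows.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
    stream.write(text)
    stream.flush()
    return raw.getvalue()

utf8_bytes = bytes_written_by_text_stream("\x08\x04\x87\x18")
latin1_bytes = bytes_written_by_text_stream("\x08\x04\x87\x18", encoding="latin-1")

print(utf8_bytes.hex())    # 0804c28718
print(latin1_bytes.hex())  # 08048718
```

The same four characters come out as five bytes under UTF-8 and four bytes under Latin-1, so the extra c2 is a property of the encoding, not of the string.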

So, how are these values encoded in UTF-8?

Code points from 0x80 to 0x7ff are encoded as follows:

110vvvvv 10vvvvvv

Where the 11 v characters are the bits of the code point.

Thus:

0x80                 hex
1000 0000            8-bit binary
000 1000 0000        11-bit binary
00010 000000         divide into vvvvv vvvvvv
11000010 10000000    resulting UTF-8 octets in binary
0xc2 0x80            resulting UTF-8 octets in hex

0xc0                 hex
1100 0000            8-bit binary
000 1100 0000        11-bit binary
00011 000000         divide into vvvvv vvvvvv
11000011 10000000    resulting UTF-8 octets in binary
0xc3 0x80            resulting UTF-8 octets in hex

So that's why you're getting a c2 before 87.
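The bit layout above can be checked with a short sketch. The function name encode_two_byte_utf8 is invented for this example; it handles only the two-byte range 0x80 to 0x7ff and is compared against Python's built-in encoder.

```python
def encode_two_byte_utf8(code_point):
    """Encode a code point in 0x80..0x7ff as 110vvvvv 10vvvvvv (illustrative helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte UTF-8 range is handled here")
    lead = 0b11000000 | (code_point >> 6)               # 110 + top 5 bits
    continuation = 0b10000000 | (code_point & 0b111111)  # 10 + low 6 bits
    return bytes([lead, continuation])

for cp in (0x80, 0x87, 0xC0, 0xFF):
    manual = encode_two_byte_utf8(cp)
    # The hand-rolled result should agree with the built-in UTF-8 encoder.
    assert manual == chr(cp).encode("utf-8")
    print(hex(cp), "->", manual.hex())
```

For 0x87 the top five bits are 00010 and the low six bits are 000111, which gives c2 87, the pair seen in the question's hexdump.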

How to avoid all this in Python 3? Use the bytes type.
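As a rough Python 3 counterpart to the loop at the top of this answer, the values 0x80 to 0xff can be built as a bytes object and written to a binary stream. In this sketch io.BytesIO is used as a stand-in sink so the result can be inspected; the names high_bytes and sink are only for illustration.

```python
import io

# The values 128..255 as raw bytes: one byte each, no encoding involved.
high_bytes = bytes(range(128, 256))

sink = io.BytesIO()  # stand-in for a binary stream such as sys.stdout.buffer
sink.write(high_bytes)

# For comparison, the same values as a str, encoded to UTF-8.
as_text = "".join(chr(i) for i in range(128, 256))
encoded = as_text.encode("utf-8")

print(len(sink.getvalue()))  # 128
print(len(encoded))          # 256
```

Passing sys.stdout.buffer instead of the BytesIO sink would send the 128 raw bytes to standard output.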

Answer 2

Python 2's default string type is byte strings. Byte strings are written "abc" while Unicode strings are written u"abc".

Python 3's default string type is Unicode strings. Byte strings are written as b"abc" while Unicode strings are written "abc" (u"abc" still works, too). Since there are over a million possible Unicode code points, printing them as bytes requires an encoding (UTF-8 in your case), which can require multiple bytes per code point.

First use a byte string in Python 3 to get the same type as in Python 2. Then, because Python 3's print expects Unicode strings, use sys.stdout.buffer.write to write to the raw stdout interface, which expects byte strings.

python3 -c 'import sys; sys.stdout.buffer.write(b"\x08\x04\x87\x18")'

Note that similar issues arise when writing to a file. To avoid any encoding translation, open files in binary mode 'wb' and write byte strings.
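A small sketch of the file case, assuming a temporary directory and the made-up file names raw.bin and text.txt. It writes the same four values once in binary mode and once in text mode with UTF-8, then reads both files back as bytes.

```python
import os
import tempfile

payload = b"\x08\x04\x87\x18"

with tempfile.TemporaryDirectory() as tmp:
    binary_path = os.path.join(tmp, "raw.bin")  # hypothetical file name
    text_path = os.path.join(tmp, "text.txt")   # hypothetical file name

    # Binary mode: the bytes go to disk unchanged.
    with open(binary_path, "wb") as f:
        f.write(payload)

    # Text mode: the str is encoded (UTF-8 here) on the way out.
    with open(text_path, "w", encoding="utf-8", newline="") as f:
        f.write("\x08\x04\x87\x18")

    with open(binary_path, "rb") as f:
        binary_contents = f.read()
    with open(text_path, "rb") as f:
        text_contents = f.read()

print(binary_contents.hex())  # 08048718
print(text_contents.hex())    # 0804c28718
```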

Answer 1: score 18, accepted, 2017-03-19 08:41:40

Answer 2: score 5, 2017-03-19 16:53:46