
Why are some characters converted to \u notation when displayed by pprint when using utf8?

Here is a console demonstration:

>>> x = "a b"
>>> x
'a\u200ab'
>>> repr( x )
"'a\\u200ab'"

So it seems pprint uses the same mechanism that the console uses to display strings.

Admittedly, the whitespace character between a and b in the initial value bound to x is indeed U+200A. But when using UTF-8 input and output encodings, why would any characters be converted to \u notation for output?

Question 2, of course, is how one can learn the whole set of characters that are converted in that manner.

Question 3, of course, is how one can suppress that behavior.

pprint prints the representation of the object you pass it. From the docs:

The pprint module provides a capability to “pretty-print” arbitrary Python data structures in a form which can be used as input to the interpreter.

And "a form which can be used as input to the interpreter" means you get the object's representation, i.e., what its __repr__ method returns.

If you want strings to be printed using their __str__ method instead of their __repr__, then don't use pprint.
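
The difference between the two methods can be seen in a minimal sketch like the one below. It assumes the string from the question contains U+200A (HAIR SPACE), and it captures pprint's output in an io.StringIO buffer so the result can be inspected; the variable names are only illustrative.

```python
import io
from pprint import pprint

# 'a', HAIR SPACE (U+200A), 'b' -- written with an escape so the source stays ASCII
x = "a\u200ab"

# pprint formats a string through repr(), so the separator comes out escaped.
buf = io.StringIO()
pprint(x, stream=buf)
pprint_text = buf.getvalue()

# str() (which is what print() uses) leaves the character itself in place.
str_text = str(x)

print(pprint_text, end="")    # the repr form, with a \u200a escape
print(ascii(str_text))        # ascii() only makes the raw character visible here
```

So print(x) sends the real U+200A character to the terminal, while pprint(x) sends the six-character escape sequence.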


Here's a Python 3 code snippet that looks for chars that get represented using a \u escape code:

for i in range(1500):
    c = chr(i)
    r = repr(c)
    if r'\u' in r:
        print('{0:4} {0:04x} {1} {2}'.format(i, r, c))

output

 888 0378 '\u0378' ͸
 889 0379 '\u0379' ͹
 896 0380 '\u0380' ΀
 897 0381 '\u0381' ΁
 898 0382 '\u0382' ΂
 899 0383 '\u0383' ΃
 907 038b '\u038b' ΋
 909 038d '\u038d' ΍
 930 03a2 '\u03a2' ΢
1328 0530 '\u0530' ԰
1367 0557 '\u0557' ՗
1368 0558 '\u0558' ՘
1376 0560 '\u0560' ՠ
1416 0588 '\u0588' ֈ
1419 058b '\u058b' ֋
1420 058c '\u058c' ֌
1424 0590 '\u0590' ֐
1480 05c8 '\u05c8' ׈
1481 05c9 '\u05c9' ׉
1482 05ca '\u05ca' ׊
1483 05cb '\u05cb' ׋
1484 05cc '\u05cc' ׌
1485 05cd '\u05cd' ׍
1486 05ce '\u05ce' ׎
1487 05cf '\u05cf' ׏

Note that codepoints > 0xffff are represented using a \U escape code when necessary.

for i in range(65535, 65600):
    c = chr(i)
    r = repr(c)
    if r'\u' in r.lower():
        print('{0:4} {0:04x} {1} {2}'.format(i, r, c))

output

65535 ffff '\uffff' �
65548 1000c '\U0001000c' 𐀌
65575 10027 '\U00010027' 𐀧
65595 1003b '\U0001003b' 𐀻
65598 1003e '\U0001003e' 𐀾

I finally found the documentation that explains it. From the Python Unicode documentation:

int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)

Return 1 or 0 depending on whether ch is a printable character. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)

It partly answers the first question (the fact, not the reason why), and leads to the exact answer for Question 2.

Unicode space separator characters
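
For Question 2, the set can also be computed rather than looked up. The sketch below assumes that str.isprintable() applies the same test as the C macro quoted above, which is how the Python documentation describes it, and it tallies the escaped code points by Unicode general category. The helper name repr_escapes is made up for this example.

```python
import sys
import unicodedata
from collections import Counter

def repr_escapes(ch):
    """Return True if repr() would write ch as an escape because it is nonprintable.

    Assumption: this mirrors Py_UNICODE_ISPRINTABLE. repr() additionally
    escapes backslashes and the enclosing quote character, which are not
    covered here.
    """
    return not ch.isprintable()

# Tally every nonprintable code point by its general category (e.g. 'Zs', 'Cc').
counts = Counter(
    unicodedata.category(chr(cp))
    for cp in range(sys.maxunicode + 1)
    if repr_escapes(chr(cp))
)

for category, n in sorted(counts.items()):
    print(category, n)
```

Every category that shows up should start with C ("Other") or Z ("Separator"), and the exact counts depend on the Unicode version bundled with the interpreter.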

I suppose the desire to be visually unambiguous is the reason for this: all those separator characters look "the same" (white space). That might be important if you are examining a paper log, but when examining it online, copying and pasting into a hex display tool or a Unicode decoder is certainly sufficient, and it avoids interrupting the flow of the text when the details of which separator was used are not important (which, in my opinion, is most of the non-paper time).

Question 3 can apparently be done in one of two ways: creating a subclass of str with a different repr (which disrupts existing code), or creating a subclass of pprint.PrettyPrinter with a format function that avoids calling repr for str and just includes the value directly.
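
The second approach might look roughly like the sketch below. The class name RawStrPrettyPrinter is hypothetical, and the quoting is deliberately naive: it wraps the raw value in single quotes without escaping anything, so strings containing quotes, backslashes, or newlines will no longer be valid Python literals.

```python
from pprint import PrettyPrinter

class RawStrPrettyPrinter(PrettyPrinter):
    """Hypothetical PrettyPrinter that shows str values without repr() escapes."""

    def format(self, object, context, maxlevels, level):
        if isinstance(object, str):
            # Naive quoting (an assumption of this sketch): nothing is escaped.
            # format() must return (text, is_readable, is_recursive).
            return "'" + object + "'", True, False
        # Everything else keeps the standard behaviour.
        return super().format(object, context, maxlevels, level)

pp = RawStrPrettyPrinter()
raw_text = pp.pformat("a\u200ab")                  # keeps the real U+200A character
default_text = PrettyPrinter().pformat("a\u200ab") # uses the \u200a escape

print(ascii(raw_text))
print(default_text)
```

This handles a string passed directly to the printer. In older Python 3 releases, strings nested inside containers may still go through repr(), and long strings that pprint wraps over several lines also take a different path, so treat this as a starting point rather than a complete solution.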
