Why is \x00 not converted to \0 by repr

Here is an interesting oddity about Python's repr:

The tab character \x09 is represented as \t. However, this convention does not apply to the null character.

Why is \x00 represented as \x00, rather than \0?

Sample code:

# Some facts to make sure we are on the same page
>>> '\x31' == '1'
True
>>> '\x09' == '\t'
True
>>> '\x00' == '\0'
True

>>> x = '\x31'
>>> y = '\x09'
>>> z = '\x00'
>>> x
'1' # As Expected
>>> y
'\t' # Okay
>>> z
'\x00' # Inconsistent - why is this not \0

The short answer: because \0 is not one of the escapes that repr() emits. String representations only use the single-character escapes \\, \n, \r and \t (plus \' when both " and ' characters are present), because there are explicit tests for those.

The rest is either considered printable and included as-is, or included using a longer escape sequence (depending on the Python version and string type, \xhh, \uhhhh and \Uhhhhhhhh, always using the shortest of the 3 options that'll fit the value).
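A quick way to see which escape style repr() picks is to run it over one character from each class. This is a minimal sketch for Python 3 str; the labels in the dictionary are names chosen here for illustration only.

```python
# One sample character per escaping class (Python 3 str).
samples = {
    "tab": "\t",                          # has a single-character escape
    "null": "\x00",                       # control character, no short escape
    "bell": "\x07",                       # control character, no short escape
    "latin-1 control": "\x85",            # non-printable, fits in two hex digits
    "BMP non-printable": "\u2028",        # needs four hex digits
    "astral non-printable": "\U000e0001", # needs eight hex digits
    "printable non-ASCII": "\u00e9",      # printable, emitted as-is
}

reprs = {name: repr(ch) for name, ch in samples.items()}
for name, text in reprs.items():
    print(f"{name:22} {text}")
```

Only the tab gets a short escape; every other non-printable character falls through to the shortest hex form that can hold its code point.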

Moreover, when generating the repr() output for a string consisting of a null byte followed by a digit from '0' through '7' (so bytes([0x00, 0x31]), or bytes([0x00, 0x32]), etc.), you can't just use \0 in the output without then also having to escape the following digit. '\01' is a single octal escape sequence, and not the same value as '\x001', which is two bytes. While forcing the output to always use three octal digits (e.g. '\0001') could be a workaround, it is simpler to stick to a standardised escape sequence format. Scanning ahead to see if the next character is an octal digit and switching output styles would just produce confusing output (imagine the question on SO: What is the difference between '\x001' and '\0Ol'?).
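The ambiguity is easy to reproduce. The sketch below compares the three literal spellings, then feeds a NUL-plus-digit string through a hypothetical naive_repr() helper that writes NUL as \0. That helper is an assumption made up for this illustration; it is not anything Python provides, and it only handles strings without quotes or backslashes.

```python
import ast

one_char = "\01"      # octal escape: a single character, U+0001
two_chars = "\x001"   # NUL followed by the digit '1'
padded = "\0001"      # three octal digits, then '1': also NUL + '1'

def naive_repr(s):
    """Hypothetical encoder that writes NUL as \\0; not what Python does."""
    return "'" + s.replace("\x00", "\\0") + "'"

# The naive output is '\01', which parses back as one character.
broken = ast.literal_eval(naive_repr(two_chars))
# The real repr() output is '\x001', which parses back unchanged.
round_trip = ast.literal_eval(repr(two_chars))

print(len(one_char), len(two_chars), len(padded))
print(broken == two_chars, round_trip == two_chars)
```

The naive encoder silently turns a two-character string into a different one-character string, while the \x00 form survives the round trip.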

The output is always consistent. Apart from the single quote (which can appear either as ' or \', depending on the presence of " characters), Python will always use the same escape sequence style for a given codepoint.

If you want to study the code that produces the output, you can find the Python 3 str.__repr__ implementation in the Objects/unicodeobject.c unicode_repr() function, which uses

/* Escape quotes and backslashes */
if ((ch == quote) || (ch == '\\')) {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, ch);
    continue;
}


/* Map special whitespace to '\t', \n', '\r' */
if (ch == '\t') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 't');
}
else if (ch == '\n') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 'n');
}
else if (ch == '\r') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 'r');
}

for single-character escapes, followed by additional checks for longer escapes below. For Python 2, a similar but shorter PyString_Repr() function does much the same thing.
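To make the order of those checks easier to follow, here is a rough Python model of the same decision chain. It is a simplified sketch, not a port of CPython's code: sketch_repr() is a name invented here, the quote selection is reduced to one line, and lone surrogates and other edge cases are not considered.

```python
def sketch_repr(s):
    """Simplified model of the branch order; not CPython's actual code."""
    # Use double quotes only when the string has ' but no ".
    quote = '"' if ("'" in s and '"' not in s) else "'"
    out = [quote]
    for ch in s:
        cp = ord(ch)
        if ch == quote or ch == "\\":
            out.append("\\" + ch)          # escape quotes and backslashes
        elif ch == "\t":
            out.append("\\t")
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif cp < 0x20 or cp == 0x7F:
            out.append(f"\\x{cp:02x}")     # other ASCII controls, NUL included
        elif cp < 0x7F or ch.isprintable():
            out.append(ch)                 # printable: copied as-is
        elif cp <= 0xFF:
            out.append(f"\\x{cp:02x}")
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")
        else:
            out.append(f"\\U{cp:08x}")
    out.append(quote)
    return "".join(out)

print(sketch_repr("a\tb\x00c"))
```

There is no branch for \0 anywhere in the chain, so NUL lands in the generic two-digit hex case along with every other control character that lacks a dedicated test.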

If it tried to use \0, then it would have to special-case any digits that immediately followed it, to prevent them from being interpreted as part of an octal escape. Always using \x00 is simpler and always correct.
