
Why does the size of this Python string change on a failed int conversion?

From the tweet here:

import sys
x = 'ñ'
print(sys.getsizeof(x))
int(x)  # raises ValueError
print(sys.getsizeof(x))

We get 74, then 77 bytes for the two getsizeof calls.

It looks like the failed int call adds 3 bytes to the object.

Some more examples from Twitter (you may need to restart Python to reset the size back to 74):

x = 'ñ'
y = 'ñ'
int(x)
print(sys.getsizeof(y))

77!

print(sys.getsizeof('ñ'))
int('ñ')
print(sys.getsizeof('ñ'))

74, then 77.

The code that converts strings to ints in CPython 3.6 requests a UTF-8 form of the string to work with:

buffer = PyUnicode_AsUTF8AndSize(asciidig, &buflen);

and the string creates the UTF-8 representation the first time it's requested and caches it on the string object:

if (PyUnicode_UTF8(unicode) == NULL) {
    assert(!PyUnicode_IS_COMPACT_ASCII(unicode));
    bytes = _PyUnicode_AsUTF8String(unicode, NULL);
    if (bytes == NULL)
        return NULL;
    _PyUnicode_UTF8(unicode) = PyObject_MALLOC(PyBytes_GET_SIZE(bytes) + 1);
    if (_PyUnicode_UTF8(unicode) == NULL) {
        PyErr_NoMemory();
        Py_DECREF(bytes);
        return NULL;
    }
    _PyUnicode_UTF8_LENGTH(unicode) = PyBytes_GET_SIZE(bytes);
    memcpy(_PyUnicode_UTF8(unicode),
              PyBytes_AS_STRING(bytes),
              _PyUnicode_UTF8_LENGTH(unicode) + 1);
    Py_DECREF(bytes);
}

The extra 3 bytes are the cached UTF-8 representation: 2 bytes for the UTF-8 encoding of 'ñ', plus 1 byte for the null terminator.
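As a rough check of that arithmetic, here is a minimal sketch. The helper names utf8_cache_cost and growth_after_failed_int are made up for illustration, and the string is built at runtime so it is a fresh object rather than a shared one. The measured growth is an implementation detail: it should match the expected cost on an interpreter that behaves like the CPython 3.6 code above, and may well be 0 on versions where int takes a different path.

```python
import sys

def utf8_cache_cost(s):
    # Encoded length plus one byte for the trailing NUL, mirroring
    # PyBytes_GET_SIZE(bytes) + 1 in the C code quoted above.
    return len(s.encode("utf-8")) + 1

def growth_after_failed_int(s):
    # Hypothetical helper: size difference of s across a failed int() call.
    before = sys.getsizeof(s)
    try:
        int(s)
    except ValueError:
        pass
    return sys.getsizeof(s) - before

# Built at runtime so we get a new string object, not a cached one.
fresh = "".join(["ñ", "andú"])

print(utf8_cache_cost("ñ"))  # 3: two bytes for 'ñ' plus the terminator
print(growth_after_failed_int(fresh), "measured,", utf8_cache_cost(fresh), "expected if cached")
```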


You might be wondering why the size doesn't change when the string is something like '40' or 'plain ascii text'. That's because if the string is in the "compact ascii" representation, Python doesn't create a separate UTF-8 representation. It returns the ASCII representation directly, which is already valid UTF-8:
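The ASCII case can be checked from Python as well. This is only a sketch (growth_after_int is a hypothetical helper, not anything from CPython): it confirms that the ASCII and UTF-8 encodings of these strings are byte-for-byte identical, and that the reported size does not move whether int succeeds or fails.

```python
import sys

def growth_after_int(s):
    # Hypothetical helper: size difference of s across an int() call,
    # whether the conversion succeeds or raises ValueError.
    before = sys.getsizeof(s)
    try:
        int(s)
    except ValueError:
        pass
    return sys.getsizeof(s) - before

for text in ("40", "plain ascii text"):
    # For pure ASCII text the two encodings are the same bytes,
    # so there is nothing extra to cache.
    same_bytes = text.encode("ascii") == text.encode("utf-8")
    print(repr(text), same_bytes, growth_after_int(text))
```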

#define PyUnicode_UTF8(op)                              \
    (assert(_PyUnicode_CHECK(op)),                      \
     assert(PyUnicode_IS_READY(op)),                    \
     PyUnicode_IS_COMPACT_ASCII(op) ?                   \
         ((char*)((PyASCIIObject*)(op) + 1)) :          \
         _PyUnicode_UTF8(op))

You also might wonder why the size doesn't change for something like '１'. That's U+FF11 FULLWIDTH DIGIT ONE, which int treats as equivalent to '1'. That's because one of the earlier steps in the string-to-int process is

asciidig = _PyUnicode_TransformDecimalAndSpaceToASCII(u);

which converts all whitespace characters to ' ' and converts all Unicode decimal digits to the corresponding ASCII digits. This conversion returns the original string if it doesn't end up changing anything, but when it does make changes, it creates a new string, and the new string is the one that gets a UTF-8 representation created.
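A small sketch of what that looks like from the Python side. The variable names are illustrative only. The original fullwidth string keeps its size, since any UTF-8 buffer would belong to the temporary transformed copy, not to the string we still hold.

```python
import sys
import unicodedata

fullwidth_one = "\uff11"
print(unicodedata.name(fullwidth_one))  # FULLWIDTH DIGIT ONE

before = sys.getsizeof(fullwidth_one)
value = int(fullwidth_one)              # parses as the integer 1
after = sys.getsizeof(fullwidth_one)
print(value, after - before)            # the original object is untouched

# Unicode whitespace and non-ASCII decimal digits are accepted too:
# EM SPACE, FULLWIDTH DIGIT FOUR, FULLWIDTH DIGIT TWO, EN SPACE.
print(int("\u2003\uff14\uff12\u2002"))
# ARABIC-INDIC DIGIT FOUR, ARABIC-INDIC DIGIT TWO.
print(int("\u0664\u0662"))
```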


As for the cases where calling int on one string looks like it affects another, those are actually the same string object. There are many conditions under which Python will reuse strings, all just as firmly in Weird Implementation Detail Land as everything we've discussed so far. For 'ñ', the reuse happens because this is a single-character string in the Latin-1 range ('\x00'-'\xff'), and the implementation stores and reuses those.
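You can see the sharing directly with an identity check. This is a sketch that relies on CPython behaviour rather than anything the language guarantees, and same_object is a made-up helper. The strings are created with chr at runtime so the compiler's constant handling doesn't get in the way.

```python
def same_object(code_point):
    # Hypothetical helper: build the same one-character string twice
    # at runtime and report whether both names refer to one object.
    a = chr(code_point)
    b = chr(code_point)
    return a is b

# U+00F1 ('ñ') is inside the Latin-1 range, so CPython hands back
# the shared single-character string both times.
print(same_object(0xF1))

# U+4E2D is outside that range; CPython typically creates a new
# object on each call, but that is not guaranteed either way.
print(same_object(0x4E2D))
```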
