Why does the size of this Python string change on a failed int conversion?
From the tweet here:
import sys
x = 'ñ'
print(sys.getsizeof(x))
try:
    int(x)  # raises ValueError
except ValueError:
    pass
print(sys.getsizeof(x))
We get 74, then 77 bytes for the two getsizeof calls.
It looks like the failed int call adds 3 bytes to the object.
Some more examples from Twitter (you may need to restart Python to reset the size back to 74):
import sys
x = 'ñ'
y = 'ñ'
try:
    int(x)  # raises ValueError
except ValueError:
    pass
print(sys.getsizeof(y))
77!
print(sys.getsizeof('ñ'))
try:
    int('ñ')  # raises ValueError
except ValueError:
    pass
print(sys.getsizeof('ñ'))
74, then 77.
The code that converts strings to ints in CPython 3.6 requests a UTF-8 form of the string to work with:
buffer = PyUnicode_AsUTF8AndSize(asciidig, &buflen);
and the string creates the UTF-8 representation the first time it's requested and caches it on the string object:
if (PyUnicode_UTF8(unicode) == NULL) {
    assert(!PyUnicode_IS_COMPACT_ASCII(unicode));
    bytes = _PyUnicode_AsUTF8String(unicode, NULL);
    if (bytes == NULL)
        return NULL;
    _PyUnicode_UTF8(unicode) = PyObject_MALLOC(PyBytes_GET_SIZE(bytes) + 1);
    if (_PyUnicode_UTF8(unicode) == NULL) {
        PyErr_NoMemory();
        Py_DECREF(bytes);
        return NULL;
    }
    _PyUnicode_UTF8_LENGTH(unicode) = PyBytes_GET_SIZE(bytes);
    memcpy(_PyUnicode_UTF8(unicode),
           PyBytes_AS_STRING(bytes),
           _PyUnicode_UTF8_LENGTH(unicode) + 1);
    Py_DECREF(bytes);
}
The extra 3 bytes are for the UTF-8 representation.
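As a rough check, the 3 bytes can be accounted for from Python: the UTF-8 encoding of 'ñ' is 2 bytes, and the C code above allocates one more byte for the terminating NUL. The helper name observed_growth is made up for this sketch, and whether any growth is actually observed depends on the CPython version and on whether the cache was already filled, since all of this is an implementation detail.

```python
import sys

def observed_growth(s):
    """Return how many bytes sys.getsizeof(s) grows across an int(s) attempt."""
    before = sys.getsizeof(s)
    try:
        int(s)
    except ValueError:
        pass  # the failure is expected; only the side effect matters here
    return sys.getsizeof(s) - before

utf8 = 'ñ'.encode('utf-8')       # b'\xc3\xb1'
expected_extra = len(utf8) + 1   # encoded bytes plus the trailing NUL
print(expected_extra)            # 3
print(observed_growth('ñ'))      # version-dependent; not guaranteed to be 3
```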
You might be wondering why the size doesn't change when the string is something like '40' or 'plain ascii text'. That's because if the string is in "compact ascii" representation, Python doesn't create a separate UTF-8 representation. It returns the ASCII representation directly, which is already valid UTF-8:
#define PyUnicode_UTF8(op)                              \
    (assert(_PyUnicode_CHECK(op)),                      \
     assert(PyUnicode_IS_READY(op)),                    \
     PyUnicode_IS_COMPACT_ASCII(op) ?                   \
         ((char*)((PyASCIIObject*)(op) + 1)) :          \
         _PyUnicode_UTF8(op))
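The claim that ASCII text is already valid UTF-8 is easy to check from Python. This sketch only compares encodings; it does not inspect the object's internal layout.

```python
samples = ('40', 'plain ascii text')

for s in samples:
    # For pure-ASCII text the ASCII and UTF-8 encodings are byte-for-byte
    # equal, so no separate UTF-8 copy is needed.
    assert s.encode('ascii') == s.encode('utf-8')

# A non-ASCII character such as 'ñ' takes 1 byte in Latin-1 but 2 in UTF-8,
# so its UTF-8 form has to be built separately.
latin1_len = len('ñ'.encode('latin-1'))
utf8_len = len('ñ'.encode('utf-8'))
print(latin1_len, utf8_len)  # 1 2
```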
You also might wonder why the size doesn't change for something like '１'. That's U+FF11 FULLWIDTH DIGIT ONE, which int treats as equivalent to '1'. The size stays the same because one of the earlier steps in the string-to-int process is
asciidig = _PyUnicode_TransformDecimalAndSpaceToASCII(u);
which converts all whitespace characters to ' ' and converts all Unicode decimal digits to the corresponding ASCII digits. This conversion returns the original string if it doesn't end up changing anything, but when it does make changes, it creates a new string, and that new string, not the original, is the one that gets a UTF-8 representation created.
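A rough Python approximation of that step may make the behavior easier to see. The function name to_ascii_digits is invented for this sketch and it is not CPython's implementation; it only mimics the two properties described above: digits and whitespace are mapped to ASCII, and the original object is returned when nothing changes.

```python
import unicodedata

def to_ascii_digits(s):
    """Map Unicode decimal digits and whitespace to ASCII (illustrative only)."""
    out = []
    changed = False
    for ch in s:
        if ord(ch) < 128:
            out.append(ch)                     # already ASCII, keep as-is
        elif ch.isspace():
            out.append(' ')                    # any Unicode whitespace -> ' '
            changed = True
        else:
            digit = unicodedata.decimal(ch, None)
            if digit is None:
                out.append(ch)                 # not a decimal digit, keep
            else:
                out.append(str(digit))         # e.g. U+FF11 -> '1'
                changed = True
    # Return the very same object when nothing was rewritten.
    return ''.join(out) if changed else s

fullwidth = '\uff11\uff12'                     # FULLWIDTH DIGIT ONE, TWO
print(to_ascii_digits(fullwidth))              # 12
print(int(fullwidth))                          # 12
plain = 'ñ'
print(to_ascii_digits(plain) is plain)         # True
```

Because a string like '１' is rewritten into a fresh ASCII string, any cached data ends up on that temporary copy rather than on the string you passed in.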
As for the cases where calling int on one string looks like it affects another, those are actually the same string object. There are many conditions under which Python will reuse strings, all just as firmly in Weird Implementation Detail Land as everything we've discussed so far. For 'ñ', the reuse happens because this is a single-character string in the Latin-1 range ('\x00'-'\xff'), and the implementation stores and reuses those.
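The sharing can be observed with an identity check. This is a sketch that assumes CPython; other implementations may create separate objects, so only the equality result should be relied on.

```python
import sys

a = chr(0xF1)    # 'ñ', inside the Latin-1 range
b = 'niño'[2]    # the same character obtained a different way

print(a == b)    # True on any implementation: equal by value
# On CPython the two names are expected to refer to one shared object,
# which is why a cache filled through one name shows up through the other.
print(sys.implementation.name, a is b)
```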