How are small sets stored in memory?
If we look at the resize behavior of sets with fewer than 50k elements:
>>> import sys
>>> s = set()
>>> seen = {}
>>> for i in range(50_000):
...     size = sys.getsizeof(s)
...     if size not in seen:
...         seen[size] = len(s)
...         print(f"{size=} {len(s)=}")
...     s.add(i)
...
size=216 len(s)=0
size=728 len(s)=5
size=2264 len(s)=19
size=8408 len(s)=77
size=32984 len(s)=307
size=131288 len(s)=1229
size=524504 len(s)=4915
size=2097368 len(s)=19661
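This pattern is consistent with the backing storage quadrupling in size once the set is 3/5 full, plus some presumably constant overhead for the PySetObject: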
>>> for i in range(9, 22, 2):
...     print(2**i + 216)
...
728
2264
8408
32984
131288
524504
2097368
A similar pattern continues for even larger sets, but the resize factor switches from quadrupling to doubling.
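The thresholds in the listing above can be reproduced with a small pure-Python model. This is only a sketch under stated assumptions, not CPython's actual code: the function name `simulate_set_growth` and its parameters are made up for illustration, the 216-byte base and 16-byte entry size are taken from the figures above (a 64-bit build), and the switch from quadrupling to doubling once the set holds more than 50,000 elements is an assumption about CPython's resize policy.

```python
def simulate_set_growth(n_items, min_slots=8, entry_size=16, base_size=216):
    """Return [(reported_size, len)] at the start and after each modelled resize.

    Hypothetical model of an insert-only set (no deletions, so fill == used).
    """
    slots = min_slots
    events = [(base_size, 0)]  # the initial 8-slot table is counted in base_size
    for used in range(1, n_items + 1):
        mask = slots - 1
        if used * 5 >= mask * 3:  # table is at least 3/5 full
            # Assumption: grow by 4x for small sets, 2x once used > 50_000.
            target = used * (2 if used > 50_000 else 4)
            new_slots = min_slots
            while new_slots <= target:  # smallest power of two above target
                new_slots *= 2
            slots = new_slots
            events.append((base_size + slots * entry_size, used))
    return events


for size, length in simulate_set_growth(50_000):
    print(f"size={size} len(s)={length}")
```

If the model is right, each printed pair should match one line of the measured output above, which suggests the reported size is simply the fixed object size plus 16 bytes per slot of the resized table.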
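The reported size for small sets is an outlier. Instead of 344 bytes, i.e. 16 * 8 + 216 (the storage array of a newly created empty set has 8 slots available until the first resize to 32 slots), sys.getsizeof reports only 216 bytes.
What am I missing? How are these small sets stored so that they use only 216 bytes instead of 344?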
In Python, a set object is represented by the following C structure.
typedef struct {
    PyObject_HEAD

    Py_ssize_t fill;            /* Number active and dummy entries*/
    Py_ssize_t used;            /* Number active entries */

    /* The table contains mask + 1 slots, and that's a power of 2.
     * We store the mask instead of the size because the mask is more
     * frequently needed.
     */
    Py_ssize_t mask;

    /* The table points to a fixed-size smalltable for small tables
     * or to additional malloc'ed memory for bigger tables.
     * The table pointer is never NULL which saves us from repeated
     * runtime null-tests.
     */
    setentry *table;
    Py_hash_t hash;             /* Only used by frozenset objects */
    Py_ssize_t finger;          /* Search finger for pop() */

    setentry smalltable[PySet_MINSIZE];
    PyObject *weakreflist;      /* List of weak references */
} PySetObject;
Now remember that getsizeof() calls the object's __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.
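A quick way to see that extra overhead from Python is to subtract the two values. This is a minimal sketch assuming CPython; the helper name `gc_overhead` is made up for illustration, and the exact number of bytes depends on the interpreter version and build.

```python
import sys


def gc_overhead(obj):
    """Bytes that sys.getsizeof() reports beyond obj.__sizeof__()."""
    return sys.getsizeof(obj) - obj.__sizeof__()


print(gc_overhead(set()))  # set is a container managed by the garbage collector
print(gc_overhead(1.5))    # float is not tracked by the garbage collector
```

For a type that is not managed by the garbage collector, such as float, the difference should be zero.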
OK, set implements __sizeof__.
static PyObject *
set_sizeof(PySetObject *so, PyObject *Py_UNUSED(ignored))
{
    Py_ssize_t res;

    res = _PyObject_SIZE(Py_TYPE(so));
    if (so->table != so->smalltable)
        res = res + (so->mask + 1) * sizeof(setentry);
    return PyLong_FromSsize_t(res);
}
Now let's inspect the line
res = _PyObject_SIZE(Py_TYPE(so));
_PyObject_SIZE is just a macro that expands to (typeobj)->tp_basicsize.
#define _PyObject_SIZE(typeobj) ( (typeobj)->tp_basicsize )
This code essentially reads the tp_basicsize slot to get the size in bytes of instances of the type, which in the case of set is just sizeof(PySetObject).
PyTypeObject PySet_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "set",                              /* tp_name */
    sizeof(PySetObject),                /* tp_basicsize */
    0,                                  /* tp_itemsize */
    /* Rest of the initializer skipped for brevity. */
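The tp_basicsize slot is also visible from Python as the `__basicsize__` attribute of a type, so the claim can be checked without touching the C source. A minimal sketch, assuming CPython with the set_sizeof implementation shown above:

```python
empty = set()

print(set.__basicsize__)   # tp_basicsize, i.e. sizeof(PySetObject)
print(empty.__sizeof__())  # same value while the set has not been resized
```

While the set still points at its built-in small table, `__sizeof__` should return exactly the basic size of the type.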
I modified the set_sizeof C function with the following changes.
static PyObject *
set_sizeof(PySetObject *so, PyObject *Py_UNUSED(ignored))
{
    Py_ssize_t res;

    /* sizeof yields size_t, which is what the %zu conversions below expect. */
    size_t py_object_head_size = sizeof(so->ob_base); /* PyObject_HEAD expands to PyObject ob_base; */
    size_t fill_size = sizeof(so->fill);
    size_t used_size = sizeof(so->used);
    size_t mask_size = sizeof(so->mask);
    size_t table_size = sizeof(so->table);
    size_t hash_size = sizeof(so->hash);
    size_t finger_size = sizeof(so->finger);
    size_t smalltable_size = sizeof(so->smalltable);
    size_t weakreflist_size = sizeof(so->weakreflist);
    size_t total_size = py_object_head_size + fill_size + used_size + mask_size
                        + table_size + hash_size + finger_size
                        + smalltable_size + weakreflist_size;
    int is_using_fixed_size_smalltables = so->table == so->smalltable;

    printf("| PySetObject Fields   | Size(bytes) |\n");
    printf("|------------------------------------|\n");
    printf("|    PyObject_HEAD     |    '%zu'     |\n", py_object_head_size);
    printf("|      fill            |     '%zu'     |\n", fill_size);
    printf("|      used            |     '%zu'     |\n", used_size);
    printf("|      mask            |     '%zu'     |\n", mask_size);
    printf("|      table           |     '%zu'     |\n", table_size);
    printf("|      hash            |     '%zu'     |\n", hash_size);
    printf("|      finger          |     '%zu'     |\n", finger_size);
    printf("|      smalltable      |    '%zu'    |\n", smalltable_size);
    printf("|      weakreflist     |     '%zu'     |\n", weakreflist_size);
    printf("-------------------------------------|\n");
    printf("|       Total          |    '%zu'    |\n", total_size);
    printf("\n");
    printf("Total size of PySetObject '%zu' bytes\n", sizeof(PySetObject));
    printf("Has set resized: '%s'\n", is_using_fixed_size_smalltables ? "No" : "Yes");
    if (!is_using_fixed_size_smalltables) {
        printf("Size of malloc'ed table: '%zu' bytes\n", (so->mask + 1) * sizeof(setentry));
    }

    res = _PyObject_SIZE(Py_TYPE(so));
    if (so->table != so->smalltable)
        res = res + (so->mask + 1) * sizeof(setentry);
    return PyLong_FromSsize_t(res);
}
Compiling and running with these changes gives me
>>> import sys
>>>
>>> set_ = set()
>>> sys.getsizeof(set_)
| PySetObject Fields | Size(bytes) |
|------------------------------------|
| PyObject_HEAD | '16' |
| fill | '8' |
| used | '8' |
| mask | '8' |
| table | '8' |
| hash | '8' |
| finger | '8' |
| smalltable | '128' |
| weakreflist | '8' |
-------------------------------------|
| Total | '200' |
Total size of PySetObject '200' bytes
Has set resized: 'No'
216
>>> set_.add(1)
>>> set_.add(2)
>>> set_.add(3)
>>> set_.add(4)
>>> set_.add(5)
>>> sys.getsizeof(set_)
| PySetObject Fields | Size(bytes) |
|------------------------------------|
| PyObject_HEAD | '16' |
| fill | '8' |
| used | '8' |
| mask | '8' |
| table | '8' |
| hash | '8' |
| finger | '8' |
| smalltable | '128' |
| weakreflist | '8' |
-------------------------------------|
| Total | '200' |
Total size of PySetObject '200' bytes
Has set resized: 'Yes'
Size of malloc'ed table: '512' bytes
728
The return values are 216 and 728 bytes because sys.getsizeof adds 16 bytes of GC overhead.
But the important thing to note here is this line.
| smalltable | '128' |
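For small tables (before the first resize), so->table is just a reference to the fixed-size (8-slot) so->smalltable, with no malloc'ed memory. sizeof(PySetObject) is therefore enough to get the size, because it already includes the storage: 128 bytes, i.e. 16 (the size of a setentry) * 8.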
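Now, what happens when a resize occurs? It constructs an entirely new table (malloc'ed) and uses that table instead of so->smalltable. This means that sets which have been resized also carry 128 bytes of dead weight (the size of the fixed-size small table) in addition to the size of the malloc'ed so->table.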
    else {
        newtable = PyMem_NEW(setentry, newsize);
        if (newtable == NULL) {
            PyErr_NoMemory();
            return -1;
        }
    }

    /* Make the set empty, using the new table. */
    assert(newtable != oldtable);
    memset(newtable, 0, sizeof(setentry) * newsize);
    so->mask = newsize - 1;
    so->table = newtable;
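The split between the fixed struct and the separately allocated table can also be observed from Python. A minimal sketch, assuming CPython with the set_sizeof logic quoted earlier; the helper name `malloced_table_bytes` is made up for illustration.

```python
def malloced_table_bytes(s):
    """Bytes __sizeof__ reports on top of the fixed PySetObject struct."""
    return s.__sizeof__() - set.__basicsize__


unresized = set()         # still uses the embedded small table
resized = set(range(100))  # too many items for 8 slots, so a table was malloc'ed

print(malloced_table_bytes(unresized))
print(malloced_table_bytes(resized))
```

The first value should be zero, while the second is the size of the malloc'ed table. In both cases the basic size already includes the embedded small table, which a resized set no longer uses.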