
How are small sets stored in memory?

If we look at the resizing behavior of sets with fewer than 50k elements:

>>> import sys
>>> s = set()
>>> seen = {}
>>> for i in range(50_000):
...     size = sys.getsizeof(s)
...     if size not in seen:
...         seen[size] = len(s)
...         print(f"{size=} {len(s)=}")
...     s.add(i)
... 
size=216 len(s)=0
size=728 len(s)=5
size=2264 len(s)=19
size=8408 len(s)=77
size=32984 len(s)=307
size=131288 len(s)=1229
size=524504 len(s)=4915
size=2097368 len(s)=19661

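This pattern is consistent with the backing storage quadrupling in size once the set is 3/5 full, plus some presumably constant overhead for the PySetObject: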

>>> for i in range(9, 22, 2):
...     print(2**i + 216)
... 
728
2264
8408
32984
131288
524504
2097368

A similar pattern continues for larger sets, but the resize factor switches from quadrupling to doubling.
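The resize points above can be reproduced with a small pure-Python model. This is only a sketch under stated assumptions: the trigger (fill * 5 >= mask * 3), the growth target (the smallest power of two greater than 4 * used, or 2 * used beyond 50,000 elements) and the 16-byte entry size are my reading of CPython's setobject.c on a 64-bit build, and the function name is made up for illustration. The 216-byte base is the value measured above.

```python
ENTRY_SIZE = 16   # assumed sizeof(setentry) on a 64-bit build
BASE_SIZE = 216   # size reported above for an empty set
MINSIZE = 8       # slots in a freshly created set


def predicted_resizes(n_items):
    """Return (reported size, len(s)) pairs at which the model resizes.

    Hypothetical helper: it models only insertions of distinct keys,
    so fill and used are always equal.
    """
    slots = MINSIZE
    events = []
    for used in range(1, n_items + 1):
        mask = slots - 1
        if used * 5 >= mask * 3:  # table is at least 3/5 full
            # Quadruple for small sets, double for large ones (assumed rule).
            minused = used * 2 if used > 50_000 else used * 4
            newsize = MINSIZE
            while newsize <= minused:
                newsize <<= 1
            slots = newsize
            events.append((BASE_SIZE + slots * ENTRY_SIZE, used))
    return events


for size, length in predicted_resizes(49_999):
    print(f"size={size} len(s)={length}")
```

Under these assumptions the model prints the same seven size/length pairs as the measurement loop, from size=728 at len(s)=5 up to size=2097368 at len(s)=19661. It does not explain the 216-byte starting value, which is the actual question here.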

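The reported size for small sets is an outlier. Instead of 344 bytes, i.e. 16 * 8 + 216 (the storage array of a newly created empty set has 8 slots available until the first resize to 32 slots), sys.getsizeof reports only 216 bytes.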

What am I missing? How are these small sets stored so that they use only 216 bytes instead of 344?

In Python, a set object is represented by the following C structure.

typedef struct {
    PyObject_HEAD

    Py_ssize_t fill;            /* Number active and dummy entries*/
    Py_ssize_t used;            /* Number active entries */

    /* The table contains mask + 1 slots, and that's a power of 2.
     * We store the mask instead of the size because the mask is more
     * frequently needed.
     */
    Py_ssize_t mask;

    /* The table points to a fixed-size smalltable for small tables
     * or to additional malloc'ed memory for bigger tables.
     * The table pointer is never NULL which saves us from repeated
     * runtime null-tests.
     */
    setentry *table;
    Py_hash_t hash;             /* Only used by frozenset objects */
    Py_ssize_t finger;          /* Search finger for pop() */

    setentry smalltable[PySet_MINSIZE];
    PyObject *weakreflist;      /* List of weak references */
} PySetObject;

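Now keep in mind that getsizeof() calls the object's __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.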
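A quick way to see that split from Python is to subtract the two values. This is a minimal sketch; gc_overhead is a name made up for illustration, and the exact number it prints depends on the interpreter build (the 16 bytes discussed in this answer apply to the 64-bit build used here).

```python
import sys


def gc_overhead(obj):
    """Bytes that sys.getsizeof adds on top of obj.__sizeof__()."""
    return sys.getsizeof(obj) - obj.__sizeof__()


small = set()
large = set(range(1000))

# Sets are tracked by the garbage collector, so both carry the same extra header.
print(gc_overhead(small), gc_overhead(large))

# Integers are not tracked by the garbage collector, so nothing is added.
print(gc_overhead(12345))
```

The overhead is the same for the empty and the large set, so it cannot account for the 128-byte gap the question asks about.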

OK, set implements __sizeof__:

static PyObject *
set_sizeof(PySetObject *so, PyObject *Py_UNUSED(ignored))
{
    Py_ssize_t res;

    res = _PyObject_SIZE(Py_TYPE(so));
    if (so->table != so->smalltable)
        res = res + (so->mask + 1) * sizeof(setentry);
    return PyLong_FromSsize_t(res);
}

Now let's inspect the line

res = _PyObject_SIZE(Py_TYPE(so));

_PyObject_SIZE is just a macro that expands to (typeobj)->tp_basicsize.

#define _PyObject_SIZE(typeobj) ( (typeobj)->tp_basicsize )

This code essentially reads the tp_basicsize slot to get the size in bytes of an instance of the type, which in the case of set is simply sizeof(PySetObject).

PyTypeObject PySet_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "set",                              /* tp_name */
    sizeof(PySetObject),                /* tp_basicsize */
    0,                                  /* tp_itemsize */
    /* Rest of the initializer skipped for brevity. */
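The same relationship can be checked from Python without touching the C source, since tp_basicsize is exposed as the type's __basicsize__ attribute. The sketch below assumes that a setentry is one pointer plus one Py_hash_t (which gives the 16 bytes quoted in this thread on a 64-bit build); table_slots is a hypothetical helper name.

```python
import struct

# Assumed layout of setentry: a PyObject pointer followed by a Py_hash_t.
ENTRY_SIZE = struct.calcsize("Pn")


def table_slots(s):
    """Number of separately allocated slots implied by s.__sizeof__()."""
    extra = s.__sizeof__() - set.__basicsize__
    return extra // ENTRY_SIZE


empty = set()
grown = set(range(100))

# For a set that has never resized, __sizeof__ is exactly tp_basicsize.
print(set.__basicsize__, empty.__sizeof__(), table_slots(empty))

# After a resize, the malloc'ed table is added on top of tp_basicsize.
print(grown.__sizeof__(), table_slots(grown))
```

For the empty set the two numbers agree and no extra slots are reported; for the grown set the difference should be a power-of-two number of slots.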

I modified the set_sizeof C function as follows.

static PyObject *
set_sizeof(PySetObject *so, PyObject *Py_UNUSED(ignored))
{
    Py_ssize_t res;

    /* size_t matches the %zu conversions used in the printf calls below. */
    size_t py_object_head_size = sizeof(so->ob_base); // Because PyObject_HEAD expands to PyObject ob_base;
    size_t fill_size = sizeof(so->fill);
    size_t used_size = sizeof(so->used);
    size_t mask_size = sizeof(so->mask);
    size_t table_size = sizeof(so->table);
    size_t hash_size = sizeof(so->hash);
    size_t finger_size = sizeof(so->finger);
    size_t smalltable_size = sizeof(so->smalltable);
    size_t weakreflist_size = sizeof(so->weakreflist);
    int is_using_fixed_size_smalltables = so->table == so->smalltable;

    printf("| PySetObject Fields   | Size(bytes) |\n");
    printf("|------------------------------------|\n");
    printf("|    PyObject_HEAD     |     '%zu'    |\n", py_object_head_size);
    printf("|      fill            |      '%zu'    |\n", fill_size);
    printf("|      used            |      '%zu'    |\n", used_size);
    printf("|      mask            |      '%zu'    |\n", mask_size);
    printf("|      table           |      '%zu'    |\n", table_size);
    printf("|      hash            |      '%zu'    |\n", hash_size);
    printf("|      finger          |      '%zu'    |\n", finger_size);
    printf("|    smalltable        |    '%zu'    |\n", smalltable_size); 
    printf("|    weakreflist       |      '%zu'    |\n", weakreflist_size);
    printf("-------------------------------------|\n");
    printf("|       Total          |    '%zu'    |\n", py_object_head_size+fill_size+used_size+mask_size+table_size+hash_size+finger_size+smalltable_size+weakreflist_size);
    printf("\n");
    printf("Total size of PySetObject '%zu' bytes\n", sizeof(PySetObject));
    printf("Has set resized: '%s'\n", is_using_fixed_size_smalltables ? "No": "Yes");
    if(!is_using_fixed_size_smalltables) {
        printf("Size of malloc'ed table: '%zu' bytes\n", (so->mask + 1) * sizeof(setentry));
    }

    res = _PyObject_SIZE(Py_TYPE(so));
    if (so->table != so->smalltable)
        res = res + (so->mask + 1) * sizeof(setentry);
    return PyLong_FromSsize_t(res);
}

Compiling and running with these changes gives me

>>> import sys
>>> 
>>> set_ = set()
>>> sys.getsizeof(set_)
| PySetObject Fields   | Size(bytes) |
|------------------------------------|
|    PyObject_HEAD     |     '16'    |
|      fill            |      '8'    |
|      used            |      '8'    |
|      mask            |      '8'    |
|      table           |      '8'    |
|      hash            |      '8'    |
|      finger          |      '8'    |
|    smalltable        |    '128'    |
|    weakreflist       |      '8'    |
-------------------------------------|
|       Total          |    '200'    |

Total size of PySetObject '200' bytes
Has set resized: 'No'
216
>>> set_.add(1)
>>> set_.add(2)
>>> set_.add(3)
>>> set_.add(4)
>>> set_.add(5)
>>> sys.getsizeof(set_)
| PySetObject Fields   | Size(bytes) |
|------------------------------------|
|    PyObject_HEAD     |     '16'    |
|      fill            |      '8'    |
|      used            |      '8'    |
|      mask            |      '8'    |
|      table           |      '8'    |
|      hash            |      '8'    |
|      finger          |      '8'    |
|    smalltable        |    '128'    |
|    weakreflist       |      '8'    |
-------------------------------------|
|       Total          |    '200'    |

Total size of PySetObject '200' bytes
Has set resized: 'Yes'
Size of malloc'ed table: '512' bytes
728

The returned values are 216 and 728 bytes because sys.getsizeof adds 16 bytes of GC overhead (200 + 16 = 216 and 200 + 512 + 16 = 728).

But the important thing to note here is this line.

|    smalltable        |    '128'    |

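For small tables (before the first resize), so->table is just a reference to the fixed-size (8 slots) so->smalltable, with no malloc'ed memory involved. sizeof(PySetObject) is therefore sufficient to get the size, because it already includes the storage: 128 bytes, i.e. 16 (the size of a setentry) * 8.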
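To make the embedded table visible, the struct can be mirrored with ctypes. This is a rough model, not a view into a live set: the field order follows the PySetObject definition shown earlier, PyObject_HEAD is assumed to be a reference count plus a type pointer (a regular, non-free-threaded build), and setentry is assumed to be a key pointer plus a hash. The class names are made up for illustration.

```python
import ctypes

PySet_MINSIZE = 8


class SetEntry(ctypes.Structure):
    # Assumed setentry layout: PyObject *key; Py_hash_t hash;
    _fields_ = [("key", ctypes.c_void_p), ("hash", ctypes.c_ssize_t)]


class PySetObjectModel(ctypes.Structure):
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),   # PyObject_HEAD (assumed layout)
        ("ob_type", ctypes.c_void_p),
        ("fill", ctypes.c_ssize_t),
        ("used", ctypes.c_ssize_t),
        ("mask", ctypes.c_ssize_t),
        ("table", ctypes.c_void_p),
        ("hash", ctypes.c_ssize_t),
        ("finger", ctypes.c_ssize_t),
        ("smalltable", SetEntry * PySet_MINSIZE),  # stored inside the object
        ("weakreflist", ctypes.c_void_p),
    ]


smalltable_bytes = ctypes.sizeof(SetEntry * PySet_MINSIZE)
model_bytes = ctypes.sizeof(PySetObjectModel)

print("smalltable:", smalltable_bytes)
print("model total:", model_bytes)
print("set.__basicsize__:", set.__basicsize__)
```

On a 64-bit build like the one used in this answer, the model should come out at 128 bytes for smalltable and 200 bytes in total, which is why an empty set needs no additional 16 * 8 bytes on top of the object itself.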

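Now, what happens when a resize occurs? It builds an entirely new (malloc'ed) table and uses that table instead of so->smalltable. This means that sets which have been resized also carry 128 bytes of dead weight (the size of the fixed-size small table) in addition to the size of the malloc'ed so->table.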

    else {
        newtable = PyMem_NEW(setentry, newsize);
        if (newtable == NULL) {
            PyErr_NoMemory();
            return -1;
        }
    }

    /* Make the set empty, using the new table. */
    assert(newtable != oldtable);
    memset(newtable, 0, sizeof(setentry) * newsize);
    so->mask = newsize - 1;
    so->table = newtable;

