Copying LMDB to another LMDB reduces file size

To shuffle the data in an already existing LMDB (trying to solve this problem), I retrieved the data, shuffled it, and wrote it back to a new LMDB. But when I checked the LMDB file size, it was reduced: the old LMDB file size is 3792896 bytes, but the new LMDB file size is 2314240 bytes.

Python code implemented:

import lmdb
from random import shuffle

lst_data = []

# Read every key/value pair out of the existing LMDB
env = lmdb.open('val_3', readonly=True)
with env.begin() as txn:
    cursor = txn.cursor()
    for key, value in cursor:
        lst_data.append([key, value])

shuffle(lst_data)

# Write the shuffled values to a new LMDB under fresh sequential keys
env1 = lmdb.open('mod_val_3')
with env1.begin(write=True) as txn1:
    for i, (_key, value) in enumerate(lst_data):
        str_id = '{:08}'.format(i)
        txn1.put(str_id.encode('ascii'), value)

The code is adapted from the reference here. Any suggestions or ideas would be helpful.

You can use mdb_stat to see the number of entries in the database; this should confirm whether your copy worked correctly.
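A minimal sketch of the same check from Python (assuming the val_3 and mod_val_3 paths from the question), using env.stat(), which reports the same entry count that mdb_stat prints:

import lmdb

# 'entries' is the number of records in the main database
for path in ('val_3', 'mod_val_3'):
    with lmdb.open(path, readonly=True) as env:
        print(path, env.stat()['entries'])

If both environments print the same count, the copy preserved every record.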

Newer versions of the lmdb Python wrapper (at least as of 1.3.0) include an environment copy method, which has a compact option that appears to do what @Ravi was trying to do. With compact=True, the copy writes only the live pages to the destination file and omits free pages, which is why the output can be much smaller. Use it like this (adjusting the lmdb.open parameters as necessary):

# Copy old database into new one with compacting
# Old database is ~34G from deleting 200k of 400k original records
with lmdb.open(
    "200k-split.lmdb",
    map_size=109951162777,
    subdir=False,
    meminit=False,
    map_async=True,
) as env:
    env.copy(path="200k-split-compacted.lmdb", compact=True)

You can then verify that the compacted file has the same number of records as the original file...

with lmdb.open(
    "200k-split.lmdb",
    map_size=109951162777,
    subdir=False,
    meminit=False,
    map_async=True,
) as env:
    print(env.stat())

# {'psize': 4096, 'depth': 3, 'branch_pages': 19, 
#  'leaf_pages': 2228, 'overflow_pages': 3600000, 'entries': 200000}
with lmdb.open(
    "200k-split-compacted.lmdb",
    map_size=109951162777,
    subdir=False,
    meminit=False,
    map_async=True,
) as env:
    print(env.stat())

# {'psize': 4096, 'depth': 3, 'branch_pages': 19,
#  'leaf_pages': 2228, 'overflow_pages': 3600000, 'entries': 200000}

...but a vastly smaller file size.

> ls -lah *.lmdb
-rw-rw-r-- 1 samueldy samueldy 14G Mar  2 03:31 200k-split-compacted.lmdb
-rw-r--r-- 1 samueldy samueldy 34G Mar  2 03:29 200k-split.lmdb
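As a rough sanity check, the compacted size is close to the live page count times the page size from the stat output above: (3600000 overflow + 2228 leaf + 19 branch) pages × 4096 bytes ≈ 13.7 GiB, which matches the 14G that ls reports. The remaining ~20G in the original file was presumably free pages left behind by the deleted records, which the compacting copy drops.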
