
Why does this usage of py-leveldb's WriteBatch cause a memory leak?

So I'm writing a Python script for indexing the Bitcoin blockchain by address, using a leveldb database (py-leveldb), and it keeps eating more and more memory until it crashes. I've replicated the behaviour in the code example below. When I run the code, memory usage keeps growing until the available RAM on my system is exhausted and the process is either killed or throws "std::bad_alloc".

Am I doing something wrong? I keep writing to the batch object and commit it every once in a while, but the memory usage keeps increasing even though I commit the data in the WriteBatch object. I even delete the WriteBatch object after committing it, so as far as I can see the batch object itself can't be what is causing the memory leak.

Is my code using WriteBatch in the wrong way, or is there a memory leak in py-leveldb?

The code requires py-leveldb to run; get it from here: https://pypi.python.org/pypi/leveldb

WARNING: RUNNING THIS CODE WILL EXHAUST YOUR MEMORY IF IT RUNS LONG ENOUGH. DO NOT RUN IT ON A CRITICAL SYSTEM. It also writes the database to a folder next to the script; on my system this folder contains about 1.5GB worth of database files by the time memory is exhausted (the process ends up consuming over 3GB of RAM).

Here's the code:

import leveldb, random, string

RANDOM_DB_NAME = "db-DetmREnTrKjd"
KEYLEN = 10
VALLEN = 30
num_keys = 1000
iterations = 100000000
commit_every = 1000000

leveldb.DestroyDB(RANDOM_DB_NAME)
db = leveldb.LevelDB(RANDOM_DB_NAME)

batch = leveldb.WriteBatch()

#generate a random list of keys to be used
key_list = [''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(KEYLEN)) for i in range(0,num_keys)]

for k in xrange(iterations):
    #select a random key from the key list
    key_index = random.randrange(0, num_keys)
    key = key_list[key_index]

    try:
        prev_val = db.Get(key)
    except KeyError:
        prev_val = ""

    random_val = ''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(VALLEN))
    #write the current random value plus any value that might already be there
    batch.Put(key, prev_val + random_val)

    if k % commit_every == 0:
        print "Comitting batch %d/%d..." % (k/commit_every, iterations/commit_every)
        db.Write(batch, sync=True)
        del batch
        batch = leveldb.WriteBatch()

db.Write(batch, sync=True)

You should really try Plyvel instead. See https://plyvel.readthedocs.org/. It has way cleaner code, more features, more speed, and a lot more tests. I've used it for bulk writing to quite large databases (20+ GB) without any issues.

(Full disclosure: I'm the author.)
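For the pattern in the question, the loop would look roughly like the sketch below. It is not a drop-in replacement: the database name, the smaller commit interval, and the reduced iteration count are my own choices. The calls themselves (plyvel.DB(), db.get(), db.write_batch(), batch.put(), batch.write()) are real Plyvel API; note that Plyvel returns None for a missing key instead of raising KeyError.

import plyvel, random, string

#hypothetical database name for this sketch
db = plyvel.DB("db-plyvel-example", create_if_missing=True)

alphabet = string.ascii_uppercase + string.digits
key_list = [''.join(random.choice(alphabet) for x in range(10)) for i in range(1000)]

commit_every = 10000
batch = db.write_batch(sync=True)
for k in xrange(1000000):
    key = key_list[random.randrange(0, len(key_list))]
    #get returns None for a missing key, so default to an empty string
    prev_val = db.get(key) or ""
    random_val = ''.join(random.choice(alphabet) for x in range(30))
    batch.put(key, prev_val + random_val)
    if k % commit_every == 0:
        batch.write()                      #flush the pending operations
        batch = db.write_batch(sync=True)  #start a fresh, empty batch

db.Write if False else None  #(no-op guard removed in real code)
batch.write()
db.close()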

I use http://code.google.com/p/leveldb-py/

I don't have enough information to take part in a Python leveldb driver bake-off, but I love the simplicity of leveldb-py. It is a single Python file using ctypes. I've used it to store about 10GB of documents across roughly 3 million keys and never noticed memory problems.

As for your actual problem: try working with the batch size, as sketched below.
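For example, committing every 10,000 operations instead of every 1,000,000 keeps the pending batch small. Here is a sketch of your loop with only the commit interval changed; the database name and iteration count are arbitrary, and the py-leveldb calls are the same ones your question already uses:

import leveldb, random, string

db = leveldb.LevelDB("db-small-batches")  #hypothetical path

alphabet = string.ascii_uppercase + string.digits
key_list = [''.join(random.choice(alphabet) for x in range(10)) for i in range(1000)]

commit_every = 10000  #much smaller than the question's 1000000
batch = leveldb.WriteBatch()
for k in xrange(1000000):
    key = key_list[random.randrange(0, len(key_list))]
    try:
        prev_val = db.Get(key)
    except KeyError:
        prev_val = ""
    batch.Put(key, prev_val + ''.join(random.choice(alphabet) for x in range(30)))
    if k % commit_every == 0:
        #a small batch is flushed often, so little data sits buffered in RAM
        db.Write(batch, sync=True)
        batch = leveldb.WriteBatch()

db.Write(batch, sync=True)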

Your code, using leveldb-py and doing a put for every key, worked fine on my system using less than 20MB of memory.
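I ran it with leveldb-py, but the same no-batch approach can be expressed with py-leveldb's own db.Put(). A rough sketch, with an arbitrary database name and iteration count:

import leveldb, random, string

db = leveldb.LevelDB("db-put-per-key")  #hypothetical path

alphabet = string.ascii_uppercase + string.digits
key_list = [''.join(random.choice(alphabet) for x in range(10)) for i in range(1000)]

for k in xrange(1000000):
    key = key_list[random.randrange(0, len(key_list))]
    try:
        prev_val = db.Get(key)
    except KeyError:
        prev_val = ""
    new_val = ''.join(random.choice(alphabet) for x in range(30))
    #write each key immediately instead of accumulating puts in a batch
    db.Put(key, prev_val + new_val)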

I take from http://ayende.com/blog/161412/reviewing-leveldb-part-iii-writebatch-isnt-what-you-think-it-is that there are quite a few memory copies going on under the hood in leveldb's WriteBatch. A WriteBatch buffers every Put in memory until it is written, and since your values grow by 30 bytes each time a key is touched, a batch of a million puts ends up holding an enormous amount of pending data.
