
Many dictionaries using massive amounts of RAM

I have a very simple Python script that creates (for test purposes) 35 million dictionary objects within a list. Each dictionary object contains two key/value pairs, e.g.

{'Name': 'Jordan', 'Age': 35}

The script simply takes a query on name and age, searches through the list of dictionaries, and returns a new list containing the indices of all matching dictionary entries.

However, as you can see below, an insane amount of memory is consumed. I presume I am making a very naive mistake somewhere.

[Screenshot of the code and Task Manager showing the RAM usage]

My code is as follows (it can also be viewed in the image above if that is more readable):

import sys

# Firstly, we will create 35 million records in memory; all will be the same apart from a small block of records

def search(key, value, data, age):
    print("Searching, please wait")
    # Create list to store returned PKs
    foundPKS = []
    for index in range(0, len(data)):
        if key in data[index] and 'Age' in data[index]:
            if data[index][key] == value and data[index]['Age'] >= age:
                foundPKS.append(index)
    results = foundPKS
    return results

def createdata():
    # Let's create our list for storing our dictionaries
    print("Creating database, please wait")
    dictList = []
    for index in range(0, 35000000):
        # Define dictionary
        record = {'Name': 'Jordan', 'Age': 25}
        if 24500123 <= index <= 24500200:
            record['Name'] = 'Chris'
            record['Age'] = 33
        # Add the dict to a list
        dictList.append(record)
    return dictList

datareturned = createdata()

keyname = input("For which key do you wish to search?")
valuename = input("Which values do you want to find?")
valueage = input("What is the minimum age?")

print("Full data set object size:" + str(sys.getsizeof(datareturned)))
results = search(keyname, valuename, datareturned, int(valueage))

if len(results) > 0:
    print(str(len(results)) + " found. Writing to results.txt")
    fo = open("results.txt", "w")
    for line in range(0, len(results)):
        fo.write(str(results[line]) + "\n")
    fo.close()

What is causing the massive consumption of RAM?

The overhead for a dict object is quite large. It depends on your Python version and your system architecture, but on Python 3.5, 64-bit:

In [21]: sys.getsizeof({})
Out[21]: 288

So, guesstimating (rounding the per-dict overhead down to ~250 bytes and the record count up to 36 million):

250*36e6*1e-9 == 9.0

So that is a lower bound on my RAM usage in gigabytes if I created that many dictionaries, not even factoring in the list!

Rather than use a dict as a record type, which isn't really what it is designed for, use a namedtuple.

And to see how this compares, let's set up an equivalent list of namedtuples:

In [22]: from collections import namedtuple

In [23]: Record = namedtuple("Record", "name age")

In [24]: records = [Record("john", 28) for _ in range(36000000)]

In [25]: getsizeof = sys.getsizeof

Consider:

In [31]: sum(getsizeof(record)+ getsizeof(record.name) + getsizeof(record.age)  for record in records)
Out[31]: 5220000000

In [32]: _ + getsizeof(records)
Out[32]: 5517842208

In [33]: _ * 1e-9
Out[33]: 5.517842208

So 5 GB is an upper limit that is quite conservative: for example, it assumes no small-int caching is going on, which for a record type holding ages will absolutely matter. On my own system, the Python process is registering 2.7 GB of memory usage (via top).

So, what is actually happening on my machine is better modeled by being conservative for strings (assuming unique strings with an average size of 10, so no string interning) but liberal for ints (assuming int caching takes care of our int objects for us, so we just have to worry about the 8-byte pointers):

In [35]: sum(getsizeof("0123456789") + 8  for record in records)
Out[35]: 2412000000

In [36]: _ + getsizeof(records)
Out[36]: 2709842208

In [37]: _ * 1e-9
Out[37]: 2.709842208

That is a good model for what I'm observing from top.

If you really want efficient storage

Now, if you really want to cram data into RAM, you are going to have to lose the flexibility of Python. You could use the array module in combination with struct to get C-like memory efficiency. An easier world to wade into might be numpy instead, which allows for similar things. For example:

In [1]: import numpy as np

In [2]: recordtype = np.dtype([('name', 'S20'),('age', np.uint8)])

In [3]: records = np.empty((36000000), dtype=recordtype)

In [4]: records.nbytes
Out[4]: 756000000

In [5]: records.nbytes*1e-9
Out[5]: 0.756

Note that we are now allowed to be quite compact. I can use an 8-bit unsigned integer (i.e. a single byte) to represent age. However, I am immediately faced with some inflexibility: if I want efficient storage of strings, I must define a maximum size. I've used 'S20', which is 20 characters. These are ASCII bytes, but a field of 20 ASCII characters may very well suffice for names.
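
As a side note, here is a minimal sketch (using a one-element array purely for illustration) of what that fixed width implies: values longer than 20 bytes are silently truncated on assignment.

import numpy as np

recordtype = np.dtype([('name', 'S20'), ('age', np.uint8)])
r = np.zeros(1, dtype=recordtype)

# Assigning a value longer than 20 bytes silently truncates it to the field width.
r['name'][0] = b"a-name-that-is-much-longer-than-twenty-bytes"
print(r['name'][0])   # b'a-name-that-is-much-' (the first 20 bytes only)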

Now, numpy gives you a lot of fast methods wrapping C-compiled code. So, just to play around with it, let's fill our records with some toy data. Names will simply be strings of digits from a simple count, and ages will be selected from a normal distribution with a mean of 50 and a standard deviation of 10.

In [8]: for i in range(1, 36000000+1):
   ...:     records['name'][i - 1] = b"%08d" % i
   ...:

In [9]: import random
   ...: for i in range(36000000):
   ...:     records['age'][i] = max(0, int(random.normalvariate(50, 10)))
   ...:

Now, we can use numpy to query our records. For example, if you want the indices of your records given some condition, use np.where:

In [10]: np.where(records['age'] > 70)
Out[10]: (array([      58,      146,      192, ..., 35999635, 35999768, 35999927]),)

In [11]: idx = np.where(records['age'] > 70)[0]

In [12]: len(idx)
Out[12]: 643403

So there are 643403 records that have an age > 70. Now, let's try 100:

In [13]: idx = np.where(records['age'] > 100)[0]

In [14]: len(idx)
Out[14]: 9

In [15]: idx
Out[15]:
array([ 2315458,  5088296,  5161049,  7079762, 15574072, 17995993,
       25665975, 26724665, 28322943])

In [16]: records[idx]
Out[16]:
array([(b'02315459', 101), (b'05088297', 102), (b'05161050', 101),
       (b'07079763', 104), (b'15574073', 101), (b'17995994', 102),
       (b'25665976', 101), (b'26724666', 102), (b'28322944', 101)],
      dtype=[('name', 'S20'), ('age', 'u1')])
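
Tying this back to the original question, which filtered on a name and a minimum age, the two conditions can simply be combined element-wise with &. A small sketch against the same toy records (the specific name value here is just an illustration):

# Boolean masks for each condition, combined element-wise; np.where gives the matching indices.
idx = np.where((records['name'] == b'00000042') & (records['age'] >= 30))[0]
print(records[idx])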

Of course, one major inflexibility is that numpy arrays are fixed-size, and resizing operations are expensive. Now, you could maybe wrap a numpy.array in some class and it would act as an efficient backbone, but at that point you might as well use a real database. Lucky for you, Python comes with sqlite (the sqlite3 module) in the standard library.
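
For completeness, here is a minimal sqlite3 sketch of the same kind of query (the table name, column names, and database filename are my own illustrative choices, not anything from the original post):

import sqlite3

# The data now lives on disk instead of in RAM ("people.db" is just an illustrative filename).
conn = sqlite3.connect("people.db")
conn.execute("CREATE TABLE IF NOT EXISTS people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# Bulk-insert with executemany; a generator avoids ever building the 35-million-element list.
rows = (("Chris", 33) if 24500123 <= i <= 24500200 else ("Jordan", 25)
        for i in range(35000000))
conn.executemany("INSERT INTO people (name, age) VALUES (?, ?)", rows)
conn.commit()

# An index on the queried columns keeps lookups fast.
conn.execute("CREATE INDEX IF NOT EXISTS idx_name_age ON people (name, age)")

matches = conn.execute(
    "SELECT id FROM people WHERE name = ? AND age >= ?", ("Chris", 30)
).fetchall()
print(len(matches), "matching rows")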

Let's look at this:

>>> import sys 
>>> sys.getsizeof({'Name': 'Jordan', 'Age': 25}) * 35000000
10080000000

So ~10 GB. Python is doing exactly what you are asking it to do.

You need to split this up into chunks and check them sequentially. Try something like the sketch below as a starting point.
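
A minimal sketch of the chunked approach, reusing the record layout from the question (the chunk size and helper names are my own illustrative choices):

def generate_chunks(total=35000000, chunk_size=1000000):
    """Yield (start_index, records) chunks so only one chunk is resident in memory."""
    for start in range(0, total, chunk_size):
        chunk = []
        for i in range(start, min(start + chunk_size, total)):
            record = {'Name': 'Jordan', 'Age': 25}
            if 24500123 <= i <= 24500200:
                record = {'Name': 'Chris', 'Age': 33}
            chunk.append(record)
        yield start, chunk

def search_chunked(key, value, age):
    found = []
    for start, chunk in generate_chunks():
        for offset, record in enumerate(chunk):
            if record.get(key) == value and record.get('Age', 0) >= age:
                found.append(start + offset)
        # the chunk goes out of scope here, so its memory can be reclaimed
    return found

print(len(search_chunked('Name', 'Chris', 30)), "matches found")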

... 35 million dictionary objects within a list. Each dictionary object contains two key/value pairs. eg. {'Name': 'Jordan', 'Age': 35}

You're right that this manner of storage has considerable overhead.

The Flyweight Design Pattern suggests that the solution involves factoring out the commonalities. Here are two ideas for alternative storage of the same data with better space utilization.

You can use __slots__ to save space on instances of classes (this suppresses the creation of per-instance dictionaries):

class Person(object):
    __slots__ = ['Name', 'Age']    # suppresses the per-instance __dict__

    def __init__(self, name, age):
        self.Name = name
        self.Age = age

s = [Person('Jordan', 35), Person('Martin', 31), Person('Mary', 33)]
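
As a rough check (exact byte counts vary with the Python version and build), sys.getsizeof shows that a slotted instance is far smaller than a per-record dict; as with dicts, the reported size excludes the referenced name and age objects themselves:

import sys

p = Person('Jordan', 35)
print(sys.getsizeof(p))                               # a few dozen bytes per instance
print(sys.getsizeof({'Name': 'Jordan', 'Age': 35}))   # a couple of hundred bytes per dict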

It is even more space-efficient to use dense data structures like a pair of parallel lists:

s_name = ['Jordan', 'Martin', 'Mary']
s_age = [35, 31, 33]
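
Querying the parallel lists is just a matter of walking them in lockstep; a small sketch (the function name is mine):

def find_indices(s_name, s_age, name, min_age):
    # zip pairs the two lists element by element; enumerate supplies the shared index.
    return [i for i, (n, a) in enumerate(zip(s_name, s_age))
            if n == name and a >= min_age]

print(find_indices(s_name, s_age, 'Mary', 30))   # -> [2]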

If there are duplicates in the data, you can save even more space by interning the values:

s_name = map(intern, s_name)

Or in Python 3:

s_name = list(map(sys.intern, s_name))
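
To see why this helps: interning makes equal strings share a single object, so a list of millions of duplicate names only pays for the 8-byte references. A quick check (the variable names are just illustrative):

import sys

# Strings built at runtime are distinct objects even when they compare equal...
raw = ["".join(["Jor", "dan"]) for _ in range(3)]
print(len({id(s) for s in raw}))        # 3 separate 'Jordan' objects

# ...whereas interning collapses them onto one shared object.
interned = [sys.intern(s) for s in raw]
print(len({id(s) for s in interned}))   # 1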


 