Many dictionaries using massive amounts of RAM

I have a very simple Python script that creates (for test purposes) 35 million dictionary objects within a list. Each dictionary object contains two key/value pairs, e.g.

{'Name': 'Jordan', 'Age': 35}

The script simply takes a query on name and age, searches through the list of dictionaries, and returns a new list containing the indices of all matching dictionary entries.

However, as you can see below, an insane amount of memory is consumed. I presume I am making a very naive mistake somewhere.

[Screenshot of the code and Task Manager showing RAM usage]

My code is as follows (it can also be viewed in the image, if that is more readable):

import sys

# Firstly, we will create 35 million records in memory, all will be the same apart from one

def search(key, value, data, age):
    print("Searching, please wait")
    # Create list to store returned PKs
    foundPKS = []
    for index in range(0, len(data)):
        if key in data[index] and 'Age' in data[index]:
            if data[index][key] == value and data[index]['Age'] >= age:
                foundPKS.append(index)
    results = foundPKS
    return results

def createdata():
    # Let's create our list for storing our dictionaries
    print("Creating database, please wait")
    dictList = []
    for index in range(0, 35000000):
        # Define dictionary
        record = {'Name': 'Jordan', 'Age': 25}
        if 24500123 <= index <= 24500200:
            record['Name'] = 'Chris'
            record['Age'] = 33
        # Add the dict to a list
        dictList.append(record)
    return dictList

datareturned = createdata()

keyname = input("For which key do you wish to search?")
valuename = input("Which values do you want to find?")
valueage = input("What is the minimum age?")

print("Full data set object size:" + str(sys.getsizeof(datareturned)))
results = search(keyname, valuename, datareturned, int(valueage))

if len(results) > 0:
    print(str(len(results)) + " found. Writing to results.txt")
    fo = open("results.txt", "w")
    for line in range(0, len(results)):
        fo.write(str(results[line]) + "\n")
    fo.close()

What is causing the massive consumption of RAM?

The overhead of a dict object is quite large. It depends on your Python version and your system architecture, but on 64-bit Python 3.5:

In [21]: sys.getsizeof({})
Out[21]: 288

So guesstimating:

250*36e6*1e-9 == 9.0

So that is a lower bound on my RAM usage, in gigabytes, if I created that many dictionaries, not even factoring in the list!

Rather than use a dict as a record type, which isn't really its use case, use a namedtuple (from the collections module).

And to get a view of how this compares, let's set up an equivalent list of tuples:

In [23]: Record = namedtuple("Record", "name age")

In [24]: records = [Record("john", 28) for _ in range(36000000)]

In [25]: getsizeof = sys.getsizeof

Consider:

In [31]: sum(getsizeof(record)+ getsizeof(record.name) + getsizeof(record.age)  for record in records)
Out[31]: 5220000000

In [32]: _ + getsizeof(records)
Out[32]: 5517842208

In [33]: _ * 1e-9
Out[33]: 5.517842208

So 5 gigs is an upper limit that is quite conservative. For example, it assumes that there is no small-int caching going on, which, for a record field like age, will totally matter. On my own system, the Python process is registering 2.7 gigs of memory usage (via top).
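To see the small-int caching referred to above, here is a quick illustration (a CPython implementation detail, shown here as an aside and not part of the original session):

>>> import sys
>>> a = 25
>>> b = 25
>>> a is b                 # CPython reuses one shared object for small ints
True
>>> sys.getsizeof(25)      # so the ~28 bytes per int object are only paid once
28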

So, what is actually going on on my machine is better modeled by being conservative for strings (assuming unique strings with an average size of 10, so no string interning) but liberal for ints (assuming int caching is taking care of our int objects for us, so we just have to worry about the 8-byte pointers):

In [35]: sum(getsizeof("0123456789") + 8  for record in records)
Out[35]: 2412000000

In [36]: _ + getsizeof(records)
Out[36]: 2709842208

In [37]: _ * 1e-9
Out[37]: 2.709842208

Which is a good model for what I'm observing from top.

If you really want efficient storage

Now, if you really want to cram data into RAM, you are going to have to lose some of the flexibility of Python. You could use the array module in combination with struct to get C-like memory efficiency. An easier world to wade into might be numpy instead, which allows for similar things. For example:

In [1]: import numpy as np

In [2]: recordtype = np.dtype([('name', 'S20'),('age', np.uint8)])

In [3]: records = np.empty((36000000), dtype=recordtype)

In [4]: records.nbytes
Out[4]: 756000000

In [5]: records.nbytes*1e-9
Out[5]: 0.756

Note, we are now allowed to be quite compact. I can use 8-bit unsigned integers (i.e. a single byte) to represent age. However, I am immediately faced with some inflexibility: if I want efficient storage of strings, I must define a maximum size. I've used 'S20', which is 20 characters. These are ASCII bytes, but a field of 20 ASCII characters might very well suffice for names.

Now, numpy gives you a lot of fast methods wrapping C-compiled code. So, just to play around with it, let's fill our records with some toy data. Names will simply be strings of digits from a simple count, and ages will be drawn from a normal distribution with a mean of 50 and a standard deviation of 10.

In [8]: for i in range(1, 36000000+1):
   ...:     records['name'][i - 1] = b"%08d" % i
   ...:

In [9]: import random
   ...: for i in range(36000000):
   ...:     records['age'][i] = max(0, int(random.normalvariate(50, 10)))
   ...:

Now, we can use numpy to query our records. For example, if you want the indices of the records that satisfy some condition, use np.where:

In [10]: np.where(records['age'] > 70)
Out[10]: (array([      58,      146,      192, ..., 35999635, 35999768, 35999927]),)

In [11]: idx = np.where(records['age'] > 70)[0]

In [12]: len(idx)
Out[12]: 643403

So there are 643403 records with an age > 70. Now, let's try 100:

In [13]: idx = np.where(records['age'] > 100)[0]

In [14]: len(idx)
Out[14]: 9

In [15]: idx
Out[15]:
array([ 2315458,  5088296,  5161049,  7079762, 15574072, 17995993,
       25665975, 26724665, 28322943])

In [16]: records[idx]
Out[16]:
array([(b'02315459', 101), (b'05088297', 102), (b'05161050', 101),
       (b'07079763', 104), (b'15574073', 101), (b'17995994', 102),
       (b'25665976', 101), (b'26724666', 102), (b'28322944', 101)],
      dtype=[('name', 'S20'), ('age', 'u1')])

Of course, one major inflexibility is that numpy arrays are fixed in size. Resizing operations are expensive. Now, you could maybe wrap a numpy.array in some class so that it acts as an efficient backbone, but at that point you might as well use a real database. Lucky for you, Python comes with sqlite.
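A minimal sketch of that sqlite route, using the standard-library sqlite3 module (the table layout, toy data, and query here are illustrative assumptions, not part of the original answer):

import sqlite3

conn = sqlite3.connect(":memory:")   # or a file path for on-disk storage
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
# Load a small toy batch; a real load would stream rows in via executemany.
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Jordan", 25), ("Chris", 33), ("Mary", 31)])
conn.commit()
# Let the database do the filtering instead of scanning a Python list.
rows = conn.execute("SELECT rowid FROM people WHERE name = ? AND age >= ?",
                    ("Chris", 30)).fetchall()
print(rows)   # [(2,)]

With an index on (name, age), the lookup also stops being a linear scan.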

Let's look at this:

>>> import sys 
>>> sys.getsizeof({'Name': 'Jordan', 'Age': 25}) * 35000000
10080000000

So ~10 GB. Python is doing exactly what you are asking it to do.

You need to split this up into chunks and check them sequentially. Try this as a starting point.
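The starting-point snippet itself is not reproduced here, so the following is just one possible sketch of the idea: generate the records lazily and test them as they stream past, so that only one record is ever held in memory (the query values are hard-coded purely for illustration):

def generate_records():
    # Yield (index, record) pairs one at a time instead of
    # materialising 35 million dicts in a single list.
    for index in range(35000000):
        record = {'Name': 'Jordan', 'Age': 25}
        if 24500123 <= index <= 24500200:
            record['Name'] = 'Chris'
            record['Age'] = 33
        yield index, record

matches = [index for index, record in generate_records()
           if record.get('Name') == 'Chris' and record['Age'] >= 30]
print(len(matches))   # 78 matching records with this toy data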

... 35 million dictionary objects within a list. Each dictionary object contains two key/value pairs, e.g. {'Name': 'Jordan', 'Age': 35}

You're right that this manner of storage has considerable overhead.

The Flyweight Design Pattern suggests that the solution involves factoring out the commonalities. Here are two ideas for alternative storage of the same data with better space utilization.

You can use __slots__ to save space on instances of classes (this suppresses the creation of per-instance dictionaries):

class Person(object):
    __slots__ = ['Name', 'Age']    # no per-instance __dict__ is created

    def __init__(self, name, age):
        self.Name = name
        self.Age = age

s = [Person('Jordan', 35), Person('Martin', 31), Person('Mary', 33)]

It is even more space-efficient to use dense data structures like a pair of parallel lists:

s_name = ['Jordan', 'Martin', 'Mary']
s_age = [35, 31, 33]
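With that layout, the name/age query from the question becomes a single pass over the zipped lists (a sketch, not part of the original answer):

matching = [i for i, (name, age) in enumerate(zip(s_name, s_age))
            if name == 'Jordan' and age >= 30]
# -> [0] for the three sample records above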

If there are duplicates in the data, you save even more space by interning the values:

s_name = map(intern, s_name)

Or in Python 3:

s_name = list(map(sys.intern, s_name))
