
Managing dictionary memory size in Python

I have a program which imports a text file through standard input and aggregates the lines into a dictionary. However, the input file is very large (on the order of 1 TB) and I won't have enough space to store the whole dictionary in memory (I'm running on a machine with 64 GB of RAM). Currently I've got a very simple clause which outputs the dictionary once it has reached a certain length (100 in this case) and clears the memory. The output can then be aggregated at a later point.

So I want to output the dictionary once memory is full. What is the best way of managing this? Is there a function which gives me the current memory usage? Is it costly to keep checking it? Am I using the right tactic?

import sys
X_dic = dict()

# Used to print the dictionary in required format
def print_dic(dic):
    for key, value in dic.iteritems():
        print "{0}\t{1}".format(key, value)

for line in sys.stdin:
    value, key = line.strip().split(",")      

    if key not in X_dic:
        X_dic[key] = []                            

    X_dic[key].append(value)

    # Limit size of dic.
    if len(X_dic) == 100:
        print_dic(X_dic)              # Print and clear dictionary
        X_dic = dict()


# Now output
print_dic(X_dic)

The resource module provides some information on how much resources (memory, etc.) you are using. See here for a nice little usage example.
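A minimal sketch of querying the process's memory usage with resource (Unix only); note that ru_maxrss is the peak resident set size so far, reported in kilobytes on Linux (bytes on macOS), so it does not decrease after you clear the dictionary. The limit value and helper name here are just illustrative:

import resource

def peak_memory_mb():
    # ru_maxrss: peak resident set size of this process (kB on Linux)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

MEMORY_LIMIT_MB = 50 * 1024   # e.g. stay well below the 64 GB of RAM

# Inside the input loop one could then check, for example:
# if peak_memory_mb() > MEMORY_LIMIT_MB:
#     print_dic(X_dic)
#     X_dic = dict()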

On a Linux system (I don't know what you are on) you can watch the contents of the file /proc/meminfo. As part of the proc file system it is updated automatically.
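A minimal sketch of parsing /proc/meminfo; the MemAvailable field (available in Linux kernels 3.14 and later) is an estimate of memory available for new allocations, reported in kB. The 4 GB threshold below is only an illustrative choice:

def meminfo_kb():
    # Parse /proc/meminfo into a {field_name: value_in_kB} dictionary
    info = {}
    with open("/proc/meminfo") as f:
        for entry in f:
            name, value = entry.split(":", 1)
            info[name] = int(value.split()[0])
    return info

# Example: dump and clear the dictionary when less than ~4 GB is available.
# if meminfo_kb().get("MemAvailable", 0) < 4 * 1024 * 1024:
#     print_dic(X_dic)
#     X_dic = dict()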

But I actually object to the whole strategy of monitoring the memory and using up as much of it as possible. I'd rather propose to dump the dictionary regularly (after 1M entries have been added or so). It will probably speed up your program to keep the dict smaller than it could be; it will presumably also have advantages for later processing if all dumps are of similar size. If you dump a huge dict which fit into your whole memory while nothing else was using memory, then you will later have trouble re-reading that dict if something else is currently using some of your memory. You would then have to create a situation in which nothing else is using memory (e.g. a reboot or similar). Not very convenient.
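A minimal sketch of the fixed-size-chunk approach described above, counting appended values instead of monitoring memory (print_dic as defined in the question; the 1M threshold is just an illustrative choice):

import sys

X_dic = dict()
entries = 0
CHUNK_SIZE = 1000000   # dump after roughly this many values have been added

for line in sys.stdin:
    value, key = line.strip().split(",")
    X_dic.setdefault(key, []).append(value)
    entries += 1

    # Dump and clear at a fixed entry count, so all chunks are of similar size
    if entries >= CHUNK_SIZE:
        print_dic(X_dic)
        X_dic = dict()
        entries = 0

# Output whatever is left
print_dic(X_dic)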
