
Best Way to handle Large List of Dictionaries in Python

I am performing a statistical test that uses 10,000 permutations as a null distribution.

Each of the permutations is a 10,000-key dictionary. Each key is a gene, each value is a set of patients corresponding to the gene. This dictionary is programmatically generated, and can be written to and read in from a file.
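
For concreteness, here is a minimal sketch of what one permutation and its on-disk form might look like; the gene and patient names are made up, and JSON-lines (one permutation per line) is just one possible format:

import json

# One permutation: each gene maps to the set of patients carrying it
# (illustrative names, not real data; a real permutation has ~10,000 genes).
permutation = {
    "TP53": {"patient_01", "patient_07"},
    "BRCA1": {"patient_03"},
}

# Sets are not JSON-serializable, so convert them to lists and append
# one permutation per line.
with open("my_genes.dat", "a") as f:
    serializable = {gene: sorted(patients) for gene, patients in permutation.items()}
    f.write(json.dumps(serializable) + "\n")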

I want to be able to iterate over these permutations to perform my statistical test; however, keeping this large list in memory is slowing down my performance.

Is there a way to keep these dictionaries on disk and yield the permutations as I iterate over them?

Thank you!

This is a general computing problem; you want the speed of memory-stored data but don't have enough memory. You have at least these options:

  • Buy more RAM (obviously)
  • Let the process swap. This leaves it to the OS to decide which data to store on disk and which to keep in memory
  • Don't load everything into memory at once

Since you are iterating over your dataset, one solution could be to load the data lazily:

def get_data(filename):
    # Lazily yield one serialized permutation (one line) at a time,
    # so only a single line is held in memory.
    with open(filename) as f:
        while True:
            line = f.readline()
            if not line:  # an empty string signals end of file
                break
            yield line

for item in get_data('my_genes.dat'):
    gather_statistics(deserialize(item))
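
deserialize is left abstract here; a minimal sketch, assuming the JSON-lines format from the question, might be:

import json

def deserialize(line):
    # Hypothetical helper: rebuild the gene -> set-of-patients mapping
    # from one JSON-encoded line.
    return {gene: set(patients) for gene, patients in json.loads(line).items()}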

A variant is to split your data into multiple files or store your data in a database so you can batch process your data n items at a time, as sketched below.
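
As a sketch of that batching idea, one could group the lazy stream from get_data above into fixed-size chunks with itertools.islice; the batch size of 100 is arbitrary:

from itertools import islice

def get_batches(filename, n):
    # Hypothetical helper: group the lazy line stream into lists of at
    # most n permutations, so only n dictionaries are in memory at once.
    stream = get_data(filename)
    while True:
        batch = list(islice(stream, n))
        if not batch:
            break
        yield batch

for batch in get_batches('my_genes.dat', 100):
    for item in batch:
        gather_statistics(deserialize(item))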
