Best Way to Handle a Large List of Dictionaries in Python
I am performing a statistical test that uses 10,000 permutations as a null distribution. Each permutation is a dictionary with 10,000 keys: each key is a gene, and each value is the set of patients corresponding to that gene. The dictionaries are generated programmatically, and can be written to and read back from a file.

I want to be able to iterate over these permutations to perform my statistical test; however, keeping this large list in memory is slowing down my performance.

Is there a way to keep these dictionaries in storage and yield each permutation as I iterate over them?

Thank you!
This is a general computing problem: you want the speed of in-memory data but don't have enough memory. You have at least these options:

Since you are iterating over your dataset, one solution is to load the data lazily:
    def get_data(filename):
        # Read the file one line at a time, so only the current
        # permutation is ever held in memory.
        with open(filename) as f:
            while True:
                line = f.readline()
                if line:
                    yield line
                else:
                    break
    for item in get_data('my_genes.dat'):
        gather_statistics(deserialize(item))
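The deserialize call above is left abstract. A minimal sketch of what it could look like, assuming each permutation is stored as one JSON object per line, with each patient set stored as a list (JSON has no set type); the serialize helper and the all_permutations name are illustrative, not part of the original:

    import json

    def serialize(permutation):
        # JSON cannot encode sets, so store each patient set as a sorted list.
        return json.dumps({gene: sorted(patients)
                           for gene, patients in permutation.items()})

    def deserialize(line):
        # Rebuild the gene -> set-of-patients mapping from one JSON line.
        return {gene: set(patients)
                for gene, patients in json.loads(line).items()}

    # Writing the permutations out, one JSON object per line:
    with open('my_genes.dat', 'w') as f:
        for permutation in all_permutations:  # all_permutations is hypothetical
            f.write(serialize(permutation) + '\n')

With this format, get_data and deserialize together stream one 10,000-key dictionary at a time instead of holding all 10,000 of them at once.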
A variant is to split your data into multiple files, or store your data in a database, so you can batch-process your data n items at a time; a disk-backed mapping, as sketched below, is a lightweight version of the database route.
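For the database variant, one standard-library option is shelve, which keeps a dict-like object on disk and pickles values on demand; a minimal sketch, where the 'permutations.db' filename, the 'perm-i' key scheme, and the generate_permutations helper are assumptions for illustration:

    import shelve

    # One-time step: write each permutation dict to a disk-backed shelf.
    with shelve.open('permutations.db') as db:
        for i, permutation in enumerate(generate_permutations()):
            db[f'perm-{i}'] = permutation  # pickled to disk, not held in RAM

    # Later: iterate lazily; only one permutation is in memory at a time.
    with shelve.open('permutations.db', flag='r') as db:
        for key in db:
            gather_statistics(db[key])

Unlike the line-per-record file, a shelf also gives random access by key, which helps if the test needs to revisit particular permutations.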