
Looking for RAM efficient way to compare many distributions in parallel in Python

I have to compare each distribution of measurement n with the distributions of all other measurements. I have about 500 measurements and 5000 distributions per measurement, so that's a lot of comparisons. I have the data in one CSV file:

                 distribution 1            distribution 2
measurement 1    [10,23,14,16,28,19,28]    [4,1,3,2,5,8,4,2,4,6]
measurement 2    [11,23,24,10,27,19,27]    [9,2,5,2,5,7,3,2,4,1]

As you can imagine, the file is huge, and since I have to do many comparisons I run them in parallel, and the RAM consumption is insane. If I split the file and only open it sample by sample, it's a bit better, but still not good, and it's also not very efficient.
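One way to avoid loading the whole file is to stream it row by row. A minimal sketch, assuming each cell holds a list literal like "[10,23,14]" (the exact cell format in your CSV may differ):

```python
import ast
import csv

def iter_measurements(path):
    # Stream the CSV one measurement (row) at a time instead of
    # loading the whole file; only the current row stays in RAM.
    with open(path, newline="") as f:
        for row in csv.reader(f):
            # row[0] is the measurement label; the remaining cells
            # are assumed to be Python list literals.
            yield row[0], [ast.literal_eval(cell) for cell in row[1:]]
```

This keeps peak memory at roughly one row, but on its own it doesn't give random access, which is where a database helps.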

My idea was to create some kind of database and query only the cells needed, but I have never done that, so I don't know whether it would be RAM-heavy and reasonably efficient.
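The database idea is workable with nothing beyond the standard library. A minimal sketch using sqlite3, with distributions serialized as JSON (the table layout and file name are made up for illustration); you load the CSV once, then each comparison fetches only the two cells it needs:

```python
import json
import sqlite3

# Hypothetical schema: one row per (measurement, distribution) cell.
conn = sqlite3.connect("distributions.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS dist ("
    "  measurement INTEGER,"
    "  distribution INTEGER,"
    "  values_json TEXT,"
    "  PRIMARY KEY (measurement, distribution))"
)

def store(measurement, distribution, values):
    conn.execute(
        "INSERT OR REPLACE INTO dist VALUES (?, ?, ?)",
        (measurement, distribution, json.dumps(values)),
    )

def load(measurement, distribution):
    # Only the requested cell is read into RAM.
    row = conn.execute(
        "SELECT values_json FROM dist WHERE measurement=? AND distribution=?",
        (measurement, distribution),
    ).fetchone()
    return json.loads(row[0]) if row else None

store(1, 1, [10, 23, 14, 16, 28, 19, 28])
conn.commit()
```

SQLite caches pages rather than the whole table, so memory stays bounded even with 500 × 5000 cells, at the cost of per-query overhead.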

This probably has something to do with when objects are destroyed. One way to limit RAM usage is to limit the number of threads: instead of starting every comparison up front, process them, say, four at a time (assuming four threads per process). That way comparisons finish continuously rather than all at the end, and the garbage collector can free the objects of the solved cases as you go.

I am just spitballing here; a bit of code would be helpful. Maybe you are already doing that?
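A minimal sketch of that batching idea, with a made-up mean-difference metric standing in for the real comparison (substitute e.g. a KS statistic or Wasserstein distance):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations, islice

def compare(pair):
    # Hypothetical comparison; replace with your real metric.
    (i, a), (j, b) = pair
    return i, j, abs(sum(a) / len(a) - sum(b) / len(b))

def all_comparisons(distributions, workers=4, batch=256):
    # Submit comparisons in bounded batches, so at most `batch`
    # tasks and their results are in flight at once; finished
    # results can be garbage-collected as soon as the caller
    # is done with them.
    pairs = combinations(enumerate(distributions), 2)
    with ThreadPoolExecutor(max_workers=workers) as ex:
        while True:
            chunk = list(islice(pairs, batch))
            if not chunk:
                break
            yield from ex.map(compare, chunk)

results = list(all_comparisons([[10, 23, 14], [11, 23, 24], [4, 1, 3]], workers=2))
print(len(results))  # 3 pairwise comparisons
```

If the comparison is pure-Python and CPU-bound, the same structure works with a process pool instead of threads; the key point is bounding how many comparisons exist in memory at once, not which pool type you use.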
