
Looking for RAM efficient way to compare many distributions in parallel in Python

I have to compare each distribution of measurement n with the distributions of all other measurements. I have about 500 measurements and 5000 distributions per measurement, so that is a lot of comparisons. I have the data in one CSV file:

              distribution 1          distribution 2
measurement 1 [10,23,14,16,28,19,28]  [4,1,3,2,5,8,4,2,4,6]
measurement 2 [11,23,24,10,27,19,27]  [9,2,5,2,5,7,3,2,4,1]

As you can imagine, the file is huge, and since I have to do many comparisons, I run them in parallel and the RAM consumption is insane. If I split the file and only open it sample by sample, it is a bit better, but still not good, and it is also not very efficient.
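For reference, this is roughly what I do to read one measurement at a time instead of loading the whole file. A minimal sketch, assuming the bracketed lists are stored as quoted cells in a comma-delimited CSV; the file name and the ast-based parsing are assumptions about the format:

```python
import ast
import csv

def iter_measurements(path="measurements.csv"):
    """Yield one measurement (row) at a time instead of loading everything."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the "distribution 1, distribution 2, ..." header
        for row in reader:
            name, *cells = row
            # each cell looks like "[10,23,14,...]"; parse it into a list
            yield name, [ast.literal_eval(cell) for cell in cells]

for name, dists in iter_measurements():
    ...  # compare the distributions of this measurement against the others
```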

My idea was to create some kind of database and query only the cells needed, but I have never done that, so I don't know whether it would be RAM-heavy or reasonably efficient.
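One way to realize the "query only the cells needed" idea without a full database server is an HDF5 file, where each distribution is its own dataset and a worker reads only the two arrays it compares. A minimal sketch, assuming a one-time conversion from the CSV; the file name, group layout, and sample data are hypothetical:

```python
import numpy as np
import h5py

# One-time conversion: store each distribution as its own dataset, so the
# differing lengths of the distributions are not a problem.
with h5py.File("measurements.h5", "w") as f:
    # dists[m][d] would come from parsing the CSV row by row
    dists = {0: {0: [10, 23, 14, 16, 28, 19, 28],
                 1: [4, 1, 3, 2, 5, 8, 4, 2, 4, 6]}}
    for m, per_measurement in dists.items():
        for d, values in per_measurement.items():
            f.create_dataset(f"m{m}/d{d}", data=np.asarray(values))

# Later, any worker loads exactly the two distributions it compares;
# RAM usage stays at two small arrays per worker.
with h5py.File("measurements.h5", "r") as f:
    a = f["m0/d0"][:]  # reads only this one dataset from disk
    b = f["m0/d1"][:]
```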

This probably has something to do with objects not being destroyed. One way to limit RAM usage would be to limit the number of worker threads. Then you don't start every comparison at the beginning, work through them four at a time (assuming you have four threads per process), and only let the garbage collector start destroying the objects of the solved cases an hour later when everything finishes. Something like the sketch below.
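A minimal sketch of capping concurrency with a fixed-size pool (processes rather than threads, since the comparisons are CPU-bound). The `compare` function, the Wasserstein distance as the comparison metric, and the sample data are assumptions, not from your question:

```python
from itertools import combinations
from multiprocessing import Pool

from scipy.stats import wasserstein_distance  # one possible comparison metric

def compare(pair):
    a, b = pair
    return wasserstein_distance(a, b)

if __name__ == "__main__":
    samples = [[10, 23, 14, 16, 28, 19, 28],
               [4, 1, 3, 2, 5, 8, 4, 2, 4, 6],
               [11, 23, 24, 10, 27, 19, 27]]
    pairs = combinations(samples, 2)
    # maxtasksperchild recycles workers, so memory from finished
    # comparisons is returned to the OS instead of accumulating.
    with Pool(processes=4, maxtasksperchild=100) as pool:
        # imap pulls pairs lazily, so not all comparisons are queued at once
        for result in pool.imap(compare, pairs, chunksize=50):
            print(result)
```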

I am just spitballing here. A bit of code would be helpful. Maybe you are already doing that?
