Fastest way to create and fill huge numpy 2D-array?

I have to create and fill a huge array (e.g. 96 GB, 72000 rows * 72000 columns) with floats, where each cell's value comes from mathematical formulas. The array will be processed afterwards.
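
For reference, a dense 72000 × 72000 array of 8-byte floats occupies 72000 × 72000 × 8 bytes ≈ 41.5 GB on its own (roughly half that as float32), so the full array may not fit in RAM at once.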

import itertools, operator, time, copy, os, sys
import numpy
from multiprocessing import Pool


def f2(x):  # stands in for more complex mathematical formulas that change according to values in *i* and *x*
    temp = []
    for i in combine:
        temp.append(0.2 * x[1] * i[1] / 64.23)
    return temp

def combinations_with_replacement_counts(n, r):  # provide all combinations of r balls in n boxes
    size = n + r - 1
    for indices in itertools.combinations(range(size), n - 1):
        starts = [0] + [index + 1 for index in indices]
        stops = indices + (size,)
        yield tuple(map(operator.sub, stops, starts))

combine = list(combinations_with_replacement_counts(3, 60))  # 60 used here, but 350 is needed
print(len(combine))

if __name__ == '__main__':
    t1 = time.time()
    pool = Pool()              # start worker processes
    results = [pool.apply_async(f2, (x,)) for x in combine]
    roots = [r.get() for r in results]
    print(roots[0:3])
    pool.close()
    pool.join()
    print(time.time() - t1)
  • What's the fastest way to create and fill such a huge numpy array? Filling lists, then aggregating them, then converting into a numpy array?
  • Knowing that the cells/columns/rows of the 2D array are independent, can the computation be parallelized to speed up filling the array? Any clues/trails for optimizing such a computation with multiprocessing?

I know that you can create shared numpy arrays that can be changed from different processes (assuming that the changed areas don't overlap). Here is a sketch of the code that you can use to do that (I saw the original idea somewhere on Stack Overflow; edit: here it is https://stackoverflow.com/a/5550156/1269140):

import multiprocessing as mp
import numpy as np
import ctypes

def shared_zeros(n1, n2):
    # create a 2D numpy array backed by shared memory, which can then be
    # changed from different worker processes
    shared_array_base = mp.Array(ctypes.c_double, n1 * n2)
    shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
    shared_array = shared_array.reshape(n1, n2)
    return shared_array

class singleton:
    # module-level holder so forked workers can find the shared array
    arr = None

def dosomething(i):
    # fill row i of the shared array; rows written by different workers don't overlap
    singleton.arr[i, :] = i
    return i

def main():
    singleton.arr = shared_zeros(1000, 1000)  # create before the pool so children inherit it
    pool = mp.Pool(16)
    pool.map(dosomething, range(1000))

if __name__ == '__main__':
    main()
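
Note that the shared array is created before the Pool so that the forked worker processes inherit the singleton.arr reference. On platforms that spawn rather than fork (e.g. Windows), this sketch won't work as-is; you would have to hand the shared base array to the workers explicitly, for example via the pool's initializer argument.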

You can create an empty numpy.memmap array with the desired shape, and then use multiprocessing.Pool to populate its values. Doing it correctly would also keep the memory footprint of each process in your pool relatively small.
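
Here is a minimal sketch of that approach. The file name, chunk size, and per-row formula below are illustrative assumptions, not part of the original answer; it also assumes each row can be computed independently:

import numpy as np
from multiprocessing import Pool

SHAPE = (72000, 72000)     # illustrative; matches the question's dimensions
FILENAME = 'result.dat'    # illustrative file name

def fill_rows(bounds):
    start, stop = bounds
    # each worker opens its own view of the same on-disk array
    arr = np.memmap(FILENAME, dtype=np.float64, mode='r+', shape=SHAPE)
    for i in range(start, stop):
        arr[i, :] = 0.2 * i / 64.23   # placeholder for the real per-row formula
    arr.flush()

if __name__ == '__main__':
    # create the backing file once with the desired shape
    np.memmap(FILENAME, dtype=np.float64, mode='w+', shape=SHAPE).flush()
    chunks = [(i, min(i + 1000, SHAPE[0])) for i in range(0, SHAPE[0], 1000)]
    with Pool() as pool:
        pool.map(fill_rows, chunks)

Since each worker only writes its own slice of rows and the data lives in one file on disk, only the pages currently being written need to be resident, which is what keeps the per-process memory footprint small.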
