
Numpy huge matrix dot product while multiprocessing

I'm implementing a special case of EM-GMM.

X is the data matrix of shape [1000000, 900] and is a numpy memmap object.
Q is a precision matrix of shape [900, 900] and is an ndarray.

I'm also using the multiprocessing library to go over 200 Q matrices concurrently on 40 cores, all against the same data matrix (X).
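Roughly, the setup looks like this (a simplified sketch; the worker function, names, and small stand-in shapes are illustrative, not my actual code):

import numpy as np
from multiprocessing import Pool

def init_worker(data):
    # Pool hands `data` to every worker at startup.
    global X
    X = data

def process_q(Q):
    # One task per precision matrix, all using the same X.
    return np.sum(X.dot(Q) * X, axis=1)

if __name__ == '__main__':
    n, d = 10000, 90  # small stand-ins for the real [1000000, 900]
    X_mmap = np.memmap('X.dat', dtype=np.float64, mode='w+', shape=(n, d))
    q_matrices = [np.eye(d) for _ in range(200)]
    with Pool(processes=40, initializer=init_worker,
              initargs=(X_mmap,)) as pool:
        results = pool.map(process_q, q_matrices)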

It works over smaller dimensions like [1mil, 196] and [1mil, 400],
but when I try to run the [1mil, 900] case, at some point one of the processes throws an exception:

OSError: [Errno 12] Cannot allocate memory

I guess the issue is caused by two big calculations I have, which probably allocate big matrices.

As part of the E-step I need to calculate:
np.sum(X.dot(Q) * X, axis=1)
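For reference, this computes the quadratic form x_i^T Q x_i for every row x_i of X; an equivalent way to write it (shown only to make that explicit) is:

# result[i] = sum over j, k of X[i, j] * Q[j, k] * X[i, k]
l = np.einsum('ij,jk,ik->i', X, Q, X)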

As part of the M-step I need to calculate (W is a [1mil, 1] weights vector):
(X.T * W).dot(X)
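Treating W as a flat length-n vector of per-sample weights, this is the weighted scatter matrix sum_i W[i] * x_i x_i^T; an equivalent formulation is:

# [900, 900] result: X.T times a row-weighted copy of X
sigma = X.T.dot(W.reshape(-1, 1) * X)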

In the future I will have to run this EM-GMM over even bigger data (of shape [2mil, 2500] and even [2mil, 10k]).
What can I do to make those calculations more memory efficient?

EDIT:

I've noticed that the worker initialization uses pickle, so the X matrix is turned into an ndarray and the workers don't share it (which means the X matrix is duplicated across all workers and fills my RAM).

I have an idea of how to solve it, and will update if it's fixed.
But if anyone has a good idea of how to deal with it, I'll be grateful.

It turned out that there were two unrelated issues that caused the RAM overuse.

First, the memmap object was fully read from disk when pickled for the multiprocessing workers.
This duplication of the data allocated an extra 6.7 GB of RAM for each worker.
To solve this, I created a shared RawArray, loaded the data into it, and in each worker used np.frombuffer.
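Roughly, the fix looks like this (a sketch assuming a fork-based start method on Linux; the file name and helper names are illustrative):

import numpy as np
from multiprocessing import Pool, RawArray

n, d = 1000000, 900

def init_worker(shared, shape):
    # Wrap the shared buffer in an ndarray view -- no copy is made.
    global X
    X = np.frombuffer(shared, dtype=np.float64).reshape(shape)

if __name__ == '__main__':
    # Allocate the shared memory once and load the data into it.
    X_shared = RawArray('d', n * d)  # 'd' = C double = float64
    X_np = np.frombuffer(X_shared, dtype=np.float64).reshape(n, d)
    X_np[:] = np.memmap('X.dat', dtype=np.float64, mode='r', shape=(n, d))

    pool = Pool(processes=40, initializer=init_worker,
                initargs=(X_shared, (n, d)))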

Second, both X.dot(Q) and (X.T * W) resulted in numpy allocating another X-shaped matrix, which is another 6.7 GB of RAM.
I created a variation of the answer from this thread: https://stackoverflow.com/a/21096605/5572523
Since my matrix is tall and skinny, I sliced over rows:

def _block_slices(dim_size, block_size):
    # Yield consecutive row slices of at most block_size rows until
    # dim_size is covered. Ending the generator with a plain return
    # (rather than raising StopIteration) keeps it valid on
    # Python 3.7+ (PEP 479).
    count = 0
    while count < dim_size:
        yield slice(count, count + block_size, 1)
        count += block_size
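For example, _block_slices(10, 4) yields slice(0, 4, 1), slice(4, 8, 1) and slice(8, 12, 1); numpy clips the last slice to the array bounds, so the final batch simply comes out shorter.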

And now I can iterate over the data in batches (this also gave a bit of extra speedup by skipping rows with weight = 0).

I set max_elements = 2 ** 27 because I'm using float64: 2**27 elements × 8 bytes per element = 2**30 bytes, so each batch buffer is at most 1 GB.

So (X.T * W).dot(X) turned into:

def weighted_outer_prod(X, W):
    # Accumulate the [d, d] weighted scatter matrix block by block,
    # so only one batch-sized temporary is alive at a time.
    n, d = X.shape
    max_rows = max(1, int(max_elements / d))
    sigma = np.zeros([d, d])
    for mm in _block_slices(n, max_rows):
        sigma += batch_weighted_outer_prod(X[mm, :], W[mm])
    return sigma

def batch_weighted_outer_prod(batch, W):
    # Rows with zero weight contribute nothing, so drop them up front.
    nz = W > 0
    sub = batch[nz]
    buff = np.empty(sub.shape)
    np.multiply(sub, W[nz, np.newaxis], out=buff)  # row-wise weighting
    sigma = buff.T.dot(sub)                        # [d, d] contribution
    del buff
    return sigma
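Skipping the zero-weight rows inside each batch avoids both the multiply and the matrix product for them, and because each iteration touches at most max_rows rows, the peak extra memory stays around the 1 GB buffer no matter how large n gets.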

And np.sum(X.dot(Q) * X, axis=1) turned into (don't mind the function name):

def calc_l_k(X, Q):
    # Fill the per-sample result block by block.
    n, d = X.shape
    max_rows = max(1, int(max_elements / d))
    l_k = np.empty(n)
    for mm in _block_slices(n, max_rows):
        l_k[mm] = batch_l_k(X[mm, :], Q)
    return l_k


def batch_l_k(batch, Q):
    # One batch-sized buffer is reused for both the dot product and
    # the element-wise multiply, so only one temporary is allocated.
    buff = np.empty(batch.shape)
    np.dot(batch, Q, out=buff)
    np.multiply(buff, batch, out=buff)
    l_k = -0.5 * np.sum(buff, axis=1)
    del buff
    return l_k
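Putting the pieces together, a worker's pass over the data now looks roughly like:

l_k = calc_l_k(X, Q)               # [n] vector of -0.5 * x_i^T Q x_i
sigma = weighted_outer_prod(X, W)  # [d, d] weighted scatter matrix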

Now it runs with X of shape [1mil, 900], and I hope it'll still work with higher dimensions.
