
Reduce memory usage of a line of code that uses numpy

I am using the Python library:

https://github.com/ficusss/PyGMNormalize

to normalize my datasets (scRNA-seq). The last line of the library's utils.py file:

https://github.com/ficusss/PyGMNormalize/blob/master/pygmnormalize/utils.py

uses too much memory:

np.percentile(matrix[np.any(matrix > 0, axis=1)], p, axis=0)

Is there a good way of rewriting this line of code to reduce its memory usage? I have 200 GB of RAM available on the cluster, and with a matrix of about 20 GB this line fails, but I believe there should be a way of making it work.

If all elements of matrix are >=0, then you can do:

np.percentile(matrix[np.any(matrix, axis = 1)], p, axis = 0)

This uses the fact that any float or integer other than 0 is interpreted as True when viewed as a boolean (which np.any does internally). It saves you from building that big boolean matrix separately.
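A quick way to check that equivalence on a small non-negative array (the array `demo` is made up for illustration):

```python
import numpy as np

# Hypothetical small non-negative matrix; the second row is all zeros.
demo = np.array([[3.0, 0.0], [0.0, 0.0], [0.0, 2.5]])

mask_explicit = np.any(demo > 0, axis=1)  # builds a full boolean temporary first
mask_truthy = np.any(demo, axis=1)        # relies on nonzero values being truthy

print(mask_explicit)                               # [ True False  True]
print(np.array_equal(mask_explicit, mask_truthy))  # True
```

With negative values present the two masks can differ, so this shortcut only applies to non-negative data.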

Since you're boolean indexing in matrix[...], you're creating a temporary copy, and you don't care whether that copy gets overwritten during the percentile computation. You can therefore pass overwrite_input=True to save even more memory.

mat = matrix.copy()
perc = np.percentile(matrix[np.any(matrix, axis = 1)], p, axis = 0, overwrite_input = True)
np.array_equal(mat, matrix) # is `matrix` still the same?

True

Finally, depending on the rest of your architecture, I'd recommend looking into making matrix some flavor of scipy.sparse, which should significantly reduce your memory usage again (although with some drawbacks depending on the type you use).
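As a rough sketch of what the sparse route could look like, assuming non-negative data (so "nonzero" and "positive" coincide). The helper name `percentile_sparse_rows` is hypothetical and not part of PyGMNormalize:

```python
import numpy as np
from scipy import sparse

def percentile_sparse_rows(mat, p):
    """Column-wise percentile over rows with at least one nonzero entry.

    Hypothetical helper. Assumes non-negative data, so a nonzero entry
    is the same thing as a positive entry.
    """
    csr = sparse.csr_matrix(mat)
    csr.eliminate_zeros()                      # drop explicitly stored zeros
    keep = np.flatnonzero(csr.getnnz(axis=1))  # indices of rows with any nonzero
    csc = csr[keep].tocsc()                    # CSC makes column slicing cheap
    # Densify one column at a time, so only a single column is dense at once.
    return np.array([
        np.percentile(csc[:, j].toarray().ravel(), p)
        for j in range(csc.shape[1])
    ])

dense = np.array([[3.0, 0.0], [0.0, 0.0], [1.0, 2.0]])
result = percentile_sparse_rows(dense, 50)
print(result)  # [2. 1.]
```

In real use the matrix would be stored as sparse from the start rather than converted from a dense array, since the conversion itself needs the dense data in memory.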

I'm putting this as an answer since there's more than will fit in a comment, although it may not be complete. There are two suspicious things. First, percentile should run fine on a 20 GB matrix if your machine has 200 GB of RAM available. That's a lot of memory, so start looking into what else might be using it. Start with top: is another process using all of that memory, or is it your Python program?

The second suspicious thing is that the documentation for utils.percentile doesn't match its actual behavior. Here are the relevant bits from the code you've linked to:

def percentile(matrix, p):
    """
    Estimation of percentile without zeros.
    ....
    Returns
    -------
    float
        Calculated percentile.
    """
    return np.percentile(matrix[np.any(matrix > 0, axis=1)], p, axis=0)

What it actually does is return the column-wise percentile calculated over rows that contain at least one positive element. If all values are non-negative, those are exactly the rows that are not all zeros, but in general the result can be very different.

np.any(matrix > 0, axis=1) returns a boolean array that flags the rows containing at least one positive element. For example:

>>> np.any(np.array([[3, 4], [0, 0]]) > 0, axis=1)
array([ True, False])

>>> np.any(np.array([[3, 4], [1, 0]]) > 0, axis=1)
array([ True,  True])

>>> np.any(np.array([[3, 0], [1, 0]]) > 0, axis=1)
array([ True,  True])

That array is used to index matrix , which selects only rows which are not all zeros and returns those. You should read over the numpy docs for indexing if you aren't familiar with that way of indexing.

Calculating that takes a lot of memory: matrix > 0 creates a boolean array with the same shape as matrix, and the indexing then creates a copy of matrix that probably contains most of the rows.
So that is probably 2-4 GB for the boolean array (one byte per element) and close to 20 GB for the copy.
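To make those sizes concrete, here is a small sketch that measures the temporaries on a stand-in array. The shape and the float64 dtype are assumptions for illustration, not taken from the question:

```python
import numpy as np

# Hypothetical small stand-in for the real matrix; float64 is an assumption.
matrix = np.zeros((1000, 100), dtype=np.float64)
matrix[::2, 0] = 1.0  # every other row gets one positive value

bool_temp = matrix > 0                # full-size boolean temporary, 1 byte per element
row_mask = np.any(bool_temp, axis=1)  # one flag per row
selected = matrix[row_mask]           # boolean indexing always makes a copy

print(matrix.nbytes)     # 800000
print(bool_temp.nbytes)  # 100000
print(selected.nbytes)   # 400000
```

For float64 data the boolean temporary is one eighth of the matrix size, and the copy scales with the fraction of rows kept. np.percentile may allocate further working space on top of these.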

That can be reduced:

# Flag rows that contain at least one positive value, one row at a time to reduce memory
mask = np.array([np.any(r > 0) for r in matrix])
# Find the percentile for each column, excluding the unflagged rows
perc = [np.percentile(c[mask], p) for c in matrix.T]
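The same idea can be wrapped in a function that builds the mask in blocks of rows rather than one row at a time, which avoids a slow Python-level loop over every row. This is only a sketch: the name `percentile_low_memory` and the `block_rows` parameter are hypothetical, and a scalar `p` is assumed.

```python
import numpy as np

def percentile_low_memory(matrix, p, block_rows=10_000):
    """Column-wise percentile over rows with at least one positive value.

    Hypothetical helper; block_rows is an illustrative tuning knob.
    """
    n_rows, n_cols = matrix.shape
    mask = np.empty(n_rows, dtype=bool)
    # Build the row mask block by block so the boolean temporary stays small.
    for start in range(0, n_rows, block_rows):
        block = matrix[start:start + block_rows]
        mask[start:start + block_rows] = np.any(block > 0, axis=1)
    # One column at a time: each temporary copy is only one column long.
    result = np.empty(n_cols, dtype=np.float64)
    for j in range(n_cols):
        result[j] = np.percentile(matrix[mask, j], p)
    return result

# Compare against the original one-liner on small random data.
rng = np.random.default_rng(0)
demo = rng.poisson(0.3, size=(500, 8)).astype(np.float64)
expected = np.percentile(demo[np.any(demo > 0, axis=1)], 75, axis=0)
print(np.allclose(percentile_low_memory(demo, 75, block_rows=64), expected))  # True
```

The trade-off is speed: pulling single columns out of a row-major array is strided access, so this is likely slower than the one-liner while keeping peak memory close to the size of the matrix itself.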

However, as stated earlier that doesn't match the function documentation.

There may be a reason for this logic, but it is odd. If you don't know the reason for it, you might be fine calling np.percentile directly; just check that it returns a close value on a smaller subset of your data. There's also np.nanpercentile, which can be used the same way but ignores NaN values.
You can use boolean indexing to replace the values you don't want included with NaN (i.e. matrix[matrix < 0] = np.nan, which requires a floating-point array and modifies it in place) and then call that.
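A minimal illustration of that approach on made-up data (the array `values` is hypothetical):

```python
import numpy as np

# Hypothetical data; a float dtype is required because integers cannot hold NaN.
values = np.array([[1.0, -1.0], [3.0, 4.0], [5.0, 6.0]])

work = values.copy()     # copy so the original data is left untouched
work[work < 0] = np.nan  # mark the unwanted entries as missing
medians = np.nanpercentile(work, 50, axis=0)
print(medians)  # [3. 5.]
```

On a matrix as large as the one in the question the copy is itself expensive, so doing the replacement in place may be preferable if the original values are not needed afterwards.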


 