
Memory efficient sort of massive numpy array in Python

I need to sort a VERY large genomic dataset using numpy. I have an array of 2.6 billion floats, dimensions = (868940742, 3), which takes up about 20GB of memory on my machine once loaded and just sitting there. I have an early 2015 13" MacBook Pro with 16GB of RAM, a 500GB solid-state drive and a 3.1 GHz Intel i7 processor. Just loading the array overflows to virtual memory, but not to the point where my machine suffers or I have to stop everything else I am doing.

I build this VERY large array step by step from 22 smaller (N, 2) subarrays.

Function FUN_1 generates 2 new (N, 1) arrays using each of the 22 subarrays, which I call sub_arr.

The first output of FUN_1 is generated by interpolating values from sub_arr[:,0] on the array b = array([X, F(X)]), and the second output is generated by placing sub_arr[:, 0] into bins using the array r = array([X, BIN(X)]). I call these outputs b_arr and rate_arr, respectively. The function returns a 3-tuple of (N, 1) arrays:

import numpy as np

def FUN_1(sub_arr):
    """interpolate b values and rates based on position in sub_arr"""

    b = np.load(bfile)  # lookup table b = array([X, F(X)]) used for interpolation
    r = np.load(rfile)  # lookup table r = array([X, BIN(X)]) used for binning

    b_arr = np.interp(sub_arr[:, 0], b[:, 0], b[:, 1])
    rate_idx = np.searchsorted(r[:, 0], sub_arr[:, 0])  # HUGE efficiency gain over np.digitize...

    # return the binned rate values, the interpolated b values and the original value column
    return r[rate_idx, 1], b_arr, sub_arr[:, 1]

I call the function 22 times in a for-loop and fill a pre-allocated array of zeros, full_arr = np.zeros([868940742, 3]), with the values:

full_arr[:,0], full_arr[:,1], full_arr[:,2] = FUN_1(sub_arr)

In terms of saving memory at this step, I think this is the best I can do, but I'm open to suggestions. Either way, I don't run into problems up through this point, and it only takes about 2 minutes.
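
For reference, a sketch of what that fill loop might look like (assuming the 22 sub-arrays are available in an iterable, here called sub_arrs, which is hypothetical):

full_arr = np.zeros([868940742, 3])

i = 0
for sub_arr in sub_arrs:  # hypothetical iterable holding the 22 (N, 2) sub-arrays
    j = i + len(sub_arr)
    # each call fills the next block of rows, one returned array per column
    full_arr[i:j, 0], full_arr[i:j, 1], full_arr[i:j, 2] = FUN_1(sub_arr)
    i = j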

Here is the sorting routine (there are two consecutive sorts):

for idx in range(2):
    sort_idx = np.argsort(full_arr[:,idx])
    full_arr = full_arr[sort_idx]
    # ...
    # <additional processing, return small (1000, 3) array of stats>

Now this sort had been working, albeit slowly (it takes about 10 minutes). However, I recently started using a larger, finer-resolution table of [X, F(X)] values for the interpolation step above in FUN_1 that returns b_arr, and now the sort really slows down, although everything else remains the same.

Interestingly, I am not even sorting on the interpolated values at the step where the sort is now lagging. Here are some snippets of the different interpolation files - the smaller one is about 30% smaller in each case and far more uniform in terms of values in the second column; the slower one has a higher resolution and many more unique values, so the results of interpolation are likely more unique, but I'm not sure if this should have any kind of effect...?

The bigger, slower file:

17399307    99.4
17493652    98.8
17570460    98.2
17575180    97.6
17577127    97
17578255    96.4
17580576    95.8
17583028    95.2
17583699    94.6
17584172    94

The smaller, more uniform regular file:

1       24  
1001    24  
2001    24  
3001    24  
4001    24  
5001    24
6001    24
7001    24

I'm not sure what could be causing this issue, and I would be interested in any suggestions, or just general input about sorting in this type of memory-limited case!

At the moment each call to np.argsort is generating an array of 868940742 int64 indices, which will take up ~7 GB just by itself. Additionally, when you use these indices to sort full_arr you are generating another (868940742, 3) array of floats, since fancy indexing always returns a copy rather than a view.

One fairly obvious improvement would be to sort full_arr in place using its .sort() method. Unfortunately, .sort() does not allow you to directly specify a row or column to sort by. However, you can specify a field to sort by for a structured array. You can therefore force an in-place sort over one of the three columns by getting a view onto your array as a structured array with three float fields, then sorting by one of these fields:

full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)

In this case I'm sorting full_arr in place by the 0th field, which corresponds to the first column. Note that I've assumed that there are three float64 columns ('f8') - you should change this accordingly if your dtype is different. This also requires that your array is contiguous and in row-major format, i.e. full_arr.flags.C_CONTIGUOUS == True.

Credit for this method should go to Joe Kington for his answer here.
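
For concreteness, a minimal sketch of the trick on a small stand-in array (a hypothetical 5x3 float64 array in place of full_arr):

import numpy as np

a = np.random.rand(5, 3)                         # small, C-contiguous stand-in for full_arr
a.view('f8, f8, f8').sort(order=['f0'], axis=0)  # reorders whole rows, in place, by column 0
print(np.all(np.diff(a[:, 0]) >= 0))             # True: the first column is now ascending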


Although it requires less memory, sorting a structured array by field is unfortunately much slower compared with using np.argsort to generate an index array, as you mentioned in the comments below (see this previous question). If you use np.argsort to obtain a set of indices to sort by, you might see a modest performance gain by using np.take rather than direct indexing to get the sorted array:

%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
x[idx]
# 1 loops, best of 100: 148 µs per loop

%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
np.take(x, idx, axis=0)
# 1 loops, best of 100: 42.9 µs per loop

However, I wouldn't expect to see any difference in terms of memory usage, since both methods will generate a copy.


Regarding your question about why sorting the second array is faster - yes, you should expect any reasonable sorting algorithm to be faster when there are fewer unique values in the array, because on average there's less work for it to do. Suppose I have a random sequence of digits between 1 and 10:

5  1  4  8  10  2  6  9  7  3

There are 10! = 3628800 possible ways to arrange these digits, but only one in which they are in ascending order. Now suppose there are just 5 unique digits:

4  4  3  2  3  1  2  5  1  5

Now there are 2⁵ = 32 ways to arrange these digits in ascending order, since I could swap any pair of identical digits in the sorted vector without breaking the ordering.

By default, np.ndarray.sort() uses Quicksort. The qsort variant of this algorithm works by recursively selecting a 'pivot' element in the array, then reordering the array such that all the elements less than the pivot value are placed before it, and all of the elements greater than the pivot value are placed after it. Values that are equal to the pivot are already sorted. Having fewer unique values means that, on average, more values will be equal to the pivot value on any given sweep, and therefore fewer sweeps are needed to fully sort the array.

For example:

%%timeit -n 1 -r 100 x = np.random.random_integers(0, 10, 100000)
x.sort()
# 1 loops, best of 100: 2.3 ms per loop

%%timeit -n 1 -r 100 x = np.random.random_integers(0, 1000, 100000)
x.sort()
# 1 loops, best of 100: 4.62 ms per loop

In this example the dtypes of the two arrays are the same. If your smaller array has a smaller item size compared with the larger array, then the cost of copying it due to the fancy indexing will also be smaller.

EDIT: In case anyone new to programming and numpy comes across this post, I want to point out the importance of considering the np.dtype that you are using. In my case, I was actually able to get away with using half-precision floating point, i.e. np.float16, which reduced a 20GB object in memory to 5GB and made sorting much more manageable. The default used by numpy is np.float64, which is a lot of precision that you may not need. Check out the doc here, which describes the capacity of the different data types. Thanks to @ali_m for pointing this out in the comments.
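
To put rough numbers on that saving without allocating anything, a quick back-of-the-envelope sketch using the array shape from the question:

import numpy as np

n_elems = 868940742 * 3
print(n_elems * np.dtype(np.float64).itemsize)  # 20854577808 bytes, ~20.9 GB
print(n_elems * np.dtype(np.float16).itemsize)  # 5213644452 bytes, ~5.2 GB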

I did a bad job explaining this question, but I have discovered some helpful workarounds that I think would be useful to share for anyone who needs to sort a truly massive numpy array.

I am building a very large numpy array from 22 "sub-arrays" of human genome data containing the elements [position, value]. Ultimately, the final array must be numerically sorted "in place" based on the values in a particular column and without shuffling the values within rows.

The sub-array dimensions follow the form:

arr1.shape = (N1, 2)
...
arr22.shape = (N22, 2)

sum([N1..N22]) = 868940742, i.e. there are close to 1BN positions to sort.

First I process the 22 sub-arrays with the function process_sub_arrs, which returns a 3-tuple of 1D arrays the same length as the input. I stack the 1D arrays into a new (N, 3) array and insert them into an np.zeros array initialized for the full dataset:

    full_arr = np.zeros([868940742, 3])
    i, j = 0, 0

    for arr in list(arr1..arr22):  
        # indices (i, j) incremented at each loop based on sub-array size
        j += len(arr)
        full_arr[i:j, :] = np.column_stack( process_sub_arrs(arr) )
        i = j

    return full_arr

EDIT: Since I realized my dataset could be represented with half-precision floats, I now initialize full_arr as follows: full_arr = np.zeros([868940742, 3], dtype=np.float16), which is only 1/4 the size and much easier to sort.

The result is a massive 20GB array:

full_arr.nbytes = 20854577808
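
(That figure checks out: 868940742 rows × 3 columns × 8 bytes per float64 = 20,854,577,808 bytes, roughly 20.9 GB.)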

As @ali_m pointed out in his detailed post, my earlier routine was inefficient:

sort_idx = np.argsort(full_arr[:,idx])
full_arr = full_arr[sort_idx]

The array sort_idx, which is 33% the size of full_arr, hangs around and wastes memory after full_arr is sorted. This sort also supposedly generates a copy of full_arr due to "fancy" indexing, potentially pushing memory use to 233% of what is already used to hold the massive array! This is the slow step, lasting about ten minutes and relying heavily on virtual memory.

I'm not sure the "fancy" sort makes a persistent copy, however. Watching the memory usage on my machine, it seems that full_arr = full_arr[sort_idx] deletes the reference to the unsorted original, because after about 1 second all that is left is the memory used by the sorted array and the index, even if there is a transient copy.
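
One way to check this directly is to watch the process's resident memory around the assignment; a sketch, assuming the third-party psutil package is installed:

import os
import psutil

proc = psutil.Process(os.getpid())

def rss_gb():
    """current resident set size of this process, in GB"""
    return proc.memory_info().rss / 1e9

print(rss_gb())                # before: unsorted array + index
full_arr = full_arr[sort_idx]  # fancy indexing builds the sorted copy and drops the old reference
print(rss_gb())                # shortly after: only the sorted array + index remain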

A more compact usage of argsort() to save memory is this one:

    full_arr = full_arr[full_arr[:,idx].argsort()]

This still causes a spike at the time of the assignment, where both a transient index array and a transient copy are made, but the memory is almost instantly freed again.

@ali_m pointed out a nice trick (credited to Joe Kington) for generating a de facto structured array with a view on full_arr. The benefit is that these may be sorted "in place", maintaining stable row order:

full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)

Views work great for performing mathematical array operations, but for sorting this was far too inefficient for even a single sub-array from my dataset. In general, structured arrays just don't seem to scale very well, even though they have really useful properties. If anyone has any idea why this is, I would be interested to know.

One good option to minimize memory consumption and improve performance with very large arrays is to build a pipeline of small, simple functions. Functions clear their local variables once they have completed, so if intermediate data structures are building up and sapping memory, this can be a good solution.

This is a sketch of the pipeline I've used to speed up the massive array sort:

def process_sub_arrs(arr):
    """process a sub-array and return a 3-tuple of 1D values arrays"""

    return values1, values2, values3

def build_arr():
    """build the initial array by joining processed sub-arrays"""

    full_arr = np.zeros([868940742, 3])
    i, j = 0, 0

    for arr in list(arr1..arr22):  
        # indices (i, j) incremented at each loop based on sub-array size
        j += len(arr)
        full_arr[i:j, :] = np.column_stack( process_sub_arrs(arr) )
        i = j

    return full_arr

def sort_arr():
    """return full_arr and sort_idx"""

    full_arr = build_arr()
    sort_idx = np.argsort(full_arr[:, index])

    return full_arr[sort_idx]

def get_sorted_arr():
    """call through nested functions to return the sorted array"""

    sorted_arr = sort_arr()
    <process sorted_arr>

    return statistics

Call stack: get_sorted_arr --> sort_arr --> build_arr --> process_sub_arrs

Once each inner function has completed, get_sorted_arr() is finally left holding just the sorted array, and then returns a small array of statistics.

EDIT: It is also worth pointing out here that even if you are able to use a more compact dtype to represent your huge array, you will want to use higher precision for summary calculations. For example, since full_arr.dtype = np.float16, the command np.mean(full_arr[:,idx]) tries to calculate the mean in half-precision floating point, but this quickly overflows when summing over a massive array. Using np.mean(full_arr[:,idx], dtype=np.float64) will prevent the overflow.
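
A small sketch of that pattern (idx is a placeholder column index; forcing a float64 accumulator is a safe default for any summary statistic computed over a float16 array):

# summary statistics over a half-precision column, accumulated in double precision
col_mean = np.mean(full_arr[:, idx], dtype=np.float64)
col_sum = np.sum(full_arr[:, idx], dtype=np.float64)
col_var = np.var(full_arr[:, idx], dtype=np.float64)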

I posted this question initially because I was puzzled by the fact that a dataset of identical size suddenly began choking up my system memory, although there was a big difference in the proportion of unique values in the new "slow" set. @ali_m pointed out that, indeed, more uniform data with fewer unique values is easier to sort:

The qsort variant of Quicksort works by recursively selecting a 'pivot' element in the array, then reordering the array such that all the elements less than the pivot value are placed before it, and all of the elements greater than the pivot value are placed after it. Values that are equal to the pivot are already sorted, so intuitively, the fewer unique values there are in the array, the smaller the number of swaps there are that need to be made.

On that note, the final change I ended up making to attempt to resolve this issue was to round the newer dataset in advance, since there was an unnecessarily high level of decimal precision left over from an interpolation step. This ultimately had an even bigger effect than the other memory-saving steps, showing that the sort algorithm itself was the limiting factor in this case.
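
A sketch of that rounding step (assuming, for illustration only, that the interpolated values sit in column 1 and that two decimal places are enough; the temporary made on the right-hand side is only one column wide):

# collapse near-duplicate interpolated values into far fewer unique values before sorting
full_arr[:, 1] = np.round(full_arr[:, 1], decimals=2)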

I look forward to any other comments or suggestions on this topic, and I almost certainly misspoke about some technical issues, so I would be glad to hear back :-)
