
numpy - fastest way to build 2d array with permuted copies of numpy 1d array

>>> import numpy as np
>>> a = np.arange(5)
>>> b = desired_function(a, 4)
>>> b
array([[0, 3, 4, 1],
       [1, 2, 1, 3],
       [2, 4, 2, 4],
       [3, 1, 3, 0],
       [4, 0, 0, 2]])

What I've tried so far

def repeat_and_shuffle(a, ncols):
    nrows, = a.shape
    m = np.tile(a.reshape(nrows, 1), (1, ncols))
    return m

Somehow I have to shuffle m[:,1:ncols] efficiently by column.
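One direct way to finish that function is a sketch like the following (my own completion, not from any of the answers: it shuffles each column after the first in place, and the per-column Python loop is exactly the overhead the vectorized answers try to avoid):

```python
import numpy as np

def repeat_and_shuffle(a, ncols):
    # Tile `a` into ncols identical columns...
    m = np.tile(a[:, None], (1, ncols))
    # ...then shuffle every column except the first, in place.
    # np.random.shuffle works on the 1-D column views of m.
    for j in range(1, ncols):
        np.random.shuffle(m[:, j])
    return m
```

Each column of the result is then a permutation of a, with column 0 left untouched.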

Here is one way to create such an array:

>>> a = np.arange(5)
>>> perms = np.argsort(np.random.rand(a.shape[0], 3), axis=0) # 3 columns
>>> np.hstack((a[:,np.newaxis], a[perms]))
array([[0, 3, 1, 4],
       [1, 2, 3, 0],
       [2, 1, 4, 1],
       [3, 4, 0, 3],
       [4, 0, 2, 2]])

This creates an array of random values of the required shape and then sorts the indices in each column by their corresponding value. This array of indices is then used to index a.

(The idea of using np.argsort to create an array of columns of permuted indices came from @jme's answer here.)
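The key fact behind this trick is that argsort applied to a column of i.i.d. random values yields a uniformly random permutation of the indices. A small sketch (seeded only so it is reproducible):

```python
import numpy as np

np.random.seed(0)                   # seeded purely for reproducibility
vals = np.random.rand(5, 3)         # one column of i.i.d. uniforms per permutation
perms = np.argsort(vals, axis=0)    # argsort of i.i.d. uniforms = random permutation
# every column of `perms` is a permutation of 0..4
for j in range(3):
    assert sorted(perms[:, j].tolist()) == [0, 1, 2, 3, 4]
```

Since ties among continuous uniforms occur with probability zero, all orderings of each column are equally likely.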

Build the new array using random permutations of the original.

>>> a = np.arange(5)
>>> n = 4
>>> z = np.array([a]+[np.random.permutation(a) for _ in range(n-1)])
>>> z.T
array([[0, 0, 4, 3],
       [1, 1, 3, 0],
       [2, 3, 2, 4],
       [3, 2, 0, 2],
       [4, 4, 1, 1]])
>>> 

Duplicate columns are possible because of the randomness.
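As an aside (not part of the answer above): if duplicate columns matter for your application, you can detect them after the fact with np.unique along axis=1 (available in NumPy 1.13+):

```python
import numpy as np

z = np.array([[0, 0, 0],
              [1, 1, 2],
              [2, 2, 1]])
# Count distinct columns; fewer than z.shape[1] means duplicates occurred.
n_distinct = np.unique(z, axis=1).shape[1]
assert n_distinct == 2
```

You could then regenerate the offending permutations, though for long columns duplicates are vanishingly unlikely anyway.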

This is a version of Ashwini Chaudhary's solution:

>>> a = numpy.array(['a', 'b', 'c', 'd', 'e'])
>>> a = numpy.tile(a[:,None], 5)
>>> a[:,1:] = numpy.apply_along_axis(numpy.random.permutation, 0, a[:,1:])
>>> a
array([['a', 'c', 'a', 'd', 'c'],
       ['b', 'd', 'b', 'e', 'a'],
       ['c', 'e', 'd', 'a', 'e'],
       ['d', 'a', 'e', 'b', 'd'],
       ['e', 'b', 'c', 'c', 'b']],
      dtype='|S1')

I think it's well-conceived and pedagogically useful (and I hope he undeletes it). But somewhat surprisingly, it's consistently the slowest one in the tests I've performed. Definitions:

>>> def column_perms_along(a, cols):
...     a = numpy.tile(a[:,None], cols)
...     a[:,1:] = numpy.apply_along_axis(numpy.random.permutation, 0, a[:,1:])
...     return a
... 
>>> def column_perms_argsort(a, cols):
...     perms = np.argsort(np.random.rand(a.shape[0], cols - 1), axis=0)
...     return np.hstack((a[:,None], a[perms]))
... 
>>> def column_perms_lc(a, cols):
...     z = np.array([a] + [np.random.permutation(a) for _ in range(cols - 1)])
...     return z.T
... 

For small arrays and few columns:

>>> %timeit column_perms_along(a, 5)
1000 loops, best of 3: 272 µs per loop
>>> %timeit column_perms_argsort(a, 5)
10000 loops, best of 3: 23.7 µs per loop
>>> %timeit column_perms_lc(a, 5)
1000 loops, best of 3: 165 µs per loop

For small arrays and many columns:

>>> %timeit column_perms_along(a, 500)
100 loops, best of 3: 29.8 ms per loop
>>> %timeit column_perms_argsort(a, 500)
10000 loops, best of 3: 185 µs per loop
>>> %timeit column_perms_lc(a, 500)
100 loops, best of 3: 11.7 ms per loop

For big arrays and few columns:

>>> A = numpy.arange(1000)
>>> %timeit column_perms_along(A, 5)
1000 loops, best of 3: 2.97 ms per loop
>>> %timeit column_perms_argsort(A, 5)
1000 loops, best of 3: 447 µs per loop
>>> %timeit column_perms_lc(A, 5)
100 loops, best of 3: 2.27 ms per loop

And for big arrays and many columns:

>>> %timeit column_perms_along(A, 500)
1 loops, best of 3: 281 ms per loop
>>> %timeit column_perms_argsort(A, 500)
10 loops, best of 3: 71.5 ms per loop
>>> %timeit column_perms_lc(A, 500)
1 loops, best of 3: 269 ms per loop

The moral of the story: always test! I imagine that for extremely large arrays, the disadvantage of an n log n solution like sorting might become apparent here. But numpy's implementation of sorting is extremely well-tuned in my experience. I bet you could go up several orders of magnitude before noticing an effect.

Assuming you are ultimately intending to loop over multiple 1D input arrays, you might be able to cache your permutation indices and then simply take at the point of use rather than permuting each time. This can work even if the length of the 1D arrays varies: you just need to discard the permutation indices that are too large.

Rough (partially tested) code for an implementation:

def permute_multi(X, k, _cache={}):
    """For 1D input `X` of length `n`, generate a `(k, n)` array
    giving `k` permutations of `X`."""
    n = len(X)
    cached_inds = _cache.get('inds', np.empty((0, 0), dtype=int))

    # make sure that cached_inds has shape >= (k, n); regenerate if not
    if cached_inds.shape[0] < k or cached_inds.shape[1] < n:
        k_new = max(k, cached_inds.shape[0])
        n_new = max(n, cached_inds.shape[1])
        _cache['inds'] = cached_inds = np.empty((k_new, n_new), dtype=int)
        for i in range(k_new):
            cached_inds[i, :] = np.random.permutation(n_new)

    inds = cached_inds[:k, :]  # dispose of excess rows

    if n < cached_inds.shape[1]:
        # dispose of indices that are too large for this input
        inds = inds.compress(inds.ravel() < n).reshape((k, n))

    return X[inds]

Depending on your usage you might want to provide some way of clearing the cache, or at least some heuristic that can spot when the cached n and k have grown much larger than most of the common inputs. Note that the above function gives (k,n), not (n,k): this is because numpy defaults to rows being contiguous and we want the n-dimension to be contiguous. You could force Fortran order if you wish, or just transpose the output (which flips a flag inside the array rather than really moving data).

In terms of whether this caching concept is statistically valid, I believe that in most cases it is probably fine, since it is roughly equivalent to resetting the seed at the start of the function to a fixed constant... but if you are doing anything particularly fancy with the returned array, you might need to think carefully before using this approach.

A quick benchmark says that (once warmed up), for n=1000 and k=1000, this takes about 2.2 ms, compared to 150 ms for the full k-loop over np.random.permutation. That is about 70 times faster... but that's the simplest case, where we don't call compress. For n=999 and k=1000, having warmed up with n=1000, it takes an extra few ms, giving about 8 ms total, which is still about 19 times faster than the k-loop.
