Fast column shuffle of each row in NumPy

I have a large array of 10,000,000+ rows, and I need to shuffle each row individually. For example:

[[1,2,3]
 [1,2,3]
 [1,2,3]
 ...
 [1,2,3]]

to

[[3,1,2]
 [2,1,3]
 [1,3,2]
 ...
 [1,2,3]]

I'm currently using

map(numpy.random.shuffle, array)

But it's a Python (not NumPy) loop and it's taking 99% of my execution time. Sadly, the PyPy JIT doesn't implement numpypy.random, so I'm out of luck. Is there any faster way? I'm willing to use any library (pandas, scikit-learn, scipy, theano, etc.) as long as it uses a NumPy ndarray or a derivative.

If not, I suppose I'll resort to Cython or C++.

If the number of columns is small enough that all of their permutations can be enumerated, then you could do this:

import itertools as IT
import numpy as np

def using_perms(array):
    nrows, ncols = array.shape
    perms = np.array(list(IT.permutations(range(ncols))))
    choices = np.random.randint(len(perms), size=nrows)
    i = np.arange(nrows).reshape(-1, 1)
    return array[i, perms[choices]]

N = 10**7
array = np.tile(np.arange(1,4), (N,1))
print(using_perms(array))

yields (something like)

[[3 2 1]
 [3 1 2]
 [2 3 1]
 [1 2 3]
 [3 1 2]
 ...
 [1 3 2]
 [3 1 2]
 [3 2 1]
 [2 1 3]
 [1 3 2]]
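
To make the indexing step in using_perms easier to follow, here is a minimal sketch on a tiny array. The array values, the seed, and the names small, rows, and cols are illustrative assumptions, not part of the answer; it also uses the newer np.random.default_rng interface rather than np.random.randint.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

small = np.array([[10, 20, 30],
                  [40, 50, 60],
                  [70, 80, 90],
                  [11, 22, 33]])
nrows, ncols = small.shape

perms = np.array(list(itertools.permutations(range(ncols))))  # shape (6, 3)
choices = rng.integers(len(perms), size=nrows)                # one permutation id per row
cols = perms[choices]                                         # shape (nrows, ncols)
rows = np.arange(nrows).reshape(-1, 1)                        # shape (nrows, 1)

# rows broadcasts against cols, so shuffled[r, c] == small[r, cols[r, c]]
shuffled = small[rows, cols]

# Each row must still hold the same values, only reordered.
assert np.array_equal(np.sort(shuffled, axis=1), np.sort(small, axis=1))
```

Because every row picks its own entry from perms, rows are shuffled independently of one another, which is what the question asks for.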

Here is a benchmark comparing it to

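def using_shuffle(array):
    # Shuffle each row in place with a Python-level loop.
    # (On Python 3, map() is lazy, so an explicit loop is needed.)
    for row in array:
        np.random.shuffle(row)
    return array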

In [151]: %timeit using_shuffle(array)
1 loops, best of 3: 7.17 s per loop

In [152]: %timeit using_perms(array)
1 loops, best of 3: 2.78 s per loop

Edit: CT Zhu's method is faster than mine:

def using_Zhu(array):
    nrows, ncols = array.shape
    all_perm = np.array(list(IT.permutations(range(ncols))))
    b = all_perm[np.random.randint(0, all_perm.shape[0], size=nrows)]
    # offset each row's column indices by that row's start in the flattened array
    return (array.flatten()[(b + ncols*np.arange(nrows)[..., np.newaxis]).flatten()]
            ).reshape(array.shape)

In [177]: %timeit using_Zhu(array)
1 loops, best of 3: 1.7 s per loop

Here is a slight variation of Zhu's method which may be even a bit faster:

def using_Zhu2(array):
    nrows, ncols = array.shape
    all_perm = np.array(list(IT.permutations(range(ncols))))
    b = all_perm[np.random.randint(0, all_perm.shape[0], size=nrows)]
    # take() indexes the flattened array, so no explicit flatten() copy is needed
    return array.take((b + ncols*np.arange(nrows)[..., np.newaxis]).ravel()).reshape(array.shape)

In [201]: %timeit using_Zhu2(array)
1 loops, best of 3: 1.46 s per loop

Here are some ideas:

In [10]: a=np.zeros(shape=(1000,3))

In [12]: a[:,0]=1

In [13]: a[:,1]=2

In [14]: a[:,2]=3

In [17]: %timeit map(np.random.shuffle, a)
100 loops, best of 3: 4.65 ms per loop

In [21]: all_perm=np.array((list(itertools.permutations([0,1,2]))))

In [22]: b=all_perm[np.random.randint(0,6,size=1000)]

In [25]: %timeit (a.flatten()[(b+3*np.arange(1000)[...,np.newaxis]).flatten()]).reshape(a.shape)
1000 loops, best of 3: 393 us per loop

If there are only a few columns, then the number of possible permutations is much smaller than the number of rows in the array (in this case, with only 3 columns, there are only 6 possible permutations). A way to make it faster is to generate all the permutations once up front and then rearrange each row by randomly picking one of them.
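
The flat-index arithmetic in the timed expression can be traced by hand on a two-row array. This is only an illustrative sketch: the permutations in b are hand-picked rather than random so that the result is predictable, and the names flat_idx and result are assumptions introduced here.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
nrows, ncols = a.shape

# One column permutation per row (hand-picked here instead of random).
b = np.array([[2, 0, 1],
              [1, 2, 0]])

# Row r starts at flat position r * ncols, so add that offset to each column index.
flat_idx = b + ncols * np.arange(nrows)[:, np.newaxis]   # [[2, 0, 1], [4, 5, 3]]

result = a.ravel()[flat_idx.ravel()].reshape(a.shape)
print(result)
# [[3 1 2]
#  [5 6 4]]
```

The literal 3 in the session above plays the role of ncols here, so it has to change if the array has a different number of columns.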

It still appears to be about 10 times faster even with a much larger array:

#adjust a accordingly
In [32]: b=all_perm[np.random.randint(0,6,size=1000000)]

In [33]: %timeit (a.flatten()[(b+3*np.arange(1000000)[...,np.newaxis]).flatten()]).reshape(a.shape)
1 loops, best of 3: 348 ms per loop

In [34]: %timeit map(np.random.shuffle, a)
1 loops, best of 3: 4.64 s per loop
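
The permutation table has ncols! rows, so it stops being practical once the array is more than a handful of columns wide. For that case, one alternative that is not from the answers here is to sort random keys per row; the sketch below assumes a reasonably recent NumPy (np.take_along_axis, np.random.default_rng), and the function name shuffle_rows_argsort is hypothetical. I have not benchmarked it against the timings above.

```python
import numpy as np

def shuffle_rows_argsort(array, rng):
    """Return a copy of `array` with each row independently shuffled."""
    # Sorting i.i.d. uniform keys yields a uniformly random permutation per row
    # (ties between float keys are vanishingly unlikely).
    keys = rng.random(array.shape)
    order = np.argsort(keys, axis=1)
    return np.take_along_axis(array, order, axis=1)

rng = np.random.default_rng(0)  # arbitrary seed

# 20 columns: 20! permutations is far too many to tabulate.
wide = np.tile(np.arange(20), (1000, 1))

out = shuffle_rows_argsort(wide, rng)

# NumPy >= 1.20 also provides Generator.permuted, which shuffles along an axis directly.
out2 = rng.permuted(wide, axis=1)
```

Both calls return a new array and leave wide untouched; Generator.permuted also accepts an out argument if an in-place shuffle is wanted.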

You can also try the apply function in pandas

import pandas as pd

df = pd.DataFrame(array)
df = df.apply(lambda x:np.random.shuffle(x) or x, axis=1)

And then extract the NumPy array from the DataFrame

print(df.values)

I believe I have an alternate, equivalent strategy, building upon the previous answers:

import itertools
import numpy as np

# original sequence
a0 = np.arange(3) + 1

# length of original sequence
L = a0.shape[0]

# number of random samples/shuffles (must be an int to be used as a size)
N_samp = 10**4

# from above
all_perm = np.array(list(itertools.permutations(np.arange(L))))
b = all_perm[np.random.randint(0, len(all_perm), size=N_samp)]

# index a0 with b for each row of b and collapse down to expected dimension
a_samp = a0[np.newaxis, b][0]

I'm not sure how this compares performance-wise, but I like it for its readability.
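
Note that this strategy samples shuffles of a single sequence a0, so it only applies when every row starts out identical. Under that assumption, the newaxis-then-[0] indexing appears to be equivalent to indexing a0 with b directly, as this small check suggests (the seed and the names via_newaxis and direct are illustrative):

```python
import itertools
import numpy as np

a0 = np.arange(3) + 1
all_perm = np.array(list(itertools.permutations(range(len(a0)))))

rng = np.random.default_rng(0)  # arbitrary seed
b = all_perm[rng.integers(len(all_perm), size=5)]

via_newaxis = a0[np.newaxis, b][0]   # shape (1, 5, 3) collapsed to (5, 3)
direct = a0[b]                       # shape (5, 3) in one step

assert np.array_equal(via_newaxis, direct)
```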
