
Shuffling non-zero elements of each row in an array - Python / NumPy

I have an array that is relatively sparse, and I would like to go through each row and shuffle only the non-zero elements.

Example Input:

[2,3,1,0]
[0,0,2,1]

Example Output:

[2,1,3,0]
[0,0,1,2]

Note how the zeros have not changed position.

To shuffle all elements in each row (including zeros) I can do this:

for i in range(len(X)):
    np.random.shuffle(X[i, :])

What I tried to do then is this:

for i in range(len(X)):
    np.random.shuffle(X[i, np.nonzero(X[i, :])])

But it has no effect. I've noticed that the return type of X[i, np.nonzero(X[i, :])] is different from that of X[i, :], which might be the cause.

In[30]: X[i, np.nonzero(X[i, :])]
Out[30]: array([[23,  5, 29, 11, 17]])

In[31]: X[i, :]
Out[31]: array([23,  5, 29, 11, 17])
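
A likely explanation, sketched here on the assumption of standard NumPy indexing semantics: indexing with the integer arrays returned by np.nonzero is advanced indexing, which returns a copy, so np.random.shuffle shuffles a temporary array and X is left untouched. A plain slice such as X[i, :] returns a view, which is why the first loop works. The small array below is the example from the question.

```python
import numpy as np

X = np.array([[2, 3, 1, 0], [0, 0, 2, 1]])

row_view = X[0, :]                      # basic slicing: a view into X
row_copy = X[0, np.nonzero(X[0, :])]    # advanced indexing: a new array

print(np.shares_memory(X, row_view))    # True
print(np.shares_memory(X, row_copy))    # False
print(row_copy.shape)                   # (1, 3): the tuple from np.nonzero adds a leading axis
```

The extra leading axis also accounts for the doubled brackets in Out[30].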

You could use the non-inplace numpy.random.permutation with explicit non-zero indexing:

>>> X = np.array([[2,3,1,0], [0,0,2,1]])
>>> for i in range(len(X)):
...     idx = np.nonzero(X[i])
...     X[i][idx] = np.random.permutation(X[i][idx])
... 
>>> X
array([[3, 2, 1, 0],
       [0, 0, 2, 1]])
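
The loop above can be wrapped in a small function and checked against the two properties the question asks for. This is only a sketch: the function name shuffle_nonzeros_per_row is made up for illustration, and the checks assume a 2-D integer array.

```python
import numpy as np

def shuffle_nonzeros_per_row(X):
    """Shuffle the non-zero entries of each row of a 2-D array in place."""
    for row in X:                          # iterating a 2-D array yields row views
        idx = np.nonzero(row)[0]           # column positions of the non-zeros
        row[idx] = np.random.permutation(row[idx])
    return X

X = np.array([[2, 3, 1, 0], [0, 0, 2, 1]])
before = X.copy()
shuffle_nonzeros_per_row(X)

# Zeros stay where they were, and each row keeps the same multiset of values.
assert np.array_equal(X == 0, before == 0)
assert np.array_equal(np.sort(X, axis=1), np.sort(before, axis=1))
```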

I think I found the three-liner?

i, j = np.nonzero(a.astype(bool))
k = np.argsort(i + np.random.rand(i.size))
a[i,j] = a[i,j[k]]
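
A brief sketch of why this appears to work, based on the fact that np.nonzero returns indices in row-major order: the row indices i are non-decreasing, and adding noise drawn from [0, 1) to integer row numbers cannot push a key past the next row's integer, so the argsort only reorders positions within a row. The assertions are added here for illustration and are not part of the original answer.

```python
import numpy as np

a = np.array([[2, 3, 1, 0], [0, 0, 2, 1]])
before = a.copy()

i, j = np.nonzero(a)                        # row-major order, so i is non-decreasing
k = np.argsort(i + np.random.rand(i.size))  # random order inside each row's block of keys

assert np.array_equal(i[k], i)              # k never moves a position to another row
a[i, j] = a[i, j[k]]                        # right-hand side is read before the assignment

assert np.array_equal(a == 0, before == 0)
assert np.array_equal(np.sort(a, axis=1), np.sort(before, axis=1))
```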

As promised, this being the fourth day of the bounty period, here's my attempt at a vectorized solution. The steps involved are explained in some detail below. For easy reference, let's call the input array a.

  • Generate unique indices per row that cover the range of the row length. For this, we can simply generate random numbers of the same shape as the input array and get the argsort indices along each row, which would be those unique indices. This idea has been explored before in this post.

  • Index into each row of the input array with those indices as column indices. Thus, we would need advanced-indexing here. This gives us an array with each row shuffled. Let's call it b.

  • Since the shuffling is restricted to each row, if we simply use boolean-indexing, b[b!=0], we get the non-zero elements shuffled and still grouped by row, with as many elements per row as that row has non-zeros. This is because boolean-indexing traverses a NumPy array in row-major order, so it selects the shuffled non-zero elements of each row before moving on to the next row. Similarly, a[a!=0] gives the non-zero elements of each row before moving on to the next row, in their original order. So, the final step is to grab the masked elements b[b!=0] and assign them into the masked places a[a!=0].

Thus, an implementation covering the three steps above would be -

m,n = a.shape
rand_idx = np.random.rand(m,n).argsort(axis=1) #step1
b = a[np.arange(m)[:,None], rand_idx]          #step2  
a[a!=0] = b[b!=0]                              #step3 

A sample step-by-step run might make things clearer - 逐步进行示例操作可能会使情况更清楚-

In [50]: a # Input array
Out[50]: 
array([[ 8,  5,  0, -4],
       [ 0,  6,  0,  3],
       [ 8,  5,  0, -4]])

In [51]: m,n = a.shape # Store shape information

# Unique indices per row that covers the range for row length
In [52]: rand_idx = np.random.rand(m,n).argsort(axis=1)

In [53]: rand_idx
Out[53]: 
array([[0, 2, 3, 1],
       [1, 0, 3, 2],
       [2, 3, 0, 1]])

# Get corresponding indexed array
In [54]: b = a[np.arange(m)[:,None], rand_idx]

# Do a check on the shuffling being restricted to per row
In [55]: a[a!=0]
Out[55]: array([ 8,  5, -4,  6,  3,  8,  5, -4])

In [56]: b[b!=0]
Out[56]: array([ 8, -4,  5,  6,  3, -4,  8,  5])

# Finally do the assignment based on masking on a and b
In [57]: a[a!=0] = b[b!=0]

In [58]: a # Final verification on desired result
Out[58]: 
array([[ 8, -4,  0,  5],
       [ 0,  6,  0,  3],
       [-4,  8,  0,  5]])
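
For reuse, the three steps can be packaged as a function. The following is a sketch rather than the answer's own code: it assumes a 2-D array, returns a shuffled copy instead of modifying the input, and swaps in np.random.default_rng and np.take_along_axis for the np.random.rand call and the explicit np.arange indexing used above. The name shuffle_nonzeros_rowwise and the rng parameter are made up for illustration.

```python
import numpy as np

def shuffle_nonzeros_rowwise(a, rng=None):
    """Return a copy of 2-D array `a` with the non-zeros of each row shuffled."""
    rng = np.random.default_rng() if rng is None else rng
    out = a.copy()
    rand_idx = rng.random(a.shape).argsort(axis=1)   # step 1: one column permutation per row
    b = np.take_along_axis(a, rand_idx, axis=1)      # step 2: every row shuffled, zeros included
    out[out != 0] = b[b != 0]                        # step 3: both masks are read in row-major order
    return out

a = np.array([[8, 5, 0, -4], [0, 6, 0, 3], [8, 5, 0, -4]])
res = shuffle_nonzeros_rowwise(a, np.random.default_rng(0))

# Zeros keep their positions and each row keeps its values.
assert np.array_equal(res == 0, a == 0)
assert np.array_equal(np.sort(res, axis=1), np.sort(a, axis=1))
```

Passing a seeded generator makes a run reproducible, which helps when comparing against the step-by-step output above.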

Benchmarking for the vectorized solutions

We are looking to benchmark vectorized solutions in this post. Vectorization tries to avoid looping through each row to do the shuffling, so the setup for the input array involves a greater number of rows.

Approaches -

def app1(a): # @Daniel F's soln
    i, j = np.nonzero(a.astype(bool))
    k = np.argsort(i + np.random.rand(i.size))
    a[i,j] = a[i,j[k]]
    return a

def app2(x): # @kazemakase's soln
    r, c = np.where(x != 0)
    n = c.size
    perm = np.random.permutation(n)
    i = np.argsort(perm + r * n)
    x[r, c] = x[r, c[i]]
    return x

def app3(a): # @Divakar's soln
    m,n = a.shape
    rand_idx = np.random.rand(m,n).argsort(axis=1)
    b = a[np.arange(m)[:,None], rand_idx]
    a[a!=0] = b[b!=0]
    return a

from scipy.ndimage import labeled_comprehension  # scipy.ndimage.measurements is a deprecated namespace
def app4(a): # @FabienP's soln
    def func(array, idx):
        r[idx] = np.random.permutation(array)
        return True
    labels, idx = nz = a.nonzero()
    r = a[nz]
    labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1),\
                                func, int, 0, pass_positions=True)
    a[nz] = r
    return a

Benchmarking procedure #1

For fair benchmarking, it seemed reasonable to use the OP's sample and simply stack it as more rows to get a bigger dataset. With that setup we could create two cases, with 2-million-row and 20-million-row datasets.

Case #1 : Large dataset with 2*1,000,000 rows

In [174]: a = np.array([[2,3,1,0],[0,0,2,1]])

In [175]: a = np.vstack([a]*1000000)

In [176]: %timeit app1(a)
     ...: %timeit app2(a)
     ...: %timeit app3(a)
     ...: %timeit app4(a)
     ...: 
1 loop, best of 3: 264 ms per loop
1 loop, best of 3: 422 ms per loop
1 loop, best of 3: 254 ms per loop
1 loop, best of 3: 14.3 s per loop

Case #2 : Larger dataset with 2*10,000,000 rows

In [177]: a = np.array([[2,3,1,0],[0,0,2,1]])

In [178]: a = np.vstack([a]*10000000)

# app4 skipped here as it was slower on the previous smaller dataset
In [179]: %timeit app1(a)
     ...: %timeit app2(a)
     ...: %timeit app3(a)
     ...: 
1 loop, best of 3: 2.86 s per loop
1 loop, best of 3: 4.62 s per loop
1 loop, best of 3: 2.55 s per loop

Benchmarking procedure #2 : Extensive one

To cover all cases of varying percentage of non-zeros in the input array, we cover some extensive benchmarking scenarios. Also, since app4 seemed much slower than the others, we skip it in this section for a closer inspection of the rest.

Helper function to set up the input array :

def in_data(n_col, nnz_ratio):
    # max no. of elems that my system can handle, i.e. stretching it to limits.
    # The idea is to use this to decide the number of rows and always use
    # max. possible dataset size
    num_elem = 10000000

    n_row = num_elem//n_col
    a = np.zeros((n_row, n_col),dtype=int)
    L = int(round(a.size*nnz_ratio))
    a.ravel()[np.random.choice(a.size, L, replace=False)] = np.random.randint(1,6,L)
    return a

Main timing and plotting script (uses IPython magic functions, so it needs to be run by copying and pasting into an IPython console) -

import matplotlib.pyplot as plt

# Setup input params
nnz_ratios = np.array([0.2, 0.4, 0.6, 0.8])
n_cols = np.array([4, 5, 8, 10, 15, 20, 25, 50])

init_arr1 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr2 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr3 = np.zeros((len(nnz_ratios), len(n_cols) ))

timings = {app1:init_arr1, app2:init_arr2, app3:init_arr3}
for i,nnz_ratio in enumerate(nnz_ratios):
    for j,n_col in enumerate(n_cols):
        a = in_data(n_col, nnz_ratio=nnz_ratio)
        for func in timings:
            res = %timeit -oq func(a)
            timings[func][i,j] = res.best
            print(func.__name__, i, j, res.best)

fig = plt.figure(1)
colors = ['b','k','r']
for i in range(len(nnz_ratios)):
    ax = plt.subplot(2,2,i+1)
    for f,func in enumerate(timings):
        ax.plot(n_cols, 
                [time for time in timings[func][i]], 
                label=str(func.__name__), color=colors[f])
    ax.set_xlabel('No. of cols')
    ax.set_ylabel('time [seconds]')
    ax.grid(which='both')
    ax.legend()
    plt.tight_layout()
    plt.title('Percentage non-zeros : '+str(int(100*nnz_ratios[i])) + '%')
plt.subplots_adjust(wspace=0.2, hspace=0.2)
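
Outside IPython, the %timeit magic is not available. A rough equivalent with the standard timeit module might look like the sketch below. It repeats app3 from above so that it runs on its own, and it uses a deliberately small input so it finishes quickly; the input size and the repeat counts are arbitrary choices, not the settings used for the timings reported in this post.

```python
import timeit
import numpy as np

def app3(a):
    m, n = a.shape
    rand_idx = np.random.rand(m, n).argsort(axis=1)
    b = a[np.arange(m)[:, None], rand_idx]
    a[a != 0] = b[b != 0]
    return a

# Small stacked version of the OP's sample, for illustration only
a = np.vstack([np.array([[2, 3, 1, 0], [0, 0, 2, 1]])] * 1000)

# Best of 3 repeats of 10 calls each, reported per call in seconds
best = min(timeit.repeat(lambda: app3(a), number=10, repeat=3)) / 10
print(f"app3: {best:.6f} s per call")
```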

Timings output -

[Image not included in this copy: timing plots, one subplot per non-zero percentage, time in seconds against number of columns]

Observations :

  • Approaches #1 and #2 do an argsort on the non-zero elements across the entire input array. As such, they perform better with a lower percentage of non-zeros.

  • Approach #3 creates random numbers of the same shape as the input array and then gets argsort indices per row. Thus, for a given number of non-zeros in the input, its timings rise more steeply than those of the first two approaches.

Conclusion :

Approach #1 seems to do pretty well up to the 60% non-zero mark. With more non-zeros, and if the rows are short, approach #3 seems to perform decently.

I came up with that:

nz = a.nonzero()                      # Get nonzero indexes
a[nz] = np.random.permutation(a[nz])  # Shuffle nonzero values with mask

Which looks simpler (and a little bit faster?) than the other proposed solutions.


EDIT: new version that does not mix rows

 labels, *idx = nz = a.nonzero()                                    # get masks
 a[nz] = np.concatenate([np.random.permutation(a[nz][labels == i])  # permute values
                         for i in np.unique(labels)])               # for each label

Where the first array of a.nonzero() (indexes of non-zero values along axis 0) is used as labels. This is the trick that does not mix rows.

Then np.random.permutation is applied on a[a.nonzero()] for each "label" independently.

Supposedly scipy.ndimage.measurements.labeled_comprehension can be used here, but it seems to fail with np.random.permutation.

And I finally saw that it looks a lot like what @randomir proposed. Anyway, it was just for the challenge of getting it to work.


EDIT2:

Finally got it working with scipy.ndimage.measurements.labeled_comprehension

from scipy.ndimage import labeled_comprehension

def shuffle_rows(a):
    def func(array, idx):
        r[idx] = np.random.permutation(array)
        return True
    labels, *idx = nz = a.nonzero()
    r = a[nz]
    labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1), func, int, 0, pass_positions=True)
    a[nz] = r
    return a

Where:

  1. func() shuffles the non-zero values
  2. labeled_comprehension applies func() label-wise

This replaces the previous for loop and will be faster on arrays with many rows.

This is one possibility for a vectorized solution:

r, c = np.where(x != 0)  # != 0 rather than > 0, so negative values are shuffled too
n = c.size

perm = np.random.permutation(n)
i = np.argsort(perm + r * n)

x[r, c] = x[r, c[i]]

The challenge in vectorizing this problem is that np.random.permutation gives only flat indices, which would shuffle the array elements across rows. Sorting the permuted values with an offset added makes sure no shuffling across rows occurs.
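
To see the offset at work, here is a tiny worked example with made-up row indices. Because perm holds distinct values in 0 .. n-1, adding r * n puts the keys of row 0 in [0, n) and the keys of row 1 in [n, 2n), so sorting the keys keeps each row's positions together.

```python
import numpy as np

r = np.array([0, 0, 0, 1, 1])    # row index of each non-zero, as returned by np.where
n = r.size

perm = np.random.permutation(n)  # distinct values in 0 .. n-1
keys = perm + r * n              # row 0 keys fall in [0, n), row 1 keys in [n, 2n)
i = np.argsort(keys)

print(r[i])                      # [0 0 0 1 1]: positions are permuted only within their own row
```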

Here's your two-liner without needing to install NumPy.

from random import random

def shuffle_nonzeros(input_list):
    ''' returns a list with the non-zero values shuffled '''
    shuffled_nonzero = iter(sorted((i for i in input_list if i!=0), key=lambda k: random()))
    return [i for i in (i if i==0 else next(shuffled_nonzero) for i in input_list)]

If you don't like one-liners though, you can either make this a generator with

def shuffle_nonzeros(input_list):
    ''' generator that yields a list with the non-zero values shuffled '''
    random_nonzero_values = iter(sorted((i for i in input_list if i!=0), key=lambda k: random()))
    for i in input_list:
        if i==0:
            yield i
        else:
            yield next(random_nonzero_values)

or, if you want a list as the output and don't like one-line comprehensions

def shuffle_nonzeros(input_list):
    ''' returns a list with the non-zero values shuffled '''
    out = []
    random_nonzero_values = iter(sorted((i for i in input_list if i!=0), key=lambda k: random()))
    for i in input_list:
        if i==0:
            out.append(i)
        else:
            out.append(next(random_nonzero_values))
    return out
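
Since the question is about a 2-D array, the per-row function has to be applied to each row. The sketch below is a variant, not the code above: it swaps the sort-by-random-key trick for random.shuffle from the standard library, and it assumes the data is a plain list of lists.

```python
import random

def shuffle_nonzeros(row):
    """Return a new list with the non-zero values of `row` shuffled."""
    nonzero = [v for v in row if v != 0]
    random.shuffle(nonzero)                 # in-place shuffle of the non-zero values
    values = iter(nonzero)
    # Keep zeros in place, fill every other slot with the next shuffled value
    return [v if v == 0 else next(values) for v in row]

rows = [[2, 3, 1, 0], [0, 0, 2, 1]]
shuffled = [shuffle_nonzeros(row) for row in rows]
print(shuffled)
```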
