
Better way to shuffle two numpy arrays in unison

I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them such that corresponding elements continue to correspond, i.e. shuffle them in unison with respect to their leading indices.

This code works, and illustrates my goals:

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b

For example:

>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
       [1, 1],
       [3, 3]]), array([2, 1, 3]))

However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays. I'd rather shuffle them in-place, since they'll be quite large.

Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.

One other thought I had was this:

def shuffle_in_unison_scary(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)

This works, but it's a little scary, as I see little guarantee it'll continue to work. It doesn't look like the sort of thing that's guaranteed to survive across NumPy versions, for example.

You can use NumPy's array indexing:

def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = numpy.random.permutation(len(a))
    return a[p], b[p]

This creates separate, unison-shuffled copies of the arrays.
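A short usage sketch of the indexing approach. It assumes the newer np.random.default_rng Generator API in place of the legacy numpy.random.permutation call, and the seed and sample arrays are purely illustrative.

```python
import numpy as np

def unison_shuffled_copies(a, b, rng):
    """Return shuffled copies of a and b that share one permutation."""
    assert len(a) == len(b)
    p = rng.permutation(len(a))
    # Indexing with an integer array always returns a copy, not a view.
    return a[p], b[p]

rng = np.random.default_rng(0)  # illustrative seed
a = np.array([[1, 1], [2, 2], [3, 3]])
b = np.array([1, 2, 3])
shuffled_a, shuffled_b = unison_shuffled_copies(a, b, rng)

# Row i of shuffled_a still matches element i of shuffled_b.
print(np.array_equal(shuffled_a[:, 0], shuffled_b))
# The inputs are untouched because indexing produced copies.
print(a.tolist(), b.tolist())
```

Because the results are copies, peak memory is roughly twice the size of the inputs, which matters for the large arrays mentioned in the question.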

import numpy as np
from sklearn.utils import shuffle

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
X, y = shuffle(X, y, random_state=0)

To learn more, see http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html

Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator give the same results in the second call to shuffle(), so the whole algorithm generates the same permutation.
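The argument above can be checked with a small sketch. It uses two 1-D arrays of equal length so the correspondence is easy to verify; the arrays themselves are illustrative.

```python
import numpy as np

a = np.arange(10)
b = np.arange(10) * 10          # b[i] == 10 * a[i] before shuffling

state = np.random.get_state()   # snapshot of the global legacy generator
np.random.shuffle(a)
np.random.set_state(state)      # rewind to the snapshot
np.random.shuffle(b)            # replays the same random draws

# Prints True if both arrays received the same permutation.
print(np.array_equal(b, 10 * a))
```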

If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.

Example: Let's assume the arrays a and b look like this:

a = numpy.array([[[  0.,   1.,   2.],
                  [  3.,   4.,   5.]],

                 [[  6.,   7.,   8.],
                  [  9.,  10.,  11.]],

                 [[ 12.,  13.,  14.],
                  [ 15.,  16.,  17.]]])

b = numpy.array([[ 0.,  1.],
                 [ 2.,  3.],
                 [ 4.,  5.]])

We can now construct a single array containing all the data:

c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[  0.,   1.,   2.,   3.,   4.,   5.,   0.,   1.],
#        [  6.,   7.,   8.,   9.,  10.,  11.,   2.,   3.],
#        [ 12.,  13.,  14.,  15.,  16.,  17.,   4.,   5.]])

Now we create views simulating the original a and b:

a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)

The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).

In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.

This solution could be adapted to the case where a and b have different dtypes.
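Putting the steps of this answer together, a minimal end-to-end sketch might look as follows. The input arrays are built with np.arange only so that each row's original index can be recovered after the shuffle; the name split is introduced here for illustration.

```python
import numpy as np

a = np.arange(18, dtype=float).reshape(3, 2, 3)
b = np.arange(6, dtype=float).reshape(3, 2)

# One row per sample: flattened a-part followed by flattened b-part.
c = np.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]

split = a.size // len(a)          # number of columns belonging to a
a2 = c[:, :split].reshape(a.shape)
b2 = c[:, split:].reshape(b.shape)

# Both reshapes are views onto c, so no data was copied.
assert np.shares_memory(a2, c) and np.shares_memory(b2, c)

np.random.shuffle(c)              # shuffles the rows of c in place

# a[k, 0, 0] == 6 * k and b[k, 0] == 2 * k, so both lines
# print the original row indices in their new order.
print(a2[:, 0, 0] / 6)
print(b2[:, 0] / 2)
```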

Very simple solution:

randomize = np.arange(len(x))
np.random.shuffle(randomize)
x = x[randomize]
y = y[randomize]

The two arrays x and y are now both randomly shuffled in the same way.

James wrote a helpful sklearn solution in 2015, but he passed a random_state argument, which is not needed. In the code below, NumPy's global random state is used automatically.

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y)
from numpy.random import permutation
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data #numpy array
y = iris.target #numpy array

# Data is currently unshuffled; we should shuffle 
# each X[i] with its corresponding y[i]
perm = permutation(len(X))
X = X[perm]
y = y[perm]

Shuffle any number of arrays together, in-place, using only NumPy.

import numpy as np


def shuffle_arrays(arrays, set_seed=-1):
    """Shuffles arrays in-place, in the same order, along axis=0

    Parameters:
    -----------
    arrays : List of NumPy arrays.
    set_seed : Seed value if int >= 0, else seed is random.
    """
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed

    for arr in arrays:
        rstate = np.random.RandomState(seed)
        rstate.shuffle(arr)

It can be used like this:

a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])

shuffle_arrays([a, b, c])

A few things to note:

  • The assert ensures that all input arrays have the same length along their first dimension.
  • Arrays are shuffled in-place along their first dimension; nothing is returned.
  • The random seed is drawn from the positive int32 range.
  • If a repeatable shuffle is needed, the seed value can be set.

After the shuffle, the data can be split using np.split or referenced using slices, depending on the application.
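As a sketch of the slicing step, the following shuffles two arrays with a fresh RandomState per array (the same idea as the function above, inlined so the example is self-contained) and then takes slices. The 80/20 ratio, the seed, and the names features and labels are hypothetical.

```python
import numpy as np

features = np.arange(20).reshape(10, 2)   # row k is [2k, 2k + 1]
labels = np.arange(10)                    # label k belongs to row k

seed = 42  # illustrative seed
for arr in (features, labels):
    # Same seed and same length give the same in-place row order.
    np.random.RandomState(seed).shuffle(arr)

n_train = int(0.8 * len(features))        # hypothetical 80/20 split
train_x, test_x = features[:n_train], features[n_train:]
train_y, test_y = labels[:n_train], labels[n_train:]

print(train_x.shape, test_x.shape)
```

Basic slices are views, so the split itself does not copy the shuffled data.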

You can make an index array like this:

s = np.arange(0, len(a), 1)

Then shuffle it:

np.random.shuffle(s)

Now use s as the index for your arrays. Indexing with the same shuffled index returns identically shuffled arrays.

x_data = x_data[s]
x_label = x_label[s]

There is a well-known function that can handle this:

from sklearn.model_selection import train_test_split
X, _, Y, _ = train_test_split(X,Y, test_size=0.0)

Setting test_size to 0 avoids splitting and gives you shuffled data. Though the function is usually used to split train and test data, it shuffles them too. Note that recent scikit-learn releases validate test_size and raise a ValueError for 0, so this only works on older versions.
From the documentation:

Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.

This seems like a very simple solution:

import numpy as np
def shuffle_in_unison(a,b):

    assert len(a)==len(b)
    c = np.arange(len(a))
    np.random.shuffle(c)

    return a[c],b[c]

a =  np.asarray([[1, 1], [2, 2], [3, 3]])
b =  np.asarray([11, 22, 33])

shuffle_in_unison(a,b)
Out[94]: 
(array([[3, 3],
        [2, 2],
        [1, 1]]),
 array([33, 22, 11]))

One way to shuffle connected lists in-place is to pick a seed (it could be random) and use numpy.random.shuffle to do the shuffling.

# Set seed to a random number if you want the shuffling to be non-deterministic.
def shuffle(a, b, seed):
   np.random.seed(seed)
   np.random.shuffle(a)
   np.random.seed(seed)
   np.random.shuffle(b)

That's it. This will shuffle both a and b in the exact same way. It is also done in-place, which is always a plus.

EDIT: don't use np.random.seed(); use np.random.RandomState instead.

def shuffle(a, b, seed):
   rand_state = np.random.RandomState(seed)
   rand_state.shuffle(a)
   rand_state.seed(seed)
   rand_state.shuffle(b)

When calling it, just pass in any seed to feed the random state:

a = [1,2,3,4]
b = [11, 22, 33, 44]
shuffle(a, b, 12345)

Output:

>>> a
[1, 4, 2, 3]
>>> b
[11, 44, 22, 33]

Edit: Fixed code to re-seed the random state.

Say we have two arrays, a and b:

a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[9,1,1],[6,6,6],[4,2,0]]) 

We can first obtain row indices by permuting the first dimension:

indices = np.random.permutation(a.shape[0])
[1 2 0]

Then use advanced indexing. Here we use the same indices to shuffle both arrays in unison.

a_shuffled = a[indices[:,np.newaxis], np.arange(a.shape[1])]
b_shuffled = b[indices[:,np.newaxis], np.arange(b.shape[1])]

This is equivalent to:

np.take(a, indices, axis=0)
[[4 5 6]
 [7 8 9]
 [1 2 3]]

np.take(b, indices, axis=0)
[[6 6 6]
 [4 2 0]
 [9 1 1]]
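The equivalence can be checked directly. This sketch fixes the permutation to [1, 2, 0] (the value shown above) instead of drawing it randomly, and also compares against plain row indexing a[indices], which is an addition not used in this answer.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
indices = np.array([1, 2, 0])   # fixed permutation for reproducibility

by_rows = a[indices]            # plain row indexing
by_grid = a[indices[:, np.newaxis], np.arange(a.shape[1])]
by_take = np.take(a, indices, axis=0)

# Prints True if all three forms select the same rows.
print(np.array_equal(by_rows, by_grid) and np.array_equal(by_rows, by_take))
```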

If you want to avoid copying arrays, then I would suggest that instead of generating a permutation list, you go through every element in the array and randomly swap it with another position in the array:

for old_index in range(len(a)):
    new_index = numpy.random.randint(old_index+1)
    # Integer-array indexing copies the rows, so the swap is safe for multi-dimensional arrays
    a[[old_index, new_index]] = a[[new_index, old_index]]
    b[[old_index, new_index]] = b[[new_index, old_index]]

This implements the Knuth-Fisher-Yates shuffle algorithm.

Shortest and easiest way, in my opinion: use a seed:
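A self-contained sketch of this in-place idea, assuming the np.random.default_rng Generator API; the function name shuffle_in_place and the seed are hypothetical. One pitfall worth noting: for multi-dimensional arrays, a plain tuple swap such as a[i], a[j] = a[j], a[i] operates on row views and can leave two identical rows, so the sketch swaps through integer-array indexing, which copies the two rows first.

```python
import numpy as np

def shuffle_in_place(a, b, rng):
    """Forward Fisher-Yates pass applied to a and b with the same swaps."""
    assert len(a) == len(b)
    for i in range(len(a)):
        j = rng.integers(i + 1)      # uniform integer in [0, i]
        # a[[j, i]] is a copy, so the swap is safe even when a[i] is a row view.
        a[[i, j]] = a[[j, i]]
        b[[i, j]] = b[[j, i]]

rng = np.random.default_rng(7)       # illustrative seed
a = np.arange(12).reshape(6, 2)      # row k is [2k, 2k + 1]
b = np.arange(6)
shuffle_in_place(a, b, rng)

# Prints True if row i of a still corresponds to element i of b.
print(np.array_equal(a[:, 0] // 2, b))
```

Only two rows are copied per step, so the extra memory stays small, but the Python-level loop will be much slower than a vectorized permutation for large arrays.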

random.seed(seed)
random.shuffle(x_data)
# reset the same seed to get the identical random sequence and shuffle the y
random.seed(seed)
random.shuffle(y_data)

As an example, this is what I'm doing:

combo = []
for i in range(60000):
    combo.append((images[i], labels[i]))

shuffle(combo)

im = []
lab = []
for c in combo:
    im.append(c[0])
    lab.append(c[1])
images = np.asarray(im)
labels = np.asarray(lab)

I extended Python's random.shuffle() to take a second argument:

def shuffle_together(x, y):
    assert len(x) == len(y)

    for i in reversed(range(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random.random() * (i+1))
        x[i], x[j] = x[j], x[i]
        y[i], y[j] = y[j], y[i]

That way I can be sure that the shuffling happens in-place, and the function is not too long or complicated.

Just use numpy...

First merge the two input arrays: the 1D array holds the labels (y) and the 2D array holds the data (x). Shuffle the merged array with NumPy's shuffle method, then split it and return the parts.

import numpy as np

def shuffle_2d(a, b):
    rows= a.shape[0]
    if b.shape != (rows,1):
        b = b.reshape((rows,1))
    S = np.hstack((b,a))
    np.random.shuffle(S)
    b, a  = S[:,0], S[:,1:]
    return a,b

features, samples = 2, 5
x, y = np.random.random((samples, features)), np.arange(samples)
x, y = shuffle_2d(x, y)

Most of the solutions above work. However, if you have column vectors, you have to transpose them first. Here is an example:

def shuffle(self) -> None:
    """
    Shuffles X and Y
    """
    x = self.X.T
    y = self.Y.T
    p = np.random.permutation(len(x))
    self.X = x[p].T
    self.Y = y[p].T
