
Convert 2d-array to 2d-array of unique values per row

I have a 2d-array of shape 5x4 like this:

array([[3, 3, 3, 3],
       [3, 3, 3, 3],
       [3, 3, 2, 2],
       [2, 2, 2, 2],
       [2, 2, 2, 2]])

And I'd like to obtain another array that contains arrays of unique values, something like this:

array([array([3]), array([3]), array([2, 3]), array([2]), array([2])],
      dtype=object)

I obtained that with the following code:

np.array([np.unique(row) for row in matrix], dtype=object)

However, this is not vectorized. How could I achieve the same with a vectorized NumPy operation?

NumPy arrays must have a regular (rectangular) shape, so a single array cannot hold one value for some rows and two or more for others. A workaround is to pad the shorter rows with a known value, e.g. np.nan (which requires a float dtype).
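As a rough sketch of the padding idea (not taken from the original answer): sort each row, overwrite repeated values with np.nan, and shift the padding to the right. The name unique_per_row_padded is made up for illustration, and the approach assumes the input contains no NaN of its own and that casting to float loses no precision.

```python
import numpy as np

def unique_per_row_padded(a, fill=np.nan):
    """Rows of sorted unique values, right-padded with `fill` (hypothetical helper)."""
    b = np.sort(a, axis=1).astype(float)
    # True where a value repeats its left neighbour in the sorted row
    dup = np.zeros(b.shape, dtype=bool)
    dup[:, 1:] = b[:, 1:] == b[:, :-1]
    b[dup] = fill
    # A stable sort on the mask moves padded slots to the right
    # while keeping the unique values in ascending order.
    order = np.argsort(dup, axis=1, kind="stable")
    return np.take_along_axis(b, order, axis=1)

arr = np.array([[3, 3, 3, 3],
                [3, 3, 3, 3],
                [3, 3, 2, 2],
                [2, 2, 2, 2],
                [2, 2, 2, 2]])
padded = unique_per_row_padded(arr)
print(padded[2])   # [ 2.  3. nan nan]
```

The output keeps the shape of the input, so it stays a plain 2-D float array that np.nanmax, np.isnan and similar functions can work on.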

np.unique accepts an axis argument. With axis=1 it returns the unique columns of the array: it compares whole columns, not the values inside each row. For the array in the question, the columns that survive happen to contain every row's unique values, although this is not guaranteed for arbitrary input:

import numpy as np

arr = np.array([[3, 3, 3, 3],
                [3, 3, 3, 3],
                [3, 3, 2, 2],
                [2, 2, 2, 2],
                [2, 2, 2, 2]])

np.unique(arr, axis=1)
# array([[3, 3],
#        [3, 3],
#        [2, 3],
#        [2, 2],
#        [2, 2]])

For this input the result is a regular array holding the unique values of each row. Some values are repeated so that all rows keep the same length, which is the price of having a rectangular array.
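One caveat worth checking: axis=1 makes np.unique compare whole columns, so the trick depends on the input. A minimal sketch with a made-up array x, whose columns are all distinct even though its first row repeats a value:

```python
import numpy as np

# Hypothetical input: all three columns differ, but row 0 repeats the value 1.
x = np.array([[1, 2, 1],
              [3, 3, 4]])

cols = np.unique(x, axis=1)            # de-duplicates columns, not row values
per_row = [np.unique(r) for r in x]    # what the question actually asks for

print(cols.shape)                      # (2, 3): no column was removed
print([r.tolist() for r in per_row])   # [[1, 2], [3, 4]]
```

Here the first row of cols still contains 1 twice, so the column-wise result does not match the per-row unique values.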

Here's one way to minimize the work done while iterating, which should help performance:

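import numpy as np

# a is the 2-D input array
b = np.sort(a, axis=1)                        # equal values become adjacent in each row
o = np.ones((len(a), 1), dtype=bool)          # the first element of a row always starts a run
mask = np.c_[o, b[:, :-1] != b[:, 1:]]        # True where a new value starts
c = b[mask]                                   # unique values of all rows, flattened
out = np.split(c, mask.sum(1).cumsum())[:-1]  # cut per row; drop the trailing empty piece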
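To make the intermediate steps easier to follow, here is the same snippet traced on the array from the question. The variable names first and cuts are introduced only for this illustration.

```python
import numpy as np

a = np.array([[3, 3, 3, 3],
              [3, 3, 3, 3],
              [3, 3, 2, 2],
              [2, 2, 2, 2],
              [2, 2, 2, 2]])

b = np.sort(a, axis=1)                       # row 2 becomes [2, 2, 3, 3]
first = np.ones((len(a), 1), dtype=bool)
mask = np.c_[first, b[:, :-1] != b[:, 1:]]   # row 2 becomes [True, False, True, False]
c = b[mask]                                  # [3, 3, 2, 3, 2, 2]
cuts = mask.sum(1).cumsum()                  # [1, 2, 4, 5, 6]
out = np.split(c, cuts)[:-1]                 # the last piece is empty, so drop it

print([u.tolist() for u in out])             # [[3], [3], [2, 3], [2], [2]]
```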

Looping with plain slices could be faster than np.split, because each iteration then does nothing but take a slice of c. The last step could be replaced by something like this:

idx = np.r_[0, mask.sum(1).cumsum()]  # start/stop offsets of each row's values in c
out = [c[i:j] for i, j in zip(idx[:-1], idx[1:])]
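Putting the pieces together, a self-contained sketch might look like the following. The function name unique_per_row and the random test input are assumptions added for illustration; the function expects a 2-D array with at least one column.

```python
import numpy as np

def unique_per_row(a):
    """List of 1-D arrays holding the sorted unique values of each row of `a`."""
    a = np.asarray(a)
    b = np.sort(a, axis=1)
    first = np.ones((len(a), 1), dtype=bool)
    mask = np.c_[first, b[:, :-1] != b[:, 1:]]
    c = b[mask]
    idx = np.r_[0, mask.sum(1).cumsum()]
    return [c[i:j] for i, j in zip(idx[:-1], idx[1:])]

# Compare against the straightforward per-row np.unique loop.
rng = np.random.default_rng(0)
a = rng.integers(0, 4, size=(6, 5))
out = unique_per_row(a)
ref = [np.unique(row) for row in a]
ok = all(np.array_equal(u, v) for u, v in zip(out, ref))
print(ok)   # True
```

The remaining Python loop only slices c, so all sorting and comparison work is done once by NumPy on the whole array.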
