Unique entries in columns of a 2D numpy array

I have an array of integers:

import numpy as np

demo = np.array([[1, 2, 3],
                 [1, 5, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [4, 2, 3],
                 [4, 2, 12],
                 [10, 11, 13]])

I want an array of the unique values in each column, padded with a fill value where necessary (e.g. nan):

[[1, 4, 7, 10, nan],
 [2, 5, 8, 11, nan],
 [3, 6, 9, 12,  13]]

It works when I iterate over the transposed array and use the boolean_indexing solution from a previous question, but I was hoping there would be a built-in method:

solution = []
for row in np.unique(demo.T, axis=1):
    solution.append(np.unique(row))

def boolean_indexing(v, fillval=np.nan):
    lens = np.array([len(item) for item in v])
    mask = lens[:,None] > np.arange(lens.max())
    out = np.full(mask.shape,fillval)
    out[mask] = np.concatenate(v)
    return out

print(boolean_indexing(solution))

AFAIK, there is no built-in solution for that. That being said, your solution seems a bit complex to me. You could create an array pre-filled with the padding value and fill it with a simple loop (since you already use loops anyway).

solution = [np.unique(row) for row in np.unique(demo.T, axis=1)]

result = np.full((len(solution), max(map(len, solution))), np.nan)
for i,arr in enumerate(solution):
    result[i][:len(arr)] = arr
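
As a self-contained variant of the loop above, here is a minimal sketch wrapped in a function. The name `unique_per_column` and its `fill` parameter are made up for illustration. It iterates over `a.T` directly, on the assumption that the outer `np.unique(demo.T, axis=1)` call is not needed because `np.unique` on each row already removes the duplicates.

```python
import numpy as np

def unique_per_column(a, fill=np.nan):
    """Sorted unique values of each column of `a`, one output row per column.

    Hypothetical helper: shorter rows are padded on the right with `fill`.
    """
    uniques = [np.unique(col) for col in a.T]      # one 1D array per column
    width = max(len(u) for u in uniques)           # longest list of uniques
    out = np.full((len(uniques), width), fill, dtype=float)
    for i, u in enumerate(uniques):
        out[i, :len(u)] = u                        # left-align, keep padding
    return out

demo = np.array([[1, 2, 3],
                 [1, 5, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [4, 2, 3],
                 [4, 2, 12],
                 [10, 11, 13]])

padded = unique_per_column(demo)
print(padded)
```

The output array is always float here, since an integer array cannot hold nan.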

If you want to avoid the loop, you could do:

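demo = demo.astype(np.float32)  # NaN only exists for floating-point dtypes

sort = np.sort(demo, axis=0)           # sort each column independently
diff = np.diff(sort, axis=0)           # a zero difference marks a repeated value
np.place(sort[1:], diff == 0, np.nan)  # overwrite repeats with NaN (sort[1:] is a view)
sort.sort(axis=0)                      # NaNs are sorted to the end of each column
# Count the non-NaN entries per column. np.argmax only locates the first NaN,
# so it would drop the last row whenever a column has no duplicates at all.
edge = (~np.isnan(sort)).sum(axis=0).max()
result = sort[:edge]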

print(result.T)

Output:

array([[ 1.,  4.,  7., 10., nan],
       [ 2.,  5.,  8., 11., nan],
       [ 3.,  6.,  9., 12., 13.]], dtype=float32)
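
To make the sort-and-mark idea easier to follow, here is a sketch that traces it on a single column (the first column of `demo`). The variable names are illustrative only, and `float64` is used instead of `float32` as an assumption that exact integer values matter.

```python
import numpy as np

# First column of demo, stored as float so that it can hold NaN.
col = np.array([1, 1, 4, 7, 4, 4, 10], dtype=np.float64)

ordered = np.sort(col)                     # [1, 1, 4, 4, 4, 7, 10]
is_repeat = np.diff(ordered) == 0          # True where a value equals its predecessor
np.place(ordered[1:], is_repeat, np.nan)   # [1, nan, 4, nan, nan, 7, 10]

compacted = np.sort(ordered)               # NaNs move to the end: [1, 4, 7, 10, nan, nan, nan]
n_unique = int(np.count_nonzero(~np.isnan(compacted)))

print(compacted[:n_unique])                # the unique values of the column
```

Doing the same with `axis=0` on the whole array handles every column at once; the longest column then decides how many rows to keep.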

Not sure if this is any faster than the solution given by Jérôme.

EDIT

A slightly better solution:

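demo = demo.astype(np.float32)  # NaN only exists for floating-point dtypes

sort = np.sort(demo, axis=0)
mask = np.full(sort.shape, False, dtype=bool)
np.equal(sort[1:], sort[:-1], out=mask[1:])  # True where a value repeats the one above it
np.place(sort, mask, np.nan)                 # overwrite repeats with NaN
edge = (~mask).sum(0).max()                  # largest number of unique values in any column
result = np.sort(sort, axis=0)[:edge]        # NaNs sort to the end, then trim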

print(result.T)

Output:

array([[ 1.,  4.,  7., 10., nan],
       [ 2.,  5.,  8., 11., nan],
       [ 3.,  6.,  9., 12., 13.]], dtype=float32)
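
For reuse, the loop-free approach could be wrapped roughly as follows. This is only a sketch: the function name `unique_per_column_vectorized` is invented here, it assumes the input contains no NaN to begin with (NaN never compares equal to itself, so it would not be deduplicated), and it converts to `float64` rather than `float32` because `float32` represents integers exactly only up to 2**24.

```python
import numpy as np

def unique_per_column_vectorized(a):
    """Loop-free sketch: unique values per column, NaN-padded, one row per column."""
    s = np.sort(np.asarray(a, dtype=np.float64), axis=0)  # astype-like copy, sorted per column
    dup = np.zeros(s.shape, dtype=bool)
    dup[1:] = s[1:] == s[:-1]              # mark values equal to the one above
    s[dup] = np.nan                        # blank out the repeats
    s.sort(axis=0)                         # NaNs are placed last in each column
    width = (~dup).sum(axis=0).max()       # most unique values in any column
    return s[:width].T

demo = np.array([[1, 2, 3],
                 [1, 5, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [4, 2, 3],
                 [4, 2, 12],
                 [10, 11, 13]])

vectorized = unique_per_column_vectorized(demo)
print(vectorized)

# A column without any duplicates keeps all of its values.
no_dups = unique_per_column_vectorized(np.array([[1, 1], [2, 1], [3, 1]]))
print(no_dups)
```

Counting the non-duplicate entries, rather than searching for the first NaN, keeps the result correct when some column has no repeated values.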
