Turn a numpy array of one hot row vectors into a column vector of indices

So what is a concise and efficient way to convert a numpy array like:

[[0, 0, 1],
 [1, 0, 0],
 [0, 1, 0]]

into a column like:

[[2],
 [0],
 [1]]

where the number in each row is the index of the "1" in the corresponding row of the original array of one-hot vectors?

I was thinking of looping through the rows and building a list of the index of the 1 in each row, but I wonder if there is a more efficient way to do it. Thank you for any suggestions.

Update: For a faster solution, see Divakar's answer.


You can use the nonzero() method of the numpy array. The second element of the tuple that it returns is what you want. For example,

In [56]: x
Out[56]: 
array([[0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [1, 0, 0, 0]])

In [57]: x.nonzero()[1]
Out[57]: array([2, 2, 3, 3, 0])

According to the docstring of numpy.nonzero(), "the values in a are always tested and returned in row-major, C-style order", so as long as you have exactly one 1 in each row, x.nonzero()[1] will give the position of the 1 in each row, starting from the first row. (And x.nonzero()[0] will contain the same values as range(x.shape[0]).)

To get the result as an array with shape (n, 1), you can use the reshape() method:

In [59]: x.nonzero()[1].reshape(-1, 1)
Out[59]: 
array([[2],
       [2],
       [3],
       [3],
       [0]])

or you can index with [:, np.newaxis]:

In [60]: x.nonzero()[1][:, np.newaxis]
Out[60]: 
array([[2],
       [2],
       [3],
       [3],
       [0]])
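
As a minimal, self-contained sketch of the nonzero() approach, the snippet below rebuilds the sample array from this answer and adds a guard for the "exactly one nonzero entry per row" assumption. The guard and the variable names (rows, cols, indices) are illustrative additions, not part of the original answer.

```python
import numpy as np

# Sample one-hot array, copied from the example above.
x = np.array([[0, 0, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [1, 0, 0, 0]])

# nonzero() returns one index array per dimension, in row-major order.
rows, cols = x.nonzero()

# Illustrative guard: the column indices only line up with the rows
# if every row contains exactly one nonzero entry.
if not np.array_equal(rows, np.arange(x.shape[0])):
    raise ValueError("expected exactly one nonzero entry per row")

# Reshape the flat index array into a column of shape (n, 1).
indices = cols.reshape(-1, 1)
print(indices)
```

If a row were all zeros or had more than one nonzero entry, rows would no longer match np.arange(x.shape[0]) and the guard would raise instead of silently returning misaligned indices.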

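We are working with a one-hot encoded array, which guarantees exactly one 1 per row. So if we just look for the index of the first non-zero entry in each row, we get the desired result. Since 1 is the maximum value in each row and np.argmax returns the index of the first occurrence of the maximum, we can use np.argmax along each row, like so: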

a.argmax(axis=1)

If you want a 2D array as output, simply add a singleton dimension at the end:

a.argmax(axis=1)[:,None]
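
Here is a small sketch of the argmax approach applied to the array from the question, followed by a caveat worth keeping in mind: for a row that is all zeros, argmax still returns 0, so malformed input is not detected. The second array b and the row-sum check are hypothetical additions for illustration only.

```python
import numpy as np

# One-hot array from the question.
a = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])

# Index of the 1 in each row, as a column vector of shape (n, 1).
col = a.argmax(axis=1)[:, None]
print(col)

# Caveat: argmax cannot tell an all-zero row from a row with its 1 in column 0.
b = np.array([[0, 0, 0],   # not a valid one-hot row
              [0, 1, 0]])
print(b.argmax(axis=1))    # the all-zero row still yields index 0

# Illustrative validity check for arrays containing only 0s and 1s:
# a one-hot row sums to exactly 1.
valid = b.sum(axis=1) == 1
print(valid)
```

When the input is known to be a valid one-hot array, the check can be skipped and a.argmax(axis=1) used directly.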

Runtime test:

In [20]: # Let's create a sample one-hot encoded array
    ...: a = np.zeros((1000,1000),dtype=int)
    ...: idx = np.random.randint(0,1000,1000)
    ...: a[np.arange(1000),idx] = 1
    ...: 

In [21]: %timeit a.nonzero()[1] # @Warren Weckesser's soln
100 loops, best of 3: 9.03 ms per loop

In [22]: %timeit a.argmax(axis=1)
1000 loops, best of 3: 1.15 ms per loop
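
To repeat this comparison outside IPython, a rough sketch using the standard-library timeit module might look like the following. The seed, the repeat count, and the variable names are arbitrary choices made here, and the absolute timings will depend on the machine and the NumPy version.

```python
import timeit

import numpy as np

# Build a 1000 x 1000 one-hot array with a random 1 in each row.
rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n = 1000
a = np.zeros((n, n), dtype=int)
idx = rng.integers(0, n, size=n)
a[np.arange(n), idx] = 1

# Both approaches should recover the column indices that were set.
res_nonzero = a.nonzero()[1]
res_argmax = a.argmax(axis=1)

# Time each approach over the same number of calls.
t_nonzero = timeit.timeit(lambda: a.nonzero()[1], number=100)
t_argmax = timeit.timeit(lambda: a.argmax(axis=1), number=100)

print(f"nonzero: {t_nonzero / 100 * 1e3:.3f} ms per call")
print(f"argmax:  {t_argmax / 100 * 1e3:.3f} ms per call")
```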
