
Multi-dimensional array indexing using a single dimensional array in Python

I have a two-dimensional array X of size (500, 10) and a one-dimensional index array Y of size 500. Each entry of Y is the column index of the correct value in the corresponding row of X. For example, Y[0] = 2 means that column 2 of row 0 of X holds the correct value; similarly, Y[3] = 4 means that row 3, column 4 of X holds the correct value.

I want to get all the correct values from X using the index array Y without any loops, i.e. using vectorization. In this case the output should have shape (500, 1). But X[:, Y] gives an output of shape (500, 500). Can someone help me index X correctly using Y?

Thank you all for the help.

Another option is multidimensional list-of-locations indexing:

import numpy as np

ncol = 10  # 10 in your case
nrow = 500  # 500 in your case
# just creating some test data:
x = np.arange(ncol*nrow).reshape(nrow,ncol)
y = (ncol * np.random.random_sample((nrow, 1))).astype(int)

print(x)
print(y)
print(x[np.arange(nrow),y.T].T)

The syntax is explained here. You basically need an array of indices for each dimension. For the first dimension this is simply [0, ..., 499] in your case, and for the second dimension it is your y array. We need to transpose it (.T) because the test y has shape (500, 1): without the transpose it would broadcast against the 500 row indices to a (500, 500) result, whereas the (1, 500) transpose pairs each row index with exactly one column index. The second transposition is not really needed, but gives you the (500, 1) shape you want.
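
To make the shape bookkeeping concrete, here is a small sketch on a 4x3 array instead of 500x10. The values and the names rows, cols, picked and result are made up for illustration; only the indexing pattern comes from the answer above.

```python
import numpy as np

nrow, ncol = 4, 3
x = np.arange(nrow * ncol).reshape(nrow, ncol)  # rows: [0 1 2], [3 4 5], [6 7 8], [9 10 11]
y = np.array([[2], [0], [1], [2]])              # one column index per row, shape (nrow, 1)

rows = np.arange(nrow)   # shape (nrow,)
cols = y.T               # shape (1, nrow)

picked = x[rows, cols]   # the index arrays broadcast to shape (1, nrow)
result = picked.T        # shape (nrow, 1)

print(picked.shape)      # (1, 4)
print(result.ravel())    # [ 2  3  7 11]
```

If y is already one-dimensional, as in the question, neither transpose is needed and x[rows, y] returns a flat array of length nrow.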

EDIT:

The question of performance came up and I tried the three methods mentioned so far. You'll need line_profiler to run the following with

kernprof -l -v tmp.py

where tmp.py is:

import numpy as np

@profile
def calc(x,y):
    z = np.arange(nrow)
    a = x[z,y.T].T  # mine, with the suggested speed up
    b = x[:,y].diagonal().T  # Christoph Terasa
    c = np.array([i[j] for i, j in zip(x, y)])  # tobias_k

    return (a,b,c)

ncol = 5  # 10 in your case
nrow = 10  # 500 in your case

x = np.arange(ncol*nrow).reshape(nrow,ncol)
y = (ncol * np.random.random_sample((nrow, 1))).astype(int)

a, b, c = calc(x,y)
print(a==b)
print(b==c)

The output with my Python 2.7.6 (the 501 hits on line 8 suggest this run used nrow = 500 rather than the 10 shown in the script):

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    3                                           @profile
    4                                           def calc(x,y):
    5         1            4      4.0      0.1      z = np.arange(nrow)
    6         1           35     35.0      0.8      a = x[z,y.T].T
    7         1         3409   3409.0     76.7      b = x[:,y].diagonal().T
    8       501          995      2.0     22.4      c = np.array([i[j] for i, j in zip(x, y)])
    9                                           
   10         1            1      1.0      0.0      return (a,b,c)

The Time and % Time columns are the relevant ones. I don't know how to profile memory consumption; someone else would have to do that. For now it looks like my solution is the fastest for the requested dimensions.
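
If line_profiler is not available, a rough comparison can also be made with the standard-library timeit module. This is only a sketch: the function names are mine, the random data differs from the run above, and the absolute numbers will depend on the machine and the NumPy version, so no timings are claimed here.

```python
import timeit
import numpy as np

nrow, ncol = 500, 10
rng = np.random.default_rng(0)
x = np.arange(ncol * nrow).reshape(nrow, ncol)
y = rng.integers(0, ncol, size=(nrow, 1))   # shape (nrow, 1), as in the script above

def by_row_index():
    # row-index array plus column-index array
    return x[np.arange(nrow), y.T].T

def by_diagonal():
    # builds a (nrow, nrow, 1) intermediate, then keeps the diagonal
    return x[:, y].diagonal().T

def by_loop():
    # plain Python loop over the rows
    return np.array([i[j] for i, j in zip(x, y)])

# all three approaches must agree before timing them
assert np.array_equal(by_row_index(), by_diagonal())
assert np.array_equal(by_diagonal(), by_loop())

for f in (by_row_index, by_diagonal, by_loop):
    t = timeit.timeit(f, number=100)
    print(f"{f.__name__}: {t / 100 * 1e6:.1f} us per call")
```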

While not really intuitive from a syntactic perspective

X[:,Y].diagonal()[0]

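will give you the values you're looking for. The fancy indexing selects, for each row of X, the values at all column indices in Y, and diagonal keeps only those where i == j, i.e. where the row number matches the position in Y. The [0] at the end just flattens the 2-D array that comes back when Y has shape (500, 1); if Y is one-dimensional, as in the question, X[:,Y].diagonal() is already the one-dimensional result and the [0] should be dropped, since it would return only the first value.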
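
A small sketch of what happens in between, using a made-up 6x4 array and a one-dimensional Y. The variable names are hypothetical. It also shows why this approach needs far more memory than indexing with an explicit row array: the intermediate has one column per entry of Y.

```python
import numpy as np

X = np.arange(24).reshape(6, 4)
Y = np.array([0, 2, 2, 0, 1, 0])        # 1-D index array, one column index per row

full = X[:, Y]                          # shape (6, 6): every row crossed with every index
via_diagonal = full.diagonal()          # shape (6,): keeps only entries where i == j
direct = X[np.arange(6), Y]             # same values without the (6, 6) intermediate

print(full.shape)                       # (6, 6)
print(via_diagonal)                     # [ 0  6 10 12 17 20]
print(np.array_equal(via_diagonal, direct))  # True
```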

You need a helper vector R to index the rows:

In [50]: X = np.arange(24).reshape((6,4))

In [51]: Y = np.random.randint(0,4,6)

In [52]: R = np.arange(6)

In [53]: Y
Out[53]: array([0, 2, 2, 0, 1, 0])

In [54]: X[R,Y]
Out[54]: array([ 0,  6, 10, 12, 17, 20])

For your use case:

X_y = X[np.arange(500), Y]

Edit

I forgot to mention: if you want a 2D result, you can obtain it using a dummy index:

X_y_2D = X[np.arange(500), Y, None]
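
For comparison, here is a sketch of three ways to get the (n, 1) column on a small made-up array. The first is the dummy-index form from this answer. The other two are alternatives I am adding, not part of the original answer: np.newaxis applied after indexing, and np.take_along_axis, which requires NumPy 1.15 or newer.

```python
import numpy as np

X = np.arange(24).reshape(6, 4)
Y = np.array([0, 2, 2, 0, 1, 0])

# dummy index: None appends a length-1 axis to the indexed result
with_dummy = X[np.arange(6), Y, None]

# index first, then add the axis explicitly
with_newaxis = X[np.arange(6), Y][:, np.newaxis]

# take_along_axis wants indices with the same number of dimensions as X
with_take = np.take_along_axis(X, Y[:, None], axis=1)

print(with_dummy.shape)      # (6, 1)
print(with_dummy.ravel())    # [ 0  6 10 12 17 20]
```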
