
Put indices and data into a dict?

import numpy as np

rows, cols = 10000, 1000
data = np.random.rand(rows, cols)
vec = np.random.rand(1, cols)
d = ((data - vec)**2).sum(axis=1)  # squared distances to vec
ndx = d.argsort()

then I can take the first k:

ndx[:k]

But if I have

d1 = ((data1 - vec)**2).sum(axis=1)  # compute distances
ndx1 = d1.argsort()
d2 = ((data2 - vec)**2).sum(axis=1)  # compute distances
ndx2 = d2.argsort()

I need to concatenate the values and indices of ndx1 and ndx2 and sort by value (take the k nearest vectors out of the 2k candidates).

How can this be done? Do I need to use a dict?
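One possible sketch (my own filling-in, not from the original post; array sizes and the seed are made up): keep the k best candidates from each array, offset the indices of the second array by the length of the first, concatenate values and indices, and argsort the values once more. No dict is needed.

```python
import numpy as np

rows1, rows2, cols, k = 50, 40, 7, 10
rng = np.random.default_rng(0)
data1 = rng.random((rows1, cols))
data2 = rng.random((rows2, cols))
vec = rng.random((1, cols))

d1 = ((data1 - vec) ** 2).sum(axis=1)
d2 = ((data2 - vec) ** 2).sum(axis=1)

ndx1 = d1.argsort()[:k]          # k best candidates from data1
ndx2 = d2.argsort()[:k]          # k best candidates from data2

# concatenate values and (offset) indices, then keep the k best overall
cand_vals = np.concatenate([d1[ndx1], d2[ndx2]])
cand_idx = np.concatenate([ndx1, ndx2 + rows1])  # rows of the virtual stack

order = cand_vals.argsort()[:k]
best_idx = cand_idx[order]       # rows of np.vstack((data1, data2))
best_vals = cand_vals[order]
```

This works because the global k nearest points must each lie in the per-array top k, so nothing is lost by discarding the rest before merging.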

UPDATE:

I can't stack data1 and data2 because the result doesn't fit in RAM. I read my big array in chunks using numpy.memmap (1 chunk = data).

For example, this works, but only for small sizes, so I need to process the data iteratively in chunks.

import numpy as np
import time

rows = 10000
cols = 1000
batches = 5
k = 10
fp = np.memmap('C:/memmap_test', dtype='float32', mode='w+', shape=(rows*batches, cols))

vec = np.random.rand(1, cols)

t0 = time.time()
d = ((fp - vec)**2).sum(axis=1)  # compute distances
ndx = d.argsort()
print(time.time() - t0)

print(ndx[:k])

This approach doesn't work:

ValueError: objects are not aligned

t0 = time.time()
d = np.empty((rows*batches,))
for i in range(batches):
    d[i*rows:(i+1)*rows] = (np.einsum('ij,ij->i', fp[i*rows:(i+1)*rows], fp[i*rows:(i+1)*rows]) + np.dot(vec, vec) -
             2 * np.dot(fp[i*rows:(i+1)*rows], vec))
print (time.time()-t0)
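The error most likely comes from the shape of vec: with vec of shape (1, cols), np.dot(fp[...], vec) tries to multiply a (rows, cols) matrix by a (1, cols) one, whose inner dimensions don't match. A sketch of a fix (my own guess at the intended shapes, using a plain in-memory array as a stand-in for the memmap) is to flatten vec to 1-D first:

```python
import numpy as np

rows, cols, batches = 100, 20, 3
rng = np.random.default_rng(1)
fp = rng.random((rows * batches, cols)).astype('float32')  # stand-in for the memmap
vec = rng.random((1, cols))

v = vec.ravel()                      # shape (cols,), so the dot products align
d = np.empty((rows * batches,))
for i in range(batches):
    chunk = fp[i * rows:(i + 1) * rows]
    # per-row expansion of (a - b)**2 summed: a.a + b.b - 2 a.b
    d[i * rows:(i + 1) * rows] = (np.einsum('ij,ij->i', chunk, chunk)
                                  + np.dot(v, v) - 2 * np.dot(chunk, v))
```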

This seems to work:

t0 = time.time()
d = np.empty((rows*batches,))
for i in range(batches):
    d[i*rows:(i+1)*rows] = ((fp[i*rows:(i+1)*rows]-vec)**2).sum(axis=1)
ndx = d.argsort()
print(time.time() - t0)
print(ndx[:k])

I hope I have understood the question properly.

If data1 and data2 have at least one dimension in common, you can stack d1 and d2 (vertically or horizontally) and then argsort the stacked array.

This way the ordering is done over all the elements of the two arrays, but you no longer know which original array each element came from.

I don't think a dict is the way to go, not least because dicts are not ordered.

Edit: the memory problem.

An approach that comes to mind goes more or less like this:

# read the first batch and compute distances,
# then save the first k indices and values
masterindex = d.argsort()[:k]
mastervalue = d[masterindex]

for i in (all the other batches):
    # read the following batch and compute distances
    tempindex = d.argsort()[:k]
    tempvalue = d[tempindex]
    # turn tempindex into absolute positions with respect to the whole file
    tempindex += n_rows_already_read  # by previous batches

    # stack the index and value arrays
    masterindex = np.concatenate([masterindex, tempindex])
    mastervalue = np.concatenate([mastervalue, tempvalue])
    # argsort the concatenated values, then save the new sorted
    # values and indices
    indx = mastervalue.argsort()[:k]
    masterindex = masterindex[indx]
    mastervalue = mastervalue[indx]

I haven't tested the code, so it could be buggy, but I hope it's clear enough and does what you want.
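Fleshing the pseudocode above out into something runnable might look like this (my own filling-in of the idea; the sizes and seed are made up, and a plain array stands in for the memmapped file):

```python
import numpy as np

rows, cols, batches, k = 200, 10, 5, 15
rng = np.random.default_rng(2)
data = rng.random((rows * batches, cols))  # stand-in for the memmapped file
vec = rng.random((1, cols))

masterindex = np.empty(0, dtype=np.int64)
mastervalue = np.empty(0)

for i in range(batches):
    batch = data[i * rows:(i + 1) * rows]   # "read" one chunk
    d = ((batch - vec) ** 2).sum(axis=1)
    tempindex = d.argsort()[:k]
    tempvalue = d[tempindex]
    tempindex = tempindex + i * rows        # absolute row numbers in the file

    # merge with the running k best and keep only k entries
    masterindex = np.concatenate([masterindex, tempindex])
    mastervalue = np.concatenate([mastervalue, tempvalue])
    indx = mastervalue.argsort()[:k]
    masterindex = masterindex[indx]
    mastervalue = mastervalue[indx]
```

Only 2k values and indices are ever held alongside one chunk, so memory use stays bounded regardless of how many batches the file contains.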

Here is our solution:

import numpy as np

rows1,rows2,cols = 1000,600,7
data1 = np.random.rand(rows1,cols)
data2 = np.random.rand(rows2,cols)

data = np.vstack((data1,data2))     #stacking data

vec = np.random.rand(1,cols)
d = ((data-vec)**2).sum(axis=1)     #compute distances
ndx = d.argsort()

k = 30

sdx = ndx[:k]                       #selected k indices of nearest points

f = (sdx<rows1)                     #masking

idx1 = sdx[f]                       #indices from data1
idx2 = sdx[~f]-rows1                #indices from data2

If you have memory issues you could do something like:

data1 = np.random.rand(rows1, cols)
data2 = np.random.rand(rows2, cols)
vec = np.random.rand(cols)

d = np.empty((rows1 + rows2,))
d[:rows1] = (np.einsum('ij,ij->i', data1, data1) + np.dot(vec, vec) -
             2 * np.dot(data1, vec))
d[rows1:] = (np.einsum('ij,ij->i', data2, data2) + np.dot(vec, vec) -
             2 * np.dot(data2, vec))

You need to know the sizes of data1 and data2 beforehand to allocate the d array, but you don't need to keep both arrays in memory simultaneously: you can delete data1 once you have filled the first part of d, before loading data2. Computing the distance as (a-b)**2 = a*a + b*b - 2*a*b, as above, is more memory-efficient than your approach, especially if cols is large.

You can now sort the array d and map the result back to the rows of your two arrays, e.g. as in @Developer's answer.
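Putting the two pieces together might look like this (a sketch combining this answer's distance computation with the masking step; the sizes and seed are invented for illustration):

```python
import numpy as np

rows1, rows2, cols, k = 80, 60, 5, 10
rng = np.random.default_rng(3)
data1 = rng.random((rows1, cols))
data2 = rng.random((rows2, cols))
vec = rng.random(cols)              # 1-D, so the dot products align

# fill d one array at a time, without ever stacking data1 and data2
d = np.empty((rows1 + rows2,))
d[:rows1] = (np.einsum('ij,ij->i', data1, data1) + np.dot(vec, vec)
             - 2 * np.dot(data1, vec))
d[rows1:] = (np.einsum('ij,ij->i', data2, data2) + np.dot(vec, vec)
             - 2 * np.dot(data2, vec))

sdx = d.argsort()[:k]               # k nearest over both arrays
f = sdx < rows1                     # True -> a row of data1
idx1, idx2 = sdx[f], sdx[~f] - rows1
```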
