I have a large data matrix and I want to calculate its similarity matrix, but due to memory limitations I want to split the calculation into blocks.
Let's assume I have the following (for the example I have taken a smaller matrix):
data1 = data/np.linalg.norm(data,axis=1)[:,None]
(Pdb) data1
array([[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ],
[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ],
[ 0.04777415, 0.00091094, 0.01326067, ..., 0. ,
0. , 0. ],
...,
[ 0. , 0.01503281, 0.00655707, ..., 0. ,
0. , 0. ],
[ 0.00418038, 0.00308079, 0.01893477, ..., 0. ,
0. , 0. ],
[ 0.06883803, 0. , 0.0209448 , ..., 0. ,
0. , 0. ]])
Then I try to do the following:
similarity_matrix[n1:n2,m1:m2] = np.einsum('ik,jk->ij', data1[n1:n2,:], data1[m1:m2,:])
n1, n2, m1, m2 are calculated as follows (df is a DataFrame):
data = df.values
m, k = data.shape
n1 = 0; n2 = m // 2; m1 = n2 + 1; m2 = m  # // so the slice bounds stay integers
But I get the following error:
(Pdb) similarity_matrix[n1:n2,m1:m2] = np.einsum('ik,jk->ij', data1[n1:n2,:], data1[m1:m2,:])
*** NameError: name 'similarity_matrix' is not defined
Didn't you do something like
similarity_matrix = np.empty((N,M),dtype=float)
at the start of your calculations?
You can't index an array, on either side of an assignment, before you create it.
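A minimal sketch of the fix, using a small random matrix in place of the asker's data (the shapes and the `block` split here are illustrative assumptions): preallocate the full output with `np.empty` before assigning into a slice of it.

```python
import numpy as np

# Small stand-in for the real data matrix.
rng = np.random.default_rng(0)
data = rng.random((6, 4))
data1 = data / np.linalg.norm(data, axis=1)[:, None]  # row-normalize

m = data1.shape[0]
n1, n2 = 0, m // 2   # integer division so the slice bounds are ints
m1, m2 = n2, m       # second block starts where the first one ends

# Create the full array first; slicing an undefined name raises NameError.
similarity_matrix = np.empty((m, m), dtype=float)
similarity_matrix[n1:n2, m1:m2] = np.einsum(
    'ik,jk->ij', data1[n1:n2, :], data1[m1:m2, :])
```

The einsum over a block is equivalent to `data1[n1:n2] @ data1[m1:m2].T`, i.e. the cosine similarities between the two row ranges.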
If that full (N, M) matrix is too big for memory, then just assign your einsum result to another variable, and work with that:

partial_matrix = np.einsum...

How you relate that partial_matrix to the virtual similarity_matrix is a different issue.
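One way to relate the partial blocks to the full matrix, as a sketch: back the similarity matrix with a disk file via `np.memmap` and write one row-block at a time. The file path and the `block` size here are assumptions to be tuned to your data.

```python
import numpy as np
import os, tempfile

# Small stand-in data, row-normalized as in the question.
rng = np.random.default_rng(1)
data1 = rng.random((8, 3))
data1 /= np.linalg.norm(data1, axis=1)[:, None]
m = data1.shape[0]

# Disk-backed (m, m) array: only the touched pages live in RAM.
path = os.path.join(tempfile.mkdtemp(), 'sim.dat')
sim = np.memmap(path, dtype='float64', mode='w+', shape=(m, m))

block = 4  # rows per chunk; tune to available memory
for i in range(0, m, block):
    # Similarities of one row-block against all rows.
    partial_matrix = np.einsum('ik,jk->ij', data1[i:i + block, :], data1)
    sim[i:i + block, :] = partial_matrix
sim.flush()
```

Each iteration only holds a `(block, m)` slice in memory, while the memmap plays the role of the "virtual" similarity_matrix.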