How to perform operations on very big torch tensors without splitting them
My Task:
I'm trying to calculate the pairwise distance between every two samples in two big tensors (for k-Nearest-Neighbours). That is, given a tensor test with shape (b1,c,h,w) and a tensor train with shape (b2,c,h,w), I need || test[i]-train[j] || for every i, j (where both test[i] and train[j] have shape (c,h,w), as those are samples in the batch).
The Problem
Both train and test are very big, so I can't fit them into RAM.
My current solution
For a start, I did not construct these tensors in one go. As I build them, I split the data tensor and save the pieces separately to disk, so I end up with files {Test\test_1,...,Test\test_n} and {Train\train_1,...,Train\train_m}. Then, in a nested for loop, I load every Test\test_i and Train\train_j, calculate the current distance, and save it.
This semi-pseudo-code might explain:
import torch

# n, m: number of saved chunks.
# Forward slashes are used because '\t' inside 'Test\test_0' is read as a tab.
test_files = [f'Test/test_{i}' for i in range(n)]
train_files = [f'Train/train_{j}' for j in range(m)]
dist = lambda t1, t2: torch.cdist(t1.flatten(1), t2.flatten(1))

all_distances = []
for test_file in test_files:
    test_i = torch.load(test_file)  # one chunk, shape (b, c, h, w)
    dist_of_i_from_all_j = []
    for train_file in train_files:
        train_j = torch.load(train_file)  # one chunk, shape (b', c, h, w)
        dist_of_i_from_all_j.append(dist(test_i, train_j))
    # concatenate along the train axis: shape (b, total number of train samples)
    all_distances.append(torch.cat(dist_of_i_from_all_j, dim=1))
# and now I can take the k smallest from all_distances
What I thought might work
I came across the FAISS repository, in which they explain that this process can be sped up (maybe?) using their solutions, though I'm not quite sure how. Regardless, any approach would help!
Did you check the FAISS documentation?
If what you need is the L2 norm (torch.cdist uses p=2 as its default parameter), then it is quite straightforward. The code below is an adaptation of the FAISS docs to your example:
import faiss
import numpy as np

d = 64        # dimension of each (flattened) sample
nb = 100000   # database size (train samples)
nq = 10000    # number of queries (test samples)
np.random.seed(1234)  # make reproducible
x_train = np.random.random((nb, d)).astype('float32')
x_train[:, 0] += np.arange(nb) / 1000.
x_test = np.random.random((nq, d)).astype('float32')
x_test[:, 0] += np.arange(nq) / 1000.

index = faiss.IndexFlatL2(d)  # build the index (exact, brute-force L2)
print(index.is_trained)
index.add(x_train)            # add the train vectors to the index
print(index.ntotal)

k = 100                         # take the 100 closest neighbors
D, I = index.search(x_test, k)  # actual search; D holds squared L2 distances
print(I[:5])                    # neighbors of the first 5 queries
print(I[-5:])                   # neighbors of the last 5 queries
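If FAISS is not an option, the same exact search can be sketched in plain PyTorch by streaming the chunks and keeping only the k best candidates per test sample, so the full distance matrix never has to be held at once. This is only a minimal sketch under assumptions: the function name streaming_knn is made up for illustration, and the chunks are small in-memory tensors here, whereas in the question they would come from torch.load on the saved files.

```python
import torch

def streaming_knn(test_chunks, train_chunks, k):
    """Exact k-NN over chunked data (hypothetical helper, not a library API).

    Only a running (rows, k) buffer of best candidates is kept per test chunk.
    """
    all_d, all_i = [], []
    for test in test_chunks:
        q = test.flatten(1)                                   # (b, c*h*w)
        best_d = q.new_empty((q.shape[0], 0))                 # running distances
        best_i = torch.empty((q.shape[0], 0), dtype=torch.long)  # running indices
        offset = 0                                            # global train index
        for train in train_chunks:
            x = train.flatten(1)                              # (b', c*h*w)
            d = torch.cdist(q, x)                             # (b, b')
            idx = torch.arange(offset, offset + x.shape[0]).expand(q.shape[0], -1)
            # merge the old candidates with this chunk, then keep the k smallest
            cand_d = torch.cat((best_d, d), dim=1)
            cand_i = torch.cat((best_i, idx), dim=1)
            kk = min(k, cand_d.shape[1])
            best_d, pos = cand_d.topk(kk, dim=1, largest=False)
            best_i = cand_i.gather(1, pos)
            offset += x.shape[0]
        all_d.append(best_d)
        all_i.append(best_i)
    return torch.cat(all_d), torch.cat(all_i)

# Toy data standing in for the saved files (sizes are arbitrary).
torch.manual_seed(0)
train_chunks = [torch.rand(5, 3, 4, 4) for _ in range(3)]
test_chunks = [torch.rand(2, 3, 4, 4) for _ in range(2)]
dists, idxs = streaming_knn(test_chunks, train_chunks, k=3)
print(dists.shape, idxs.shape)
```

Memory use is bounded by one test chunk, one train chunk, and the (rows, k) buffers, at the price of reloading every train chunk once per test chunk.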
Consequently, I chose to implement some version of the Earth Mover's Distance (EMD), as was suggested in the following ai.StackExchange post. Let me summarize the approach:
Given the task as described in "My Task" above, I defined
def cumsum_3d(test, train):
    for i in [-1, -2, -3]:
        test = torch.cumsum(test, i)
        train = torch.cumsum(train, i)
    return test, train
Then, given the tensors test and train:
test, train = cumsum_3d(test, train)
dist = torch.cdist(test.flatten(1), train.flatten(1))
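To make the two steps above concrete, here is a small self-contained run on random tensors. The sizes (4 test samples, 6 train samples, shape (3,8,8)) and the choice k=2 are arbitrary assumptions made only for illustration; the function is the same cumsum_3d as above.

```python
import torch

def cumsum_3d(test, train):
    # cumulative sums over w, h and c (the last three dimensions)
    for dim in (-1, -2, -3):
        test = torch.cumsum(test, dim)
        train = torch.cumsum(train, dim)
    return test, train

torch.manual_seed(0)
test = torch.rand(4, 3, 8, 8)    # (b1, c, h, w), toy sizes
train = torch.rand(6, 3, 8, 8)   # (b2, c, h, w), toy sizes

test_cs, train_cs = cumsum_3d(test, train)
dist = torch.cdist(test_cs.flatten(1), train_cs.flatten(1))  # (b1, b2)

# indices of the 2 closest train samples for each test sample
nearest = dist.topk(k=2, dim=1, largest=False).indices
print(dist.shape, nearest.shape)
```

The cumulative sums keep the original shape, so the chunked loading scheme from the question applies unchanged: each chunk can be transformed on its own before the distances are computed.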
For future viewers, bear in mind that:
I did not use FAISS, because it does not currently support Windows, but most importantly because it does not support (as far as I know) this version of EMD or any other distance between multidimensional tensors (shape (c,h,w), as in my example).
To account for the RAM problem I used Google Colab and sliced my data into more files.
[...] avgpool) as my activations, it would have been fine not to use the EMD, as the output right after the avgpool has shape (512,).