How to perform operations on very big torch tensors without splitting them

Question

My Task

I'm trying to calculate the pairwise distance between every two samples in two big tensors (for k-Nearest-Neighbours). That is, given a tensor test with shape (b1,c,h,w) and a tensor train with shape (b2,c,h,w), I need || test[i]-train[j] || for every i, j (where both test[i] and train[j] have shape (c,h,w), as those are samples in the batch).
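For reference, when both tensors do fit in memory the whole computation is a single call. This is a minimal sketch with toy sizes chosen only for illustration (the real b1, b2, c, h, w are not given in the question):

```python
import torch

torch.manual_seed(0)
test = torch.rand(3, 2, 4, 4)    # (b1, c, h, w), toy sizes
train = torch.rand(5, 2, 4, 4)   # (b2, c, h, w), toy sizes

# Flatten every sample to a vector of length c*h*w,
# then compute all pairwise L2 distances at once.
distances = torch.cdist(test.flatten(1), train.flatten(1))  # shape (b1, b2)

# Spot check one entry against the definition || test[i] - train[j] ||.
i, j = 1, 3
direct = (test[i] - train[j]).flatten().norm()
print(distances.shape)
print(torch.isclose(distances[i, j], direct, atol=1e-5))
```

The result has one row per test sample and one column per train sample, which is the matrix the rest of the question tries to build piece by piece.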

The Problem

Both train and test are very big, so I can't fit them into RAM.

My current solution

For a start, I did not construct these tensors in one go. As I build them, I split the data tensor and save the pieces separately to disk, so I end up with files {Test\test_1,...,Test\test_n} and {Train\train_1,...,Train\train_m}. Then, in a nested for loop, I load every Test\test_i and Train\train_j, calculate the current distance, and save it.

This semi-pseudo-code might explain:

import torch

# n and m are the numbers of saved test and train files
# raw f-strings, so that "\t" is not read as a tab character
test_files = [rf'Test\test_{i}' for i in range(n)]
train_files = [rf'Train\train_{j}' for j in range(m)]
dist = lambda t1, t2: torch.cdist(t1.flatten(1), t2.flatten(1))
all_distances = []
for test_path in test_files:
    test_i = torch.load(test_path)  # one chunk of test samples, each of shape (c,h,w)
    blocks = []
    for train_path in train_files:
        train_j = torch.load(train_path)  # one chunk of train samples, each of shape (c,h,w)
        blocks.append(dist(test_i, train_j))  # distances of this test chunk to this train chunk
    # one row per test sample, one column per train sample
    all_distances.append(torch.cat(blocks, dim=1))
# and now I can take the k-smallest from all_distances
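The loop above still keeps every distance, which grows as b1 × b2. Since only the k smallest per test sample are needed, one option is to keep a running top-k while streaming over the train chunks. The following is only a sketch of that idea: knn_over_chunks is a hypothetical helper, and it takes an iterable of already loaded chunks (in the setting above, each one would come from torch.load).

```python
import torch

def knn_over_chunks(test_chunk, train_chunks, k):
    """Return the k smallest distances and the matching global train indices
    for every sample in test_chunk, looking at one train chunk at a time."""
    q = test_chunk.flatten(1)
    best_d = q.new_empty((q.shape[0], 0))
    best_i = torch.empty((q.shape[0], 0), dtype=torch.long)
    offset = 0  # global index of the first sample in the current train chunk
    for train_chunk in train_chunks:
        d = torch.cdist(q, train_chunk.flatten(1))            # (b1_i, b2_j)
        idx = torch.arange(offset, offset + d.shape[1]).expand_as(d)
        offset += d.shape[1]
        # Merge the running best with the new block, then keep the k smallest.
        cand_d = torch.cat((best_d, d), dim=1)
        cand_i = torch.cat((best_i, idx), dim=1)
        kk = min(k, cand_d.shape[1])
        best_d, pos = torch.topk(cand_d, kk, dim=1, largest=False)
        best_i = torch.gather(cand_i, 1, pos)
    return best_d, best_i

# Toy data standing in for the files on disk.
torch.manual_seed(0)
test = torch.rand(4, 2, 3, 3)
train = torch.rand(10, 2, 3, 3)
knn_d, knn_i = knn_over_chunks(test, train.split(3), k=2)
print(knn_d.shape, knn_i.shape)
```

With this layout, memory per test chunk is bounded by one distance block plus a (b1_i, k) buffer, regardless of how many train files there are.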

What I thought might work

I came across the FAISS repository, in which they explain that this process can be sped up (maybe?) using their solutions, though I'm not quite sure how. Regardless, any approach would help!

Answer 1

Did you check the FAISS documentation?

If what you need is the L2 norm (torch.cdist uses p=2 as its default parameter), then it is quite straightforward. The code below is an adaptation of the FAISS docs to your example:

import faiss
import numpy as np

d = 64                           # dimension (c*h*w after flattening each sample)
nb = 100000                      # database size (train samples)
nq = 10000                       # nb of queries (test samples)
np.random.seed(1234)             # make reproducible
x_train = np.random.random((nb, d)).astype('float32')
x_train[:, 0] += np.arange(nb) / 1000.
x_test = np.random.random((nq, d)).astype('float32')
x_test[:, 0] += np.arange(nq) / 1000.

index = faiss.IndexFlatL2(d)     # build the index
print(index.is_trained)
index.add(x_train)               # add the train vectors to the index
print(index.ntotal)

k = 100                          # take the 100 closest neighbors
D, I = index.search(x_test, k)   # actual search; D holds squared L2 distances
print(I[:5])                     # neighbors of the first 5 queries
print(I[-5:])                    # neighbors of the last 5 queries

Answer 2

Consequently, I chose to implement some version of the Earth Mover's Distance (EMD), as was suggested in the following ai.StackExchange post. Let me summarize the approach:

Given the task as described in "My Task" above, I defined

def cumsum_3d(test, train):
    for i in [-1, -2, -3]:
        test = torch.cumsum(test, i)
        train = torch.cumsum(train, i)
    return test, train

then, given the tensors test and train:

test,train = cumsum_3d(test,train)
dist = torch.cdist(test.flatten(1),train.flatten(1))
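To see why taking cumulative sums before the distance can help, here is a one-dimensional toy illustration (not taken from the answer; one_hot is a hypothetical helper). A single unit of "mass" is moved by one position and by three positions. The plain L2 distance cannot tell the two shifts apart, while the distance between cumulative sums grows with the shift, which is the behaviour an EMD-like measure is meant to have.

```python
import torch

def one_hot(pos, n=5):
    """Vector of length n with a single 1 at index pos."""
    v = torch.zeros(n)
    v[pos] = 1.0
    return v

a = one_hot(0)
others = torch.stack((one_hot(1), one_hot(3)))   # shifted by 1 and by 3

# Plain L2: both shifts give the same distance, sqrt(2).
plain = torch.cdist(a[None], others)

# L2 between cumulative sums: 1 for the small shift, sqrt(3) for the large one.
cumulative = torch.cdist(a.cumsum(0)[None], others.cumsum(1))

print(plain)
print(cumulative)
```

cumsum_3d applies the same idea along each of the c, h and w axes in turn.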

For future viewers, bear in mind that:

I did not use FAISS because it does not currently support Windows, but most importantly because it does not support (as far as I know) this version of EMD or any other distance between multidimensional tensors (shape (c,h,w), as in my example). To deal with the RAM problem I used Google Colab and sliced my data into more files.
This implementation was only relevant because I was dealing with shallow activation layers. If I were to use the last layer (avgpool) as my activations, it would have been fine not to use the EMD, as the output right after the avgpool has shape (512,).
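The slicing into more files mentioned in the first note could look roughly like the sketch below. It assumes each batch fits in memory at the moment it is produced; save_in_chunks and load_chunks are hypothetical helper names, and a temporary directory stands in for the Train and Test folders.

```python
import os
import tempfile
import torch

def save_in_chunks(data, folder, prefix, chunk_size):
    """Split data along dim 0 and save each piece to its own file."""
    os.makedirs(folder, exist_ok=True)
    paths = []
    for n, chunk in enumerate(torch.split(data, chunk_size)):
        path = os.path.join(folder, f"{prefix}_{n}.pt")
        # torch.split returns views; clone so that only this slice is
        # written instead of the storage of the whole tensor.
        torch.save(chunk.clone(), path)
        paths.append(path)
    return paths

def load_chunks(paths):
    """Yield one saved chunk at a time, so only one is in memory."""
    for path in paths:
        yield torch.load(path)

data = torch.rand(10, 3, 8, 8)   # toy stand-in for one batch of samples
with tempfile.TemporaryDirectory() as tmp:
    paths = save_in_chunks(data, tmp, "train", chunk_size=4)
    restored = torch.cat(list(load_chunks(paths)))
print(len(paths), restored.shape)
```

A smaller chunk_size lowers peak memory during the distance computation at the cost of more file reads.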

Answer 1: 2 votes, accepted, 2022-06-29 08:14:26
Answer 2: 1 vote, 2022-07-05 14:49:07
