How to perform operations on very big torch tensors without splitting them

My Task

I'm trying to calculate the pairwise distance between every two samples in two big tensors (for k-Nearest-Neighbours). That is, given a tensor test with shape (b1,c,h,w) and a tensor train with shape (b2,c,h,w), I need || test[i] - train[j] || for every i, j (where both test[i] and train[j] have shape (c,h,w), as those are samples in the batch).
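For intuition, if both tensors did fit in RAM, the whole distance matrix could be computed in one call by flattening each sample to a vector. A minimal sketch, with made-up toy sizes:

import torch

b1, b2, c, h, w = 4, 6, 3, 8, 8          # toy sizes, just to illustrate
test = torch.randn(b1, c, h, w)
train = torch.randn(b2, c, h, w)
# flatten each sample to a c*h*w vector, then compute all pairwise L2 distances
dist = torch.cdist(test.flatten(1), train.flatten(1))   # shape (b1, b2)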

The Problem

Both train and test are very big, so I can't fit them into RAM.

My current solution

For a start, I did not construct these tensors in one go. As I build them, I split the data tensor and save the pieces separately to disk, so I end up with files {Test\test_1,...,Test\test_n} and {Train\train_1,...,Train\train_m}. Then, in a nested for loop, I load every Test\test_i and Train\train_j, calculate the current distance, and save it.

This semi-pseudo-code might explain it:

import torch

test_files = [rf'Test\test_{i}' for i in range(n)]    # raw strings so \t is not read as a tab
train_files = [rf'Train\train_{j}' for j in range(m)]
dist = lambda t1, t2: torch.cdist(t1.flatten(1), t2.flatten(1))
all_distances = []
for test_file in test_files:
    test_i = torch.load(test_file)    # shape (b_i, c, h, w): one chunk of the test batch
    dist_of_i_from_all_j = []
    for train_file in train_files:
        train_j = torch.load(train_file)    # shape (b_j, c, h, w): one chunk of the train batch
        dist_of_i_from_all_j.append(dist(test_i, train_j))
    all_distances.append(torch.cat(dist_of_i_from_all_j, dim=1))    # shape (b_i, b2)
# and now I can take the k-smallest from all_distances
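The final selection mentioned in the last comment can be done with torch.topk using largest=False. A small sketch, with an arbitrary example value for k:

all_distances = torch.cat(all_distances, dim=0)   # shape (b1, b2): every test sample vs every train sample
k = 5                                             # example value
vals, idxs = torch.topk(all_distances, k, dim=1, largest=False)
# idxs[i] holds the indices of the k nearest train samples for test sample i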

What I thought might work

I came across the FAISS repository, in which they explain that this process can be sped up (maybe?) using their solutions, though I'm not quite sure how. Regardless, any approach would help!

Did you check the FAISS documentation?

If what you need is the L2 norm (torch.cdist uses p=2 as the default parameter) then it is quite straightforward. The code below is an adaptation of the FAISS docs to your example:

import faiss
import numpy as np
d = 64                           # dimension
nb = 100000                      # database size (train samples)
nq = 10000                       # number of queries (test samples)
np.random.seed(1234)             # make reproducible
x_train = np.random.random((nb, d)).astype('float32')
x_train[:, 0] += np.arange(nb) / 1000.
x_test = np.random.random((nq, d)).astype('float32')
x_test[:, 0] += np.arange(nq) / 1000.

index = faiss.IndexFlatL2(d)     # build the index
print(index.is_trained)
index.add(x_train)               # add the train vectors to the index
print(index.ntotal)

k = 100                          # take the 100 closest neighbors
D, I = index.search(x_test, k)   # for each test vector, find its k nearest train vectors
print(I[:5])                     # neighbors of the first 5 queries
print(I[-5:])                    # neighbors of the last 5 queries
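One practical gap in the snippet above: FAISS expects contiguous float32 NumPy arrays of shape (N, d), while the question has torch tensors of shape (b, c, h, w). A sketch of the glue code, assuming each chunk fits in memory one at a time; note that index.add can be called once per chunk, which also sidesteps the RAM problem:

import numpy as np
import torch

def to_faiss(t: torch.Tensor) -> np.ndarray:
    # flatten each (c, h, w) sample into one row, as float32 for FAISS
    return np.ascontiguousarray(t.flatten(1).cpu().numpy().astype('float32'))

# hypothetical usage with the chunked files from the question:
# for train_file in train_files:
#     index.add(to_faiss(torch.load(train_file)))
# D, I = index.search(to_faiss(torch.load(test_file)), k)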

In the end, I chose to implement a version of the Earth-Mover's-Distance, as suggested in the following ai.StackExchange post. Let me summarize the approach:

Given the task as described in "My Task" above, I defined:

import torch

def cumsum_3d(test, train):
    # cumulative sums along the last three dims turn each activation map
    # into a discrete "CDF", so distances between them behave like an EMD
    for i in [-1, -2, -3]:
        test = torch.cumsum(test, i)
        train = torch.cumsum(train, i)
    return test, train

Then, given the tensors test and train (intuitively, the L2 distance between these cumulative sums behaves like a discretized Earth-Mover's-Distance between the activation maps):

test, train = cumsum_3d(test, train)
dist = torch.cdist(test.flatten(1),train.flatten(1))
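Note that these two lines still assume test and train fit in memory at once. Since the cumulative sums run over the last three dims, they are computed per sample, so the trick can be applied chunk by chunk. A sketch combining it with the file loop from the question (reusing the test_files/train_files lists and chunk shapes assumed there):

import torch

def cumsum_map(t):
    # same idea as cumsum_3d, applied to a single tensor
    for i in [-1, -2, -3]:
        t = torch.cumsum(t, i)
    return t

k = 100  # example value
results = []
for test_file in test_files:
    test_i = cumsum_map(torch.load(test_file))              # (b_i, c, h, w)
    row = torch.cat(
        [torch.cdist(test_i.flatten(1),
                     cumsum_map(torch.load(train_file)).flatten(1))
         for train_file in train_files], dim=1)             # (b_i, b2)
    # keep only the k smallest distances per test chunk to bound RAM use
    results.append(torch.topk(row, k, dim=1, largest=False))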

For future viewers, bear in mind that:

  • I did not use FAISS because it does not currently support Windows, but more importantly because (as far as I know) it does not support this version of EMD, or any other distance between multidimensional tensors (shape (c,h,w) as in my example). To work around the RAM problem I used Google Colab and sliced my data into more files.
  • This implementation was only relevant because I was dealing with shallow activation layers. If I were to use the last layer (avgpool) as my activations, it would have been fine to skip the EMD, since the output right after the avgpool has shape (512,).
