I am new to attention mechanisms and have, perhaps naively, implemented one using a Python for-loop in the forward() function of my model, shown below.
Basically, I have an embedding layer for items. From it I get the embedding of one item and the embeddings of a sequence of other items, which I sum, weighted by the attention weights. To get the attention weights I use a sub-network (nn.Sequential(...)) that takes a pair of item embeddings as input and outputs a score, as in regression. All the scores are then softmaxed and used as attention weights.
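For context, the attention sub-network is defined in __init__ roughly like this (the hidden size here is just illustrative):

import torch.nn as nn

# scores one concatenated pair of embeddings (2*E input features) with a single value
self.AttentionNet = nn.Sequential(
    nn.Linear(2 * E, 64),  # 64 is an arbitrary hidden size
    nn.ReLU(),
    nn.Linear(64, 1),
)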
def forward(self, input_features, ...):
    ...
    """ B = batch size, I = number of items for attention, E = embedding size """
    ...
    # get embeddings from input features for current batch
    embeddings = self.embedding_layer(input_features)        # (B, E)
    other_embeddings = self.embedding_layer(other_features)  # (I, E)
    # attention between pairs of embeddings
    attention_scores = torch.zeros((B, I))  # (B, I)
    for i in range(I):
        # repeat the i-th embedding batch-size times
        repeated_other_embedding = other_embeddings[i].view(1, -1).repeat(B, 1)  # (B, E)
        # concat pairs of embeddings to form the input to the attention network
        item_emb_pairs = torch.cat((embeddings.detach(), repeated_other_embedding.detach()), dim=1)  # (B, 2*E)
        # pass the batch through the attention network
        attention_scores[:, [i]] = self.AttentionNet(item_emb_pairs)  # (B, 1)
    # pass through softmax
    attention_scores = F.softmax(attention_scores, dim=1)  # (B, I)
    ...
How do I avoid the Python for-loop, which I suspect is what is slowing down training so much? Can I somehow pass a tensor of shape (I, B, 2*E) to self.AttentionNet()?
You can use the following snippet.
embeddings = self.embedding_layer(input_features) # (B, E)
other_embeddings = self.embedding_layer(other_features) # (I, E)
embs = embeddings.unsqueeze(1).repeat(1, I, 1) # (B, I, E)
other_embs = other_embeddings.unsqueeze(0).repeat(B, 1, 1) # (B, I, E)
concatenated_embeddings = torch.cat((embs, other_embs), dim=2) # (B, I, 2*E)
attention_scores = F.softmax(self.AttentionNet(concatenated_embeddings).squeeze(-1), dim=1)  # (B, I)
You may need to make some changes to self.AttentionNet, since in this scenario you are feeding it a 3-D tensor of shape (B, I, 2*E) rather than 2-D batches of shape (B, 2*E).
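That said, nn.Linear applies to the last dimension of its input, so if the network is just linear layers and element-wise activations it should accept the (B, I, 2*E) tensor as-is. A minimal, standalone sanity check (the sizes and hidden width below are made up) comparing the vectorized version against the original loop:

import torch
import torch.nn as nn
import torch.nn.functional as F

B, I, E = 4, 7, 16  # made-up sizes
net = nn.Sequential(nn.Linear(2 * E, 32), nn.ReLU(), nn.Linear(32, 1))

embeddings = torch.randn(B, E)        # stands in for self.embedding_layer(input_features)
other_embeddings = torch.randn(I, E)  # stands in for self.embedding_layer(other_features)

# vectorized: build all (B, I) pairs at once and score them in a single pass
embs = embeddings.unsqueeze(1).repeat(1, I, 1)              # (B, I, E)
other_embs = other_embeddings.unsqueeze(0).repeat(B, 1, 1)  # (B, I, E)
pairs = torch.cat((embs, other_embs), dim=2)                # (B, I, 2*E)
vec_scores = F.softmax(net(pairs).squeeze(-1), dim=1)       # (B, I)

# original loop for comparison
loop_scores = torch.zeros(B, I)
for i in range(I):
    repeated = other_embeddings[i].view(1, -1).repeat(B, 1)             # (B, E)
    loop_scores[:, [i]] = net(torch.cat((embeddings, repeated), dim=1))  # (B, 1)
loop_scores = F.softmax(loop_scores, dim=1)

print(torch.allclose(vec_scores, loop_scores, atol=1e-6))  # True

expand could also be used in place of repeat here; it creates a view rather than copying, so the actual copy only happens at the cat.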