
Computation graph of setting weights in PyTorch

I need clarification of the code for a function in the FastAI2 library.

This is the code for WeightDropout as written in the FastAI2 library:

class WeightDropout(Module):
    "A module that wraps another layer in which some weights will be replaced by 0 during training."

    def __init__(self, module, weight_p, layer_names='weight_hh_l0'):
        self.module,self.weight_p,self.layer_names = module,weight_p,L(layer_names)
        for layer in self.layer_names:
            #Makes a copy of the weights of the selected layers.
            w = getattr(self.module, layer)
            delattr(self.module, layer)
            self.register_parameter(f'{layer}_raw', nn.Parameter(w.data))
            setattr(self.module, layer, F.dropout(w.data, p=self.weight_p, training=False))
            if isinstance(self.module, (nn.RNNBase, nn.modules.rnn.RNNBase)):
                self.module.flatten_parameters = self._do_nothing

    def _setweights(self):
        "Apply dropout to the raw weights."
        for layer in self.layer_names:
            raw_w = getattr(self, f'{layer}_raw')
            setattr(self.module, layer, F.dropout(raw_w.data, p=self.weight_p, training=self.training))

    def forward(self, *args):
        self._setweights()
        with warnings.catch_warnings():
            #To avoid the warning that comes because the weights aren't flattened.
            warnings.simplefilter("ignore")
            return self.module.forward(*args)

    def reset(self):
        for layer in self.layer_names:
            raw_w = getattr(self, f'{layer}_raw')
            setattr(self.module, layer, F.dropout(raw_w.data, p=self.weight_p, training=False))
        if hasattr(self.module, 'reset'): self.module.reset()

    def _do_nothing(self): pass

The above code randomly drops weights in the weight matrix of the hidden layers. I am primarily interested in:

def _setweights(self):
    "Apply dropout to the raw weights."
    for layer in self.layer_names:
        raw_w = getattr(self, f'{layer}_raw')
        setattr(self.module, layer, F.dropout(raw_w.data, p=self.weight_p, training=self.training))

My question is: is this operation of changing the weights recorded in the gradient computation?

No, assigning a new weight is not tracked in the computational graph, because an assignment has no derivative, so it's impossible to get a gradient through it.
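To make that concrete, here is a minimal sketch (my own illustration, not the fastai code) of the distinction: the autograd graph is built by the tensor operation (F.dropout), while the attribute assignment merely stores a reference to an already-built tensor.

import torch
import torch.nn as nn
import torch.nn.functional as F

# The graph is created by F.dropout, not by the assignment below
raw = nn.Parameter(torch.randn(4, 2))
dropped = F.dropout(raw, p=0.5, training=True)

linear = nn.Linear(2, 4)
del linear.weight        # remove the registered parameter so a plain tensor can be assigned
linear.weight = dropped  # plain Python attribute assignment, adds no node to the graph

print(dropped.grad_fn is not None)  # => True, created by the dropout operation
print(linear.weight is dropped)     # => True, the assignment just points at the same tensor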

Then why does that code work? The model is not overwriting the actual parameters; it uses a modified version for the calculations while keeping the original weights unchanged. It's a little obscure, but the most important part is how the parameters are copied when the model is created:

#Makes a copy of the weights of the selected layers.
w = getattr(self.module, layer)
delattr(self.module, layer)
self.register_parameter(f'{layer}_raw', nn.Parameter(w.data))

What happens here is that for every selected parameter you create a copy whose name ends in _raw. For example, if you have a linear layer in your model (e.g. self.linear1 = nn.Linear(2, 4)), you have two parameters with the names linear1.weight and linear1.bias. Now they are copied to linear1.weight_raw and linear1.bias_raw. To be precise, they are not copied but reassigned to the *_raw attributes, and then the original ones are deleted; hence they are just moved from the original names to the raw versions. The originals need to be deleted so that they are no longer parameters (which would be optimised/learned).
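As a quick illustration (a toy sketch with a hypothetical wrapper called RawWrapper, not the fastai implementation), this is what moving linear1.weight into a *_raw parameter looks like:

import torch
import torch.nn as nn

class RawWrapper(nn.Module):
    "A toy wrapper that moves the inner layer's weight to a *_raw parameter."
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(2, 4)
        w = getattr(self.linear1, 'weight')
        delattr(self.linear1, 'weight')                              # the original is no longer a parameter
        self.register_parameter('weight_raw', nn.Parameter(w.data))  # moved to the *_raw version

wrapper = RawWrapper()
print([name for name, _ in wrapper.named_parameters()])
# => ['weight_raw', 'linear1.bias'], only the raw copy (and the untouched bias) will be optimised
# Before any forward pass, a tensor has to be assigned back to linear1.weight (the _setweights step).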

Afterwards, when the dropout is applied, the parameters that are optimised/learned (the *_raw versions) are unchanged, but the weights used for the actual calculations are versions with some elements randomly dropped. In the example with the linear layer, that would look as follows if you do the calculations manually:

import torch
import torch.nn as nn
import torch.nn.functional as F

# A dummy input
input = torch.randn(1, 2)

# The raw parameters of the linear layer, randomly initialised
weight_raw = nn.Parameter(torch.randn(4, 2))
bias_raw = nn.Parameter(torch.randn(4))

# Randomly dropping elements of the parameters with 50% probability
weight = F.dropout(weight_raw, p=0.5)
bias = F.dropout(bias_raw, p=0.5)

# Calculation of the linear layer (forward)
output = torch.matmul(input, weight.transpose(0, 1)) + bias

From this you can see that there is no actual reassignment, but just the regular computational flow that you are familiar with.

Now you might be wondering why these *_raw parameters are created instead of applying the dropout in the forward pass (like in the example above). The reason is to avoid having to reimplement the forward pass: otherwise every module would need its forward method modified, and since forward passes differ widely across modules, that cannot be done in a generic manner. This approach essentially hijacks the parameters, so that the forward pass uses a modified version of them.

Continuing the example from above:

# Using the actual module for the same calculation
linear1 = nn.Linear(2, 4)

# Delete the parameters, so that regular tensors can be assigned to them
# Otherwise it throws an error that the tensor is not an nn.Parameter
del linear1.weight
del linear1.bias

# Assign the parameters with dropped elements
linear1.weight = weight
linear1.bias = bias

# Run the forward pass directly
output_linear1 = linear1(input)

torch.equal(output, output_linear1) # => True

The bottom line is that the parameters are extracted from the modules, and the forward pass uses a modified version of them (after dropout) for the calculations; these modified weights are no longer parameters but intermediate results.
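To tie this back to the original question: continuing the manual example from above (a quick check of my own, not part of the library code), you can verify that gradients do reach the *_raw parameters through the dropout operation, even though the assignment itself adds nothing to the graph.

# Backpropagate through the module that used the dropped weights
output_linear1.sum().backward()

print(weight_raw.grad is not None)  # => True, the gradient flows through F.dropout back to the raw parameter
print(bias_raw.grad is not None)    # => True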
