Add dense layer on top of Huggingface BERT model

I want to add a dense layer on top of the bare BERT Model transformer that outputs raw hidden states, and then fine-tune the resulting model. Specifically, I am using this base model. This is what the model should do:

  1. Encode the sentence (a vector with 768 elements for each token of the sentence)
  2. Keep only the first vector (related to the first token)
  3. Add a dense layer on top of this vector, to get the desired transformation

So far, I have successfully encoded the sentences:

from sklearn.neural_network import MLPRegressor

import torch

from transformers import AutoModel, AutoTokenizer

# List of strings
sentences = [...]
# List of numbers
labels = [...]

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
model = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")

# 2D array, one line per sentence containing the embedding of the first token
encoded_sentences = torch.stack([model(**tokenizer(s, return_tensors='pt'))[0][0][0]
                                 for s in sentences]).detach().numpy()

regr = MLPRegressor()
regr.fit(encoded_sentences, labels)

In this way I can train a neural network by feeding it the encoded sentences. However, this approach clearly does not fine-tune the base BERT model. Can anybody help me? How can I build a model (possibly in PyTorch or using the Huggingface library) that can be fine-tuned end to end?

There are two ways to do it. Since you are looking to fine-tune the model for a downstream task similar to classification, you can directly use:

the BertForSequenceClassification class, which fine-tunes a logistic-regression-style classification layer on top of the 768-dimensional output.
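A minimal sketch of this first approach (the num_labels=3 value and the dummy inputs are assumptions for illustration; when labels are passed in, the model returns a loss that backpropagates through all BERT weights):

import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
model = BertForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-italian-xxl-cased", num_labels=3)  # num_labels is illustrative

inputs = tokenizer("Una frase di esempio", return_tensors="pt")
labels = torch.tensor([1])  # dummy label, for illustration only
outputs = model(**inputs, labels=labels)
loss, logits = outputs.loss, outputs.logits  # loss.backward() fine-tunes the whole model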

Alternatively, you can define a custom module that creates a BERT model based on the pre-trained weights and adds layers on top of it:

import torch
import torch.nn as nn
from transformers import AutoTokenizer, BertModel

class CustomBERTModel(nn.Module):
    def __init__(self):
        super(CustomBERTModel, self).__init__()
        self.bert = BertModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
        ### New layers:
        self.linear1 = nn.Linear(768, 256)
        self.linear2 = nn.Linear(256, 3) ## 3 is the number of classes in this example

    def forward(self, ids, mask):
        # last_hidden_state has shape (batch_size, sequence_length, 768)
        sequence_output = self.bert(ids, attention_mask=mask).last_hidden_state

        linear1_output = self.linear1(sequence_output[:, 0, :]) ## extract the 1st token's embeddings

        linear2_output = self.linear2(linear1_output)

        return linear2_output

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
model = CustomBERTModel() # You can pass parameters if required to have a more flexible model
model.to(torch.device("cpu")) ## or "cuda" if a GPU is available
criterion = nn.CrossEntropyLoss() ## If required, define your own criterion
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))

for epoch in range(epochs): # epochs = number of training epochs
    for batch in data_loader: ## assuming a DataLoader() object that yields (data, targets) tuples

        data = batch[0]
        targets = batch[1]

        optimizer.zero_grad()
        encoding = tokenizer.batch_encode_plus(data, return_tensors='pt', padding=True,
                                               truncation=True, max_length=50, add_special_tokens=True)
        input_ids = encoding['input_ids']
        attention_mask = encoding['attention_mask']

        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, targets) # CrossEntropyLoss expects raw logits, so no softmax here
        loss.backward()
        optimizer.step()
        

If you want to tune the BERT model itself you will need to modify the parameters of the model. To do this you will most likely want to do your work with PyTorch. Here is some rough pseudo-code to illustrate:

from torch.optim import SGD

model = ... # whatever model you are using
parameters = model.parameters() # or some more specific set of parameters
optimizer = SGD(parameters, lr=0.01) # or whatever optimizer you want
optimizer.zero_grad() # boilerplate pytorch function

input = ... # whatever the appropriate input for your task is
label = ... # whatever the appropriate label for your task is
loss = model(**input, labels=label)[0] # usually the loss is the first item returned
loss.backward() # calculates gradients
optimizer.step() # runs the optimization algorithm

I've left out all the relevant details because they are quite tedious and specific to whatever your specific task is. Huggingface has a nice article walking through this in more detail here, and you will definitely want to refer to some PyTorch documentation as you use any PyTorch stuff. I highly recommend the PyTorch blitz before trying to do anything serious with it.

For anyone using TensorFlow/Keras the equivalent of Ashwin's answer would be:

from tensorflow import keras
from transformers import AutoTokenizer, TFAutoModel


class CustomBERTModel(keras.Model):
    def __init__(self):
        super(CustomBERTModel, self).__init__()
        self.bert = TFAutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
        ### New layers:
        self.linear1 = keras.layers.Dense(256)
        self.linear2 = keras.layers.Dense(3) ## 3 is the number of classes in this example

    def call(self, inputs, training=False):
        # call expects only one positional argument, so pass in a tuple and unpack it.
        # The training argument is a special reserved Keras parameter.
        ids, mask = inputs
        sequence_output = self.bert(ids, mask, training=training).last_hidden_state

        # sequence_output has the following shape: (batch_size, sequence_length, 768)
        linear1_output = self.linear1(sequence_output[:, 0, :]) ## extract the 1st token's embeddings

        linear2_output = self.linear2(linear1_output)

        return linear2_output


model = CustomBERTModel()
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")

ipts = tokenizer("Some input sequence", return_tensors="tf")
test = model((ipts["input_ids"], ipts["attention_mask"]))

Then to train the model you can write a custom training loop using GradientTape.
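A minimal sketch of such a loop, assuming a tf.data.Dataset named train_dataset that yields ((input_ids, attention_mask), labels) batches; the optimizer and loss choices here are illustrative:

import tensorflow as tf
from tensorflow import keras

optimizer = keras.optimizers.Adam(learning_rate=2e-5) # illustrative choice
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True) # the model outputs raw logits

for (input_ids, attention_mask), labels in train_dataset: # assumed tf.data.Dataset
    with tf.GradientTape() as tape:
        logits = model((input_ids, attention_mask), training=True)
        loss = loss_fn(labels, logits)
    # Gradients flow through both the new Dense layers and the BERT weights
    grads = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))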

You can verify that the additional layers are also trainable with model.trainable_weights. You can access the weights of individual layers; for example, model.trainable_weights[-1].numpy() would get the last layer's bias vector. [Note that the Dense layers will only appear after the call method is executed for the first time.]
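A quick check along those lines, reusing the model and ipts defined above:

# Run one forward pass first so the Dense layers get built
_ = model((ipts["input_ids"], ipts["attention_mask"]))

print(len(model.trainable_weights)) # BERT weights plus the kernels/biases of the two Dense layers
print(model.trainable_weights[-1].numpy()) # bias vector of the last Dense layer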
