
How can I train a BERT model for a representational learning task that is domain specific?

I am trying to generate good sentence embeddings for some specific types of text using sentence-transformer models, but testing the similarity and clustering them with k-means does not give good results. Any ideas for improvement? I was thinking of training one of the sentence-transformer models on my dataset (which consists of sentences only, without any labels). How can I retrain the existing models specifically on my data to generate better embeddings? Thanks.

The sentence embeddings produced by a pre-trained BERT model are generic and are not necessarily appropriate for every task.
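For reference, sentence embeddings are typically derived from BERT's token-level outputs by pooling, most commonly mean pooling over non-padding tokens. A minimal, self-contained sketch of that step (independent of which checkpoint produced the hidden states):

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padding positions.

    last_hidden_state: (batch, seq_len, hidden) from a BERT forward pass
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts  # (batch, hidden)
```

The same pooling applies whether the encoder is the stock model or one fine-tuned on the domain corpus.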

To solve this problem:

  1. Fine-tune the model with a task-specific corpus on the given task (if the end goal is classification, fine-tune the model for the classification task; you can then use the embeddings from the fine-tuned BERT model). This is the method suggested for USE embeddings as well, especially when the model remains a black box.

  2. Fine-tune the model in an unsupervised manner using masked language modeling (MLM). This doesn't require you to know the task beforehand; you simply reuse BERT's original pre-training strategy to adapt the model to your corpus.

