
How to get cosine similarity of word embedding from BERT model

I was interested in how to get the similarity of word embeddings in different sentences from a BERT model (that is, the same word can have different meanings in different contexts).

For example:

sent1 = 'I like living in New York.'
sent2 = 'New York is a prosperous city.'

I want to get the value of cos(New York, New York) from sent1 and sent2: even though the phrase 'New York' is the same, it appears in different sentences. I got some intuition from https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958/2

But I still do not know which layer's embedding I need to extract and how to calculate the cosine similarity for my example above.

Thanks in advance for any suggestions!

Okay let's do this.

First you need to understand that BERT has 13 layers. The first is basically just the embedding layer that BERT learned during its initial training. You can use it, but you probably don't want to, since that's essentially a static embedding and you're after a dynamic (contextual) embedding. For simplicity I'm going to only use the last hidden layer of BERT.
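A quick way to see those 13 sets of hidden states (a minimal sketch, assuming bert-base-cased and the Hugging Face transformers API; the sentence is just the first example from the question):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased')

inputs = tokenizer('I like living in New York.', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# One hidden state per "layer": the embedding output plus the 12 encoder layers.
print(len(outputs.hidden_states))        # 13
print(outputs.hidden_states[-1].shape)   # (1, seq_len, 768) -- the last hidden layer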

Here you're using two words: "New" and "York". You could treat this as one word during preprocessing and combine it into "New-York" or something if you really wanted to. In this case I'm going to treat it as two separate words and average the embeddings that BERT produces.

This can be described in a few steps:

  1. Tokenize the inputs
  2. Determine where the tokenizer has word_ids for New and York (suuuuper important)
  3. Pass through BERT
  4. Average
  5. Cosine similarity

First, what you need to import: from transformers import AutoTokenizer, AutoModel

Now we can create our tokenizer and our model:

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased')

EDIT: hit enter too soon. Adding example
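Putting the five steps together, an end-to-end sketch could look like the following. This assumes bert-base-cased with a fast tokenizer (so that word_ids() is available); the regex word split, the phrase_embedding helper, and mean pooling over the "New"/"York" tokens are illustrative choices for this example, not the only way to do it.

import re
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased')
model.eval()

sent1 = 'I like living in New York.'
sent2 = 'New York is a prosperous city.'

def phrase_embedding(sentence, phrase_words=('New', 'York')):
    # Step 1: tokenize (a fast tokenizer exposes word_ids()).
    encoded = tokenizer(sentence, return_tensors='pt')

    # Step 2: find which word indices belong to "New" and "York".
    # Splitting on whitespace/punctuation mirrors BERT's basic
    # pre-tokenization for these example sentences.
    words = re.findall(r"\w+|[^\w\s]", sentence)
    target_words = {i for i, w in enumerate(words) if w in phrase_words}
    token_positions = [i for i, wid in enumerate(encoded.word_ids())
                       if wid in target_words]

    # Step 3: pass through BERT and keep only the last hidden layer.
    with torch.no_grad():
        last_hidden = model(**encoded).last_hidden_state[0]   # (seq_len, 768)

    # Step 4: average the (sub)token vectors for "New" and "York".
    return last_hidden[token_positions].mean(dim=0)

# Step 5: cosine similarity between the two contextual embeddings.
emb1 = phrase_embedding(sent1)
emb2 = phrase_embedding(sent2)
print(torch.nn.functional.cosine_similarity(emb1, emb2, dim=0).item())

Because the embeddings come from the last hidden layer, the two "New York" vectors are context-dependent, so the similarity will be high but not exactly 1.0 even though the surface phrase is identical.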
