
Do you need to put EOS and BOS tokens in autoencoder transformers?

I'm starting to wrap my head around the transformer architecture, but there are some things that I am not yet able to grasp.

In decoder-free transformers, such as BERT, the tokenizer always includes the tokens CLS and SEP before and after a sentence. I understand that CLS acts both as BOS and as a single hidden output that gives the classification information, but I am a bit lost about why it needs SEP for the masked language modeling part.

I'll explain a bit more about the utility I expect to get. In my case, I want to train a transformer to act as an autoencoder, so target = input. There would be no decoder, since my idea is to reduce the dimensionality of the original vocabulary into fewer embedding dimensions, and then study (not sure how yet, but I will get there) the reduced space in order to extract useful information.

Therefore, an example would be:

string_input = "The cat is black" 
tokens_input =  [1,2,3,4]

string_target = "The cat is black"
tokens_output = [1,2,3,4]

Now when tokenizing, assuming that we tokenize word by word, what would be the advantage of adding BOS and EOS?

I think these are only useful when you are using the self-attention decoder, right? So, since in that case the outputs would have to enter the decoder right-shifted, the vectors would be:

input_string = "The cat is black EOS"
input_tokens = [1,2,3,4,5]

shifted_output_string = "BOS The cat is black"
shifted_output_tokens = [6,1,2,3,4]

output_string = "The cat is black EOS"
output_token = [1,2,3,4,5]
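
A rough sketch of that right-shift, with a hypothetical word-level vocabulary that simply mirrors the ids above:

# Hypothetical word-level vocabulary mirroring the example ids
vocab = {"The": 1, "cat": 2, "is": 3, "black": 4, "EOS": 5, "BOS": 6}

target_tokens = [1, 2, 3, 4, 5]  # "The cat is black EOS"

# Right-shifted decoder input: prepend BOS and drop the last token
decoder_input = [vocab["BOS"]] + target_tokens[:-1]
print(decoder_input)  # [6, 1, 2, 3, 4]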

However, BERT does not have a self-attention decoder, only a simple feedforward layer. That is why I'm not sure I understand the purpose of these special tokens.

In summary, the questions would be:

  • Do you always need BOS and EOS tokens, even if you don't have a transformer decoder?
  • Why does BERT, which does not have a transformer decoder, require the SEP token for the masked language model part?

First, a little about BERT - BERT word embeddings allow for multiple vector representations of the same word, based on the context in which the word was used. In this sense, BERT embeddings are context-dependent. BERT explicitly takes the index position of each word in the sentence into account while calculating its embedding. The input to BERT is a sentence rather than a single word, because BERT needs the context of the whole sentence to determine the vectors of the words in it. If you only input a single word vector to BERT, it would completely defeat the purpose of BERT's bidirectional, contextual nature. The output is then a fixed-length vector representation of the whole input sentence. BERT provides support for out-of-vocabulary words because the model learns words at a "subword" level (also called "word-pieces").
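
To make the word-piece behaviour concrete, here is a minimal sketch (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint): a longer word gets split into subword units rather than mapped to an unknown token.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# A word outside the base vocabulary is split into word-pieces
print(tokenizer.tokenize("embeddings"))
# Typically: ['em', '##bed', '##ding', '##s']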

The SEP token is used to help BERT differentiate between two different word sequences. This is necessary for next-sentence prediction (NSP). CLS is also necessary in NSP to let BERT know when the first sequence begins. Ideally you would use a format like this:

CLS [sequence 1] SEP [sequence 2] SEP
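
As a quick illustration (a sketch, again assuming the transformers library), encoding a sentence pair shows the tokenizer producing exactly this pattern automatically:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Encode a sentence pair; [CLS] and [SEP] are inserted automatically
encoded = tokenizer("The cat is black", "The dog is white")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Roughly: ['[CLS]', 'the', 'cat', 'is', 'black', '[SEP]', 'the', 'dog', 'is', 'white', '[SEP]']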

Note that we are not using any BOS or EOS tokens. The standard BERT tokenizer does not include these. We can see this if we run the following code:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
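
# Inspect which special tokens the pretrained tokenizer actually defines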
print(tokenizer.eos_token)
print(tokenizer.bos_token)
print(tokenizer.sep_token)
print(tokenizer.cls_token)

Output: None None [SEP] [CLS]

For masked-language-modeling (MLM), we are only concerned with the MASK token, since the model's objective is merely to guess the masked token.
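
A minimal sketch of MLM at inference time (assuming the transformers fill-mask pipeline with bert-base-uncased; the model simply predicts the token hidden behind [MASK]):

from transformers import pipeline

fill_mask = pipeline('fill-mask', model='bert-base-uncased')

# The model's only objective is to guess the masked token
for prediction in fill_mask("The cat is [MASK]."):
    print(prediction['token_str'], round(prediction['score'], 3))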

BERT was trained on both NSP and MLM, and it is the combination of those two training methods that makes BERT so effective.

So to answer your questions - you do not "always need" EOS and/or BOS. In fact, you don't "need" them at all. However, if you are fine-tuning BERT for a specific downstream task where you intend to use BOS and EOS tokens (the manner of which is up to you), then yes, I suppose you would include them as special tokens. But understand that BERT was not trained with those in mind, and you may see unpredictable/unstable results.
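
If you do go that route, here is a sketch of how the registration might look (assuming the transformers library; the embedding matrix has to be resized so the new token ids get vectors, which start out randomly initialized):

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Register BOS/EOS as special tokens - BERT was not pretrained with them
num_added = tokenizer.add_special_tokens({'bos_token': '[BOS]', 'eos_token': '[EOS]'})

# Resize the embedding matrix so the new token ids have (randomly initialized) vectors
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.bos_token, tokenizer.eos_token, num_added)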
