
How to understand byte pair encoding?

Original post: 2020-03-12 18:21:14 · Tags: python, scikit-learn, nlp, vectorization

I have read a lot of tutorials about BPE, but I am still confused about how it works.

For example, an online tutorial says the following:

Algorithm

1. Prepare a large enough set of training data (i.e. a corpus).

2. Define a desired subword vocabulary size.

3. Split each word into a sequence of characters and append an end-of-word suffix (conventionally "</w>") to the end of the word, keeping the word frequency. So the basic unit at this stage is the character. For example, if the frequency of "low" is 5, we rewrite it as "l o w </w>": 5.

4. Generate a new subword from the most frequent pair of adjacent units.

5. Repeat step 4 until the subword vocabulary size defined in step 2 is reached, or until the next highest pair frequency is 1.

Taking "low: 5", "lower: 2", "newest: 6" and "widest: 3" as an example, the highest-frequency subword pair is e and s, because we get a count of 6 from "newest" and a count of 3 from "widest". A new subword (es) is then formed, and it becomes a candidate in the next iteration.

In the second iteration, the next highest-frequency subword pair is es (generated in the previous iteration) and t, because we again get a count of 6 from "newest" and a count of 3 from "widest".

I do not understand why "low" is 5 and "lower" is 2:

Does this mean l, o, w, lo, ow + = 6, and then "lower" equals two? But why is it not e, r, er, which gives three?

1 answer

The numbers you are asking about are the frequencies of the words in the corpus. The word "low" was seen in the corpus 5 times and the word "lower" 2 times (the tutorial simply assumes these counts for the example). They are not computed from the characters or subwords of the word.
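To make the starting point concrete, here is a minimal sketch of how the example corpus could be represented before any merges. The name `to_symbols` and the end-of-word marker `"</w>"` are assumptions made for illustration, not something taken from the tutorial's code.

```python
# Word frequencies assumed by the tutorial's example corpus.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

END = "</w>"  # assumed end-of-word marker


def to_symbols(word_freqs):
    """Split every word into characters plus END, keeping its corpus frequency."""
    return {tuple(word) + (END,): freq for word, freq in word_freqs.items()}


vocab = to_symbols(word_freqs)
for symbols, freq in vocab.items():
    print(" ".join(symbols), freq)
```

Each word carries its own frequency along; the counts 5 and 2 are inputs to BPE, not results of it.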

In the first iteration we see that the character pair "es" is the most frequent one, because it appears 6 times in the 6 occurrences of "newest" and 3 times in the 3 occurrences of "widest", for 9 occurrences in total.
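The counting in this step can be sketched as follows. This is only an illustration under the same assumptions as the example (the four word frequencies, and `"</w>"` as an assumed end-of-word marker); `count_pairs` is a hypothetical helper name.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighting each pair by its word's frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


# Words split into characters, with an assumed "</w>" end-of-word marker.
vocab = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

pairs = count_pairs(vocab)
print(pairs[("e", "s")])  # 9: 6 from "newest" + 3 from "widest"
print(pairs[("e", "r")])  # 2: only from "lower"
print(pairs[("l", "o")])  # 7: 5 from "low" + 2 from "lower"
```

Note that in this toy corpus the pair ("s", "t") also occurs 9 times, so which pair is merged first depends on how an implementation breaks ties.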

In the second iteration, "es" is a unit in our vocabulary in the same way the single characters are. We then see that "est", formed from "es" and "t", is the most common combination ("newest" and "widest").
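The two iterations can be reproduced with a short sketch like the one below. It is not the tutorial's implementation: `count_pairs` and `merge_pair` are hypothetical helpers, `"</w>"` is an assumed end-of-word marker, and ties are broken by taking the first pair encountered, which is just one possible choice.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged


# Example corpus, with an assumed "</w>" end-of-word marker.
vocab = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

merges = []
for _ in range(2):  # two iterations, as in the example
    pairs = count_pairs(vocab)
    best = max(pairs, key=pairs.get)  # on a tie, the first pair seen wins
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)  # [('e', 's'), ('es', 't')]
```

After these two merges, "newest" is stored as n e w est </w> and "widest" as w i d est </w>, while "low" and "lower" are still split into single characters.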