
Proper way to add new vectors for OOV words

I'm using some domain-specific language which has a lot of OOV words as well as some typos. I have noticed spaCy will just assign an all-zero vector to these OOV words, so I'm wondering what the proper way to handle this is. I would appreciate clarification on all of these points if possible:

  1. What exactly does the pre-train command do? Honestly, I cannot seem to parse the explanation from the website correctly:

Pre-train the “token to vector” (tok2vec) layer of pipeline components, using an approximate language-modeling objective. Specifically, we load pretrained vectors, and train a component like a CNN, BiLSTM, etc. to predict vectors which match the pretrained ones.

Isn't the tok2vec the part that generates the vectors? So shouldn't this command then change the produced vectors? What does it mean to load pretrained vectors and then train a component to predict these vectors? What's the purpose of doing this?

What does the --use-vectors flag do? What does the --init-tok2vec flag do? Is this included in the documentation by mistake?

  2. It seems pretrain is not what I'm looking for; it doesn't change the vectors for a given word. What would be the easiest way to generate a new set of vectors which includes my OOV words but still contains the general knowledge of the language?

  3. As far as I can see, spaCy's pretrained models use fastText vectors. The fastText website mentions:

A nice feature is that you can also query for words that did not appear in your data. Indeed, words are represented by the sum of their substrings. As long as the unknown word is made of known substrings, there is a representation of it!

But it seems spaCy does not use this feature. Is there a way to still make use of this for OOV words?

Thanks a lot

I think there is some confusion about the different components - I'll try to clarify:

  1. The tokenizer does not produce vectors. It's just a component that segments texts into tokens. In spaCy, it's rule-based and not trainable, and doesn't have anything to do with vectors. It looks at whitespace and punctuation to determine which are the unique tokens in a sentence.
  2. An nlp model in spaCy can have predefined (static) word vectors that are accessible on the Token level. Every token with the same Lexeme gets the same vector. Some tokens/lexemes may indeed be OOV, like misspellings. If you want to redefine/extend all vectors used in a model, you can use something like init-model.
  3. The tok2vec layer is a machine learning component that learns how to produce suitable (dynamic) vectors for tokens. It does this by looking at lexical attributes of the token, but may also include the static vectors of the token (cf. item 2). This component is generally not used by itself, but is part of another component, such as an NER. It will be the first layer of the NER model, and it can be trained as part of training the NER, to produce vectors that are suitable for your NER task.
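To make item 2 concrete, here is a small sketch of the static-vector behaviour (spaCy v3 API; the word "flurbs" and the vector values are arbitrary placeholders):

```python
import numpy as np
import spacy

nlp = spacy.blank("en")  # a blank pipeline ships with no static vectors

doc = nlp("flurbs")
print(doc[0].has_vector)  # False: "flurbs" is OOV, its vector is all zeros

# Register a static vector for the lexeme; every future token with this
# text gets the same vector (the all-ones values are arbitrary).
nlp.vocab.set_vector("flurbs", np.ones((32,), dtype="float32"))

doc = nlp("flurbs")
print(doc[0].has_vector)    # True
print(doc[0].vector.sum())  # 32.0
```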

In spaCy v2, you can first train a tok2vec component with pretrain, and then use this component for a subsequent train command. Note that all settings need to be the same across both commands, for the layers to be compatible.

To answer your questions:

Isn't the tok2vec the part that generates the vectors?

If you mean the static vectors, then no. The tok2vec component produces new vectors (possibly with a different dimension) on top of the static vectors, but it won't change the static ones.
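A quick sketch of that separation (spaCy v3 API, with randomly initialized weights): the tok2vec output is stored per document, one dynamic vector per token, independently of any static vectors:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tok2vec")  # untrained tok2vec layer, randomly initialized
nlp.initialize()

doc = nlp("spaCy produces dynamic vectors")
# The tok2vec layer writes its output to doc.tensor: one row per token.
print(doc.tensor.shape[0] == len(doc))  # True
```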

What does it mean to load pretrained vectors and then train a component to predict these vectors? What's the purpose of doing this?

The purpose is to get a tok2vec component that is already pretrained from external vector data. The external vector data already embeds some "meaning" or "similarity" of the tokens, and this is, so to speak, transferred into the tok2vec component, which learns to produce the same similarities. The point is that this new tok2vec component can then be used and further fine-tuned in the subsequent train command (cf. item 3).

Is there a way to still make use of this for OOV words?

It really depends on what your "use" is. As https://stackoverflow.com/a/57665799/7961860 mentions, you can set the vectors yourself, or you can implement a user hook which will decide how to define token.vector.
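Here is a minimal sketch of the user-hook approach (spaCy v3 registration API; the fallback used here, a pseudo-random vector seeded from the token text, is just a placeholder for whatever OOV strategy you actually choose, e.g. fastText subword vectors):

```python
import numpy as np
import spacy
from spacy.language import Language

@Language.component("oov_vector_hook")
def oov_vector_hook(doc):
    def vector(token):
        if token.has_vector:
            return token.vector
        # Placeholder OOV strategy: a pseudo-random vector seeded from the
        # token text (stable within a process). Swap in your real fallback.
        rng = np.random.default_rng(abs(hash(token.lower_)) % (2 ** 32))
        return rng.standard_normal(300).astype("float32")

    # Token.vector consults this hook before the static vectors table.
    doc.user_token_hooks["vector"] = vector
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("oov_vector_hook")

doc = nlp("flurbs")       # OOV in a blank model
print(doc[0].vector.shape)  # (300,)
```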

I hope this helps. I can't really recommend the best approach for you to follow without understanding why you want the OOV vectors / what your use-case is. Happy to discuss further in the comments!

