How does spaCy use word embeddings for Named Entity Recognition (NER)?
I'm trying to train an NER model using spaCy to identify locations, (person) names, and organisations. I'm trying to understand how spaCy recognises entities in text, and I've not been able to find an answer. From this issue on GitHub and this example, it appears that spaCy uses a number of features present in the text, such as POS tags, prefixes, suffixes, and other character- and word-based features, to train an Averaged Perceptron.
However, nowhere in the code does it appear that spaCy uses the GloVe embeddings (although each word in the sentence/document appears to have them, if present in the GloVe corpus).
My question is: how (if at all) is spaCy using the word vectors? I've tried looking through the Cython code, but I'm not able to understand whether the labelling system uses word embeddings.
spaCy does use word embeddings for its NER model, which is a multilayer CNN. There's quite a nice video that Matthew Honnibal, the creator of spaCy, made about how its NER works here. All three English models use GloVe vectors trained on Common Crawl, but the smaller models "prune" the number of vectors by having similar words mapped to the same vector (link).
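To make the vector behaviour concrete, here is a minimal sketch of how spaCy attaches word vectors to tokens, and how a "pruned" table can map several words to the same vector row. It uses a blank pipeline with tiny hand-made vectors so no model download is needed; the words and values are illustrative stand-ins, not actual GloVe entries.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")

# Assign a few words 4-dimensional vectors (the real GloVe vectors
# shipped with the English models are much higher-dimensional).
nlp.vocab.set_vector("cat", np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"))
nlp.vocab.set_vector("dog", np.array([0.1, 0.2, 0.3, 0.5], dtype="float32"))

doc = nlp("cat dog unknown")
print(doc[0].has_vector)  # "cat" has a vector
print(doc[2].has_vector)  # "unknown" was never assigned one

# Pruning shares rows in the table: map "kitten" to the same row
# as "cat", so both keys return the identical vector.
row = nlp.vocab.vectors.key2row[nlp.vocab.strings["cat"]]
nlp.vocab.vectors.add("kitten", row=row)
print(np.allclose(nlp.vocab["kitten"].vector, nlp.vocab["cat"].vector))
```

This row-sharing is exactly what the smaller English models do at scale: many rare words are remapped to the row of a similar frequent word, shrinking the table without dropping coverage entirely.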
It's quite doable to add custom vectors. There's an overview of the process in the spaCy docs, plus some example code on GitHub.
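As a rough sketch of that process, the following loads custom vectors into a blank pipeline via `Vocab.set_vector`. The random 300-d vectors here are hypothetical placeholders for whatever embeddings (word2vec, fastText, GloVe, ...) you have trained yourself; the exact workflow in the docs may differ by spaCy version.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")

# Stand-in for your own trained embeddings, keyed by word.
rng = np.random.default_rng(0)
custom_vectors = {
    "london": rng.standard_normal(300).astype("float32"),
    "paris": rng.standard_normal(300).astype("float32"),
    "acme": rng.standard_normal(300).astype("float32"),
}
for word, vector in custom_vectors.items():
    nlp.vocab.set_vector(word, vector)

# The vectors are now available to the pipeline, e.g. for similarity.
doc = nlp("london paris")
print(doc[0].similarity(doc[1]))
```

Once the vectors are in the vocab, statistical components trained on this pipeline (such as a new NER component) can use them as features, which is the link back to the question above.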