How does spaCy use word embeddings for Named Entity Recognition (NER)?
I'm trying to train an NER model using spaCy to identify locations, (person) names, and organisations. I'm trying to understand how spaCy recognises entities in text, and I've not been able to find an answer. From this issue on GitHub and this example, it appears that spaCy uses a number of features present in the text, such as POS tags, prefixes, suffixes, and other character- and word-based features, to train an Averaged Perceptron.
However, nowhere in the code does it appear that spaCy uses the GloVe embeddings (although each word in the sentence/document appears to have them, if present in the GloVe corpus).
My question is: how (if at all) is spaCy using the word vectors? I've tried looking through the Cython code, but I'm not able to understand whether the labelling system uses word embeddings.
spaCy does use word embeddings for its NER model, which is a multilayer CNN. There's quite a nice video that Matthew Honnibal, the creator of spaCy, made about how its NER works here. All three English models use GloVe vectors trained on Common Crawl, but the smaller models "prune" the number of vectors by having similar words mapped to the same vector (link).
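To make the vector behaviour concrete, here is a minimal sketch of how spaCy attaches word vectors to tokens, and how a "pruned" table can map several words to the same vector row. It uses a blank pipeline with tiny hand-made vectors so no model download is needed; the words and values are illustrative stand-ins, not actual GloVe entries.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")

# Assign a few words 4-dimensional vectors (the real GloVe vectors
# shipped with the English models are much higher-dimensional).
nlp.vocab.set_vector("cat", np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"))
nlp.vocab.set_vector("dog", np.array([0.1, 0.2, 0.3, 0.5], dtype="float32"))

doc = nlp("cat dog unknown")
print(doc[0].has_vector)  # "cat" has a vector
print(doc[2].has_vector)  # "unknown" was never assigned one

# Pruning shares rows in the table: map "kitten" to the same row
# as "cat", so both keys return the identical vector.
row = nlp.vocab.vectors.key2row[nlp.vocab.strings["cat"]]
nlp.vocab.vectors.add("kitten", row=row)
print(np.allclose(nlp.vocab["kitten"].vector, nlp.vocab["cat"].vector))
```

This row-sharing is exactly what the smaller English models do at scale: many rare words are remapped to the row of a similar frequent word, shrinking the table without dropping coverage entirely.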
It's quite doable to add custom vectors. There's an overview of the process in the spaCy docs, plus some example code on GitHub.
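As a rough sketch of that process, the following loads custom vectors into a blank pipeline via `Vocab.set_vector`. The random 300-d vectors here are hypothetical placeholders for whatever embeddings (word2vec, fastText, GloVe, ...) you have trained yourself; the exact workflow in the docs may differ by spaCy version.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")

# Stand-in for your own trained embeddings, keyed by word.
rng = np.random.default_rng(0)
custom_vectors = {
    "london": rng.standard_normal(300).astype("float32"),
    "paris": rng.standard_normal(300).astype("float32"),
    "acme": rng.standard_normal(300).astype("float32"),
}
for word, vector in custom_vectors.items():
    nlp.vocab.set_vector(word, vector)

# The vectors are now available to the pipeline, e.g. for similarity.
doc = nlp("london paris")
print(doc[0].similarity(doc[1]))
```

Once the vectors are in the vocab, statistical components trained on this pipeline (such as a new NER component) can use them as features, which is the link back to the question above.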