
For spacy's NER, do I need to label the entire word as an entity?

I'm fairly new to spaCy and NER. I am dealing with a problem where I want to label many examples of short-form text data. I want to map company names to a custom entity CUSTOM.

Example descriptions:

Amazon1337XS324, Amazon4357YT322, *Google, Just *Eat

I am currently labeling the training data. My doubt is whether I should label the entire word as an entity or not, e.g. "Amazon1337XS324" vs. "Amazon", "*Google" vs. "Google", and "Just *Eat" vs. "Just Eat".

From this previous post it seems I shouldn't try to remove information that the NER model would find useful. Also, in many labeling tutorials the entire word is always labeled. However, in my use case, the "non-descriptive" part of the word could change every time, as in the Amazon example, and could end up being noise for the model.

I'm also unsure about generalization: if I only provide the entities "Amazon" or "Google" to spaCy's NER model, and new examples come in with many new characters attached to the same word (e.g. Amazon1337XS325, Amazon1337XS326), will the NER model still be able to identify "Amazon" or "Google" as CUSTOM?

You can't put an NER label on half a token. The tokenizer runs before NER, and the NER component assigns a label to each whole token, so if you're only interested in part of a token, the NER component won't be able to figure that out.
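You can see this constraint directly with `Doc.char_span`, which refuses to build a span whose boundaries fall inside a token. A minimal sketch with a blank pipeline (the example string is taken from the question):

```python
import spacy

# A blank English pipeline is enough to show the tokenizer behavior.
nlp = spacy.blank("en")
doc = nlp("Amazon1337XS324 is a company code")

# "Amazon1337XS324" comes out as a single token.
print([t.text for t in doc])

# A span covering only "Amazon" (characters 0-6) does not align with
# token boundaries, so char_span returns None; it could not be used
# as an entity annotation for training.
partial = doc.char_span(0, 6, label="CUSTOM")
print(partial)  # None

# A span covering the whole token aligns and works.
whole = doc.char_span(0, 15, label="CUSTOM")
print(whole)  # Amazon1337XS324
```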

If you don't have some way to separate the tokens in preprocessing, it seems like the only thing you can do is label the whole token. You're right that this will make it harder for the model to learn.
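If the company name and the trailing code do follow a predictable pattern, preprocessing can split them apart before the tokenizer sees them. A minimal sketch using Python's `re` module; the boundary rule here (a lowercase letter immediately followed by a digit) is an assumption based on the examples in the question, not a general solution:

```python
import re

def split_name_and_code(text: str) -> str:
    # Insert a space at every lowercase-letter -> digit boundary, so
    # "Amazon1337XS324" becomes "Amazon 1337XS324" and the tokenizer
    # then produces two separate tokens, one of which can be labeled.
    return re.sub(r"(?<=[a-z])(?=\d)", " ", text)

print(split_name_and_code("Amazon1337XS324"))  # Amazon 1337XS324
print(split_name_and_code("Amazon4357YT322"))  # Amazon 4357YT322
```

You would apply the same transformation at both training and inference time, so the model always sees the name and the code as separate tokens.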

One alternative is to try training a character-level NER component: basically, split your input into individual characters before training.
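One simple (hypothetical) way to do that split is to join the characters of each word with spaces, so every character becomes its own token and entity boundaries can land anywhere inside what used to be one word:

```python
def to_char_tokens(word: str) -> str:
    # Separate every character with a space so the tokenizer treats
    # each character as its own token; entity spans can then start
    # and end in the middle of the original word.
    return " ".join(word)

print(to_char_tokens("Amazon1337XS324"))
# A m a z o n 1 3 3 7 X S 3 2 4
```

With input in this form, the span covering only "A m a z o n" aligns with token boundaries and can be labeled CUSTOM, at the cost of much longer sequences for the model to process.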
