
Why are spaCy NER results highly unpredictable?

I tried spaCy for NER, but the results are highly unpredictable. Sometimes spaCy fails to recognize a particular country. Can anyone explain why this happens? I tried it on some random sentences.

CASE 1:

nlp = spacy.load("en_core_web_sm")
print(nlp)
sent = "hello china hello japan"
doc = nlp(sent)
for i in doc.ents:
  print(i.text," ",i.label_)

OUTPUT: no entities are printed in this case.

CASE 2:

nlp = spacy.load("en_core_web_sm")
print(nlp)
sent = "china is a populous nation in East Asia whose vast landscape encompasses grassland, desert, mountains, lakes, rivers and more than 14,000km of coastline."
doc = nlp(sent)
for i in doc.ents:
  print(i.text," ",i.label_)

OUTPUT:

<spacy.lang.en.English object at 0x7f2213bde080>
china   GPE
East Asia   LOC
more than 14,000km   QUANTITY

Natural language models, like the spaCy NER model, learn from the contextual structure of the sentence (the surrounding words). Why is that? Let's take the word Anwarvic as an example: it is a new word that you haven't seen before, and the spaCy model probably hasn't seen it before either. Let's see how the NER model behaves as the surrounding sentence changes:

  • "I love Anwarvic"
>>> nlp = spacy.load("en_core_web_sm")
>>> sent = "I love Anwarvic"
>>> doc = nlp(sent)
>>> for i in doc.ents:
...     print(i.text," ",i.label_)
Anwarvic   PERSON
  • "Anwarvic is gigantic"
>>> nlp = spacy.load("en_core_web_sm")
>>> sent = "Anwarvic is gigantic"
>>> doc = nlp(sent)
>>> for i in doc.ents:
...     print(i.text," ",i.label_)
Anwarvic   ORG
  • "Anwarvic is awesome"
>>> nlp = spacy.load("en_core_web_sm")
>>> sent = "Anwarvic is awesome"
>>> doc = nlp(sent)
>>> for i in doc.ents:
...     print(i.text," ",i.label_)

As we can see, the extracted entities vary as the contextual structure around Anwarvic varies. In the first sentence, the verb love is very commonly used with people, which is why the spaCy model predicted it as a PERSON. The same happens in the second sentence, where gigantic is the kind of adjective often used to describe organizations, so the model tagged it ORG. In the third sentence, awesome is a fairly generic adjective that can describe basically anything. That's why the spaCy NER model was confused and extracted no entity at all.

Sidenote

Actually, when I ran the first provided code on my machine, it extracted both china and japan, like so:

china   GPE
japan   GPE

NER normally works like this: a POS tagger first labels your sentence with part-of-speech tags such as verbs, adjectives, and proper nouns. The NER component then looks more closely at the nouns. A POS tagger gets better the more information it has to classify the tags correctly: that is, longer sentences, grammatically correct sentences, and correct spelling.

Your first example, sent = "hello china hello japan", is short and has no verbs etc., which makes it difficult for the tagger to classify POS tags. Another piece of information is also missing: country names are normally written upper-case. Try sent = "hello China hello Japan" and it will work.

In your second example, the model detects china correctly even though it is lower-case, because the sentence as a whole carries much more information.

I recommend you read more about POS tagging; it's quite fun!

The first answer to your question is context, and that has already been covered above. This would be common to other NLP libraries as well.

The second answer, which is spaCy-specific, is non-determinism. spaCy uses various internal components that depend on a random seed. As stated in this forum post, it may be necessary to set at least the numpy and cupy seeds to get predictable results. It can well happen that one machine gives you a different output than another machine running the same code.
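spaCy ships a helper for exactly this, spacy.util.fix_random_seed, which seeds Python's random and numpy (and cupy/torch when they are installed) in one call. It matters mostly when training a model rather than when running a pretrained pipeline, but it is the simplest way to pin down the randomness. A minimal sketch:

```python
import random
import spacy

# Seed Python's random and numpy (plus cupy/torch if installed) in one call
spacy.util.fix_random_seed(0)
a = random.random()

# Re-seeding reproduces the same sequence of draws
spacy.util.fix_random_seed(0)
b = random.random()

print(a == b)  # prints True
```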
