
How to get better lemmas from spaCy

While "PM" can mean "pm" (the time of day), it can also mean "Prime Minister".

I want to capture the latter: I want the lemma of "PM" to return "Prime Minister". How can I do this using spaCy?

Example returning an unexpected lemma:

>>> import spacy
>>> #nlp = spacy.load('en')
>>> nlp = spacy.load('en_core_web_lg')
>>> doc = nlp(u'PM means prime minister')
>>> for word in doc:
...     print(word.text, word.lemma_)
... 
PM pm
means mean
prime prime
minister minister

As per the docs at https://spacy.io/api/annotation, spaCy uses WordNet for lemmas:

"A lemma is the uninflected form of a word. The English lemmatization data is taken from WordNet."

When I tried inputting "pm" in WordNet, it showed "Prime Minister" as one of the lemmas.

What am I missing here?

I think clarifying some common NLP tasks would help answer your question.

Lemmatization is the process of finding the canonical form of a word given its different inflections. For example, run, runs, ran and running are forms of the same lexeme: run. If you were to lemmatize run, runs, and ran, the output would be run in each case. In your example sentence, note how it lemmatizes means to mean.
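To make the idea concrete, here is a minimal sketch of lemmatization as a plain lookup from inflected forms to a canonical form. The table and the `lemmatize` helper are hypothetical and only illustrate the concept; this is not how spaCy implements its lemmatizer.

```python
# Toy lookup table: inflected form -> lemma (hypothetical illustration data).
LEMMA_TABLE = {
    "runs": "run",
    "ran": "run",
    "running": "run",
    "means": "mean",
}

def lemmatize(word):
    """Return the lemma for a word, falling back to its lowercased form."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

words = ["run", "runs", "ran", "running", "means", "PM"]
lemmas = [lemmatize(w) for w in words]
print(lemmas)
```

In this sketch "PM" has no inflected forms to map, so the only thing the fallback can do is normalize its case. Expanding an abbreviation into "Prime Minister" is a different kind of operation from stripping an inflection.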

Given that, it doesn't sound like the task you want to perform is lemmatization. It may help to solidify this idea with a silly counterexample: what would the different inflections of a hypothetical lemma "pm" be? pming, pmed, pms? None of those are actual words.

It sounds like your task may be closer to Named Entity Recognition (NER), which you could also do in spaCy. To iterate through the detected entities in a parsed document, you can use the .ents attribute, as follows:

>>> for ent in doc.ents:
...     print(ent, ent.label_)

With the sentence you've given, spaCy (v. 2.0.5) doesn't detect any entities. If you replace "PM" with "PM" it will detect that as an entity, but as a GPE.

The best thing to do depends on your task, but if you want your desired classification of the "PM" entity, I'd look at setting entity annotations. If you want to pull out every mention of "PM" from a big corpus of documents, use the matcher in a pipeline.
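As a rough sketch of those two suggestions combined, the snippet below uses spaCy's `Matcher` to find every "PM" token and then sets the entity annotation by hand. It assumes the spaCy v3 API (the `Matcher.add` signature differs in v2), uses a blank English pipeline so no trained model is needed, and the "TITLE" label, the "PM_TITLE" pattern name, and the `tag_pm` helper are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained model required (assumes spaCy v3).
nlp = spacy.blank("en")

# Match the exact token text "PM"; "PM_TITLE" is an arbitrary pattern name.
matcher = Matcher(nlp.vocab)
matcher.add("PM_TITLE", [[{"ORTH": "PM"}]])

def tag_pm(text):
    """Tokenize text and mark every 'PM' token as a (hypothetical) TITLE entity."""
    doc = nlp(text)
    spans = [Span(doc, start, end, label="TITLE") for _, start, end in matcher(doc)]
    doc.ents = spans  # set the entity annotations manually
    return doc

doc = tag_pm("PM means prime minister")
pm_entities = [(ent.text, ent.label_) for ent in doc.ents]
print(pm_entities)
```

With a trained pipeline you would need to merge these spans with whatever the NER component already predicted, since `doc.ents` does not accept overlapping spans.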

When I look up the lemmas of prime minister in WordNet through nltk (which uses WordNet as well), I get:

>>> from nltk.corpus import wordnet as wn
>>> [str(lemma.name()) for lemma in wn.synset('prime_minister.n.01').lemmas()]
['Prime_Minister', 'PM', 'premier']

It keeps acronyms the same, so maybe you want to check word.lemma (the integer ID, as opposed to the word.lemma_ string), which might give you a different ID according to the context?


 