
Given a word, can we get all possible lemmas for it using spaCy?

The input word is standalone and not part of a sentence, but I would like to get all of its possible lemmas, as if the input word appeared in different sentences with every possible POS tag. I would also like to get the lookup version of the word's lemma.

Why am I doing this?

I have extracted lemmas from all the documents, and I have also counted the number of dependency links between lemmas; both were done using en_core_web_sm. Now, given an input word, I would like to return the lemmas that are most frequently linked to any of the input word's possible lemmas.

So in short, I would like to replicate the behaviour of token.lemma_ for the input word under all possible POS tags, to stay consistent with the lemma links I have already counted.
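The lookup described in the question can be sketched as a simple counting structure. The lemma names and link counts below are hypothetical stand-ins for the dependency-link statistics the question says were extracted beforehand:

```python
from collections import Counter
from typing import Dict, List

# Hypothetical counts of dependency links between lemma pairs,
# standing in for the statistics extracted with en_core_web_sm.
link_counts: Dict[str, Counter] = {
    "watch": Counter({"movie": 12, "time": 7, "wrist": 3}),
    "see": Counter({"movie": 9, "doctor": 5}),
}

def most_linked(candidate_lemmas: List[str], top_n: int = 3) -> List[str]:
    """Merge link counts over every candidate lemma of the input word
    and return the most frequently linked lemmas."""
    merged: Counter = Counter()
    for lemma in candidate_lemmas:
        merged.update(link_counts.get(lemma, Counter()))
    return [lemma for lemma, _ in merged.most_common(top_n)]

# 'watches' lemmatizes to 'watch' for both NOUN and VERB, so we query once:
print(most_linked(["watch"]))  # ['movie', 'time', 'wrist']
```

Once all possible lemmas of the input word are known (the actual problem of the question), they can simply be passed together as `candidate_lemmas`.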

I found it difficult to get lemmas and inflections directly out of spaCy without first constructing an example sentence to give it context. This wasn't ideal, so I looked further and found that LemmInflect does this very well.

>>> from lemminflect import getAllLemmas, getAllInflections
>>> getAllLemmas('watches')
{'NOUN': ('watch',), 'VERB': ('watch',)}
>>> getAllInflections('watch')
{'NN': ('watch',), 'NNS': ('watches', 'watch'), 'VB': ('watch',), 'VBD': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',), 'VBP': ('watch',)}

spaCy is just not designed to do this: it's made for analyzing text, not producing text.

The linked library looks good, but if you want to stick with spaCy, or need languages besides English, you can look at spacy-lookups-data, which is the raw data used for lemmas. Generally there is a dictionary for each part of speech that lets you look up the lemma for a word.
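A minimal sketch of how those per-POS lookup tables can be queried. The tiny inline tables below are toy stand-ins for the much larger JSON files shipped in spacy-lookups-data:

```python
from typing import Dict

# Toy stand-ins for lemma lookup tables; the real files in
# spacy-lookups-data are large JSON dictionaries, for some
# languages split by part of speech.
lemma_tables: Dict[str, Dict[str, str]] = {
    "NOUN": {"watches": "watch", "geese": "goose"},
    "VERB": {"watches": "watch", "went": "go"},
}

def all_lemmas(word: str) -> Dict[str, str]:
    """Return every lemma found for `word` across the per-POS tables."""
    return {pos: table[word] for pos, table in lemma_tables.items() if word in table}

print(all_lemmas("watches"))  # {'NOUN': 'watch', 'VERB': 'watch'}
```

Scanning every POS table like this gives the "all possible lemmas" behaviour the question asks for, at the cost of one dictionary probe per part of speech.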

To get alternative lemmas, I am trying a combination of spaCy's rule_lemmatize and the spaCy lookup data. rule_lemmatize may produce more than one valid lemma, whereas the lookup data only offers one lemma for a given word (in the files I have inspected). There are, however, cases where the lookup data produces a lemma while rule_lemmatize does not.

My examples are for Spanish:

import spacy
import spacy_lookups_data

import json
import pathlib

# text = "fui"
text = "seguid"
# text = "contenta"
print("Input text: \t\t" + text)

# Find lemmas using rules:
nlp = spacy.load("es_core_news_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
doc = nlp(text)
rule_lemmas = lemmatizer.rule_lemmatize(doc[0])
print("Lemmas using rules: " + ", ".join(rule_lemmas))

# Find lemma using lookup:
lookups_path = pathlib.Path(spacy_lookups_data.__file__).parent / "data" / "es_lemma_lookup.json"
with open(lookups_path, "r", encoding="utf-8") as file_object:
    lookup = json.load(file_object)
print("Lemma from lookup: \t" + lookup[text])

Output:

Input text:         fui        # I went; I was (two verbs with same form in this tense)
Lemmas using rules: ir, ser    # to go, to be (both possible lemmas returned)
Lemma from lookup:  ser        # to be

Input text:         seguid     # Follow! (imperative)
Lemmas using rules: seguid     # Follow! (lemma not returned) 
Lemma from lookup:  seguir     # to follow

Input text:         contenta   # (it) satisfies (verb); contented (adjective) 
Lemmas using rules: contentar  # to satisfy (verb but not adjective lemma returned)
Lemma from lookup:  contento   # contented (adjective, lemma form)
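Given the outputs above, one way to merge the two sources is to take the union of the rule lemmas and the lookup lemma, discarding a rule result that merely echoes the input (as happened for "seguid"). This helper is a sketch of that idea, not part of spaCy:

```python
from typing import List, Optional

def combine_lemmas(word: str, rule_lemmas: List[str],
                   lookup_lemma: Optional[str]) -> List[str]:
    """Union of rule-based and lookup lemmas. A rule lemma identical to
    the input word is treated as a failed lemmatization and dropped."""
    candidates = [lem for lem in rule_lemmas if lem != word]
    if lookup_lemma and lookup_lemma not in candidates:
        candidates.append(lookup_lemma)
    return candidates or [word]  # last resort: the word itself

print(combine_lemmas("fui", ["ir", "ser"], "ser"))            # ['ir', 'ser']
print(combine_lemmas("seguid", ["seguid"], "seguir"))         # ['seguir']
print(combine_lemmas("contenta", ["contentar"], "contento"))  # ['contentar', 'contento']
```

Note the echo-dropping rule is a heuristic: for words that genuinely are their own lemma (e.g. an infinitive), the lookup lemma or the word itself still survives via the fallbacks.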

