
Force spaCy lemmas to be lowercase

Is it possible to leave the token text true-cased but force the lemmas to be lowercase? I am interested in this because I want to use the PhraseMatcher: I run an input text through the pipeline once and then search that text for matching phrases, where each search query may or may not be case-sensitive. When I search by lemma, I'd like the search to be case-insensitive by default.

For example:

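from spacy.matcher import PhraseMatcher

doc = nlp(text)

for query in queries:
    # case1, case2 and case3 stand for the three kinds of search query
    if case1:
        attr = "LEMMA"  # match on lemmas; should be case-insensitive
    elif case2:
        attr = "ORTH"   # match on the exact token text (case-sensitive)
    elif case3:
        attr = "LOWER"  # match on the lowercased token text
    phrase_matcher = PhraseMatcher(nlp.vocab, attr=attr)
    phrase_matcher.add(key, query)
    matches = phrase_matcher(doc)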

In case 1, I expect matching to be case-insensitive. If something in the spaCy library could enforce that lemmas are lowercased by default, this would be much more efficient than keeping multiple versions of the doc and forcing one of them to be all lowercase.

This part of spaCy changes from version to version, and I last looked at lemmatization a few versions ago. So this solution might not be the most elegant one, but it is definitely a simple one:

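# Create a pipe that converts lemmas to lower case.
# The token text itself is left untouched; only the lemma attribute changes.
def lower_case_lemmas(doc):
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

# Add it to the pipeline.
# (spaCy v2-style call: the function itself is passed to add_pipe.)
nlp.add_pipe(lower_case_lemmas, name="lower_case_lemmas", after="tagger")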

You will need to figure out where in the pipeline to add it. The latest documentation mentions that the Lemmatizer uses POS tagging information, so I am not sure at what point it is called. Placing your pipe after the tagger should be safe, since all the lemmas should have been assigned by then.
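In spaCy v3 and later, add_pipe takes the name of a registered component rather than the function itself, so the same idea might look roughly like the sketch below. It is only an illustration under some assumptions: it uses a blank English pipeline and hand-annotated lemmas so that it runs without a trained model, and the match key "APPLE_PIE" and the sample words are made up for the example. With a trained pipeline, the component would more likely be added with after="lemmatizer", assuming the pipeline has a component by that name.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc


# spaCy v3 registers components by name; the name is reused from the snippet above.
@Language.component("lower_case_lemmas")
def lower_case_lemmas(doc):
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc


nlp = spacy.blank("en")
# A blank pipeline has no lemmatizer, so the component simply goes last here.
nlp.add_pipe("lower_case_lemmas")

# Hand-annotated doc (hypothetical data) standing in for real lemmatizer output.
doc = Doc(
    nlp.vocab,
    words=["I", "like", "Apple", "Pies"],
    lemmas=["I", "like", "Apple", "Pie"],
)
doc = lower_case_lemmas(doc)

# The query is annotated the same way; its lemmas are already lowercase.
query = Doc(nlp.vocab, words=["apple", "pie"], lemmas=["apple", "pie"])

phrase_matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
phrase_matcher.add("APPLE_PIE", [query])

# Collect the text of every matched span in the document.
matched_spans = [doc[start:end].text for _, start, end in phrase_matcher(doc)]
print(matched_spans)
```

Because only the lemma attribute is rewritten, the matched span keeps its original casing, and the same doc can still be searched with attr="ORTH" or attr="LOWER".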

Another option I can think of is to derive a custom lemmatizer from the Lemmatizer class and override its __call__ method. This is likely to be quite invasive, though, as you will need to figure out how (and where) to plug in your own lemmatizer.
