Force spaCy lemmas to be lowercase
Is it possible to leave the token text true-cased, but force the lemmas to be lowercased? I am interested in this because I want to use the PhraseMatcher, where I run an input text through the pipeline and then search for matching phrases in that text, where each search query can be case-sensitive or not. In the case that I search by lemma, I'd like the search to be case-insensitive by default.
For example:

```python
doc = nlp(text)
for query in queries:
    if case1:
        attr = "LEMMA"
    elif case2:
        attr = "ORTH"
    elif case3:
        attr = "LOWER"
    phrase_matcher = PhraseMatcher(self.vocab, attr=attr)
    phrase_matcher.add(key, query)
    matches = phrase_matcher(doc)
```
In case 1, I expect matching to be case-insensitive, and if there were something in the spaCy library to enforce that lemmas are lowercased by default, this would be much more efficient than keeping multiple versions of the doc and forcing one of them to be all lowercase.
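For context, the ORTH and LOWER branches above already behave this way out of the box. A minimal sketch with a blank English pipeline (the sample text and match key are made up for illustration):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; enough for ORTH/LOWER matching
doc = nlp("I love New York pizza")

# ORTH compares the exact surface form, so case matters.
orth_matcher = PhraseMatcher(nlp.vocab, attr="ORTH")
orth_matcher.add("CITY", [nlp("new york")])
print(len(orth_matcher(doc)))  # 0 -- "New York" != "new york"

# LOWER compares lowercased forms, so case is ignored.
lower_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
lower_matcher.add("CITY", [nlp("new york")])
print(len(lower_matcher(doc)))  # 1
```

LEMMA is the odd one out: lemmas preserve whatever casing the lemmatizer produced, which is what motivates the question.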
This part of spaCy changes from version to version; the last time I looked at the lemmatization was a few versions ago. So this solution might not be the most elegant one, but it is definitely a simple one:
```python
# Create a pipe that converts lemmas to lower case:
def lower_case_lemmas(doc):
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

# Add it to the pipeline (spaCy v2 API, where components are passed as functions)
nlp.add_pipe(lower_case_lemmas, name="lower_case_lemmas", after="tagger")
```
You will need to figure out where in the pipeline to add it. The latest documentation mentions that the Lemmatizer uses POS tagging info, so I am not sure at what point it is called. Placing your pipe after the tagger is safe; all the lemmas should be figured out by then.
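In spaCy v3 the same idea requires registering the component by name before adding it. A sketch under that assumption; the `toy_lemmas` component is a made-up stand-in that copies each token's text into its lemma so the example runs without a trained model (in a real pipeline the tagger/lemmatizer supplies the lemmas):

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher

@Language.component("toy_lemmas")
def toy_lemmas(doc):
    # Stand-in for a real lemmatizer: copy the surface form into lemma_.
    for token in doc:
        token.lemma_ = token.text
    return doc

@Language.component("lower_case_lemmas")
def lower_case_lemmas(doc):
    # Force every lemma to lower case, leaving token.text untouched.
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("toy_lemmas")
nlp.add_pipe("lower_case_lemmas")

doc = nlp("I saw New York Pizza downtown")
matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
matcher.add("PIZZA", [nlp("new york pizza")])
print(len(matcher(doc)))  # 1 -- matches despite the different casing
```

With a trained pipeline (e.g. `en_core_web_sm`) you would drop `toy_lemmas` and place `lower_case_lemmas` after the component that assigns lemmas.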
Another option I can think of is to derive a custom lemmatizer from the Lemmatizer class and override its __call__ method, but this is likely to be quite invasive, as you will need to figure out how (and where) to plug in your own lemmatizer.
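A minimal sketch of that subclassing option, assuming spaCy v3 where the Lemmatizer is itself a pipeline component. Wiring this subclass into a pipeline via a custom factory is the invasive part mentioned above and is omitted here:

```python
from spacy.pipeline import Lemmatizer

class LowercaseLemmatizer(Lemmatizer):
    """Hypothetical subclass: lemmatize as usual, then lower-case the result."""

    def __call__(self, doc):
        doc = super().__call__(doc)  # let the base class assign lemmas first
        for token in doc:
            token.lemma_ = token.lemma_.lower()
        return doc
```

Compared with the simple pipe above, this buys little and ties you to the Lemmatizer internals of a particular spaCy version, so the pipe approach is usually preferable.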