
Can I apply custom token rules to tokens split by prefixes in spaCy?

I customized a spaCy Tokenizer with additional rules and prefixes to treat w/ and f/ as with and for, respectively. The prefixes are correctly splitting them off, but the custom rules for lemmas and norms are not being applied in that case.

Here's an excerpt of the code.

import spacy
from spacy.tokenizer import Tokenizer
from spacy.symbols import ORTH, LEMMA, NORM


def create_tokenizer(nlp):
    # Start from the default exceptions and add special cases for w/ and f/
    rules = dict(nlp.Defaults.tokenizer_exceptions)
    rules.update({
        'w/': [{ORTH: 'w/', LEMMA: 'with', NORM: 'with'}],
        'W/': [{ORTH: 'W/', LEMMA: 'with', NORM: 'with'}],
        'f/': [{ORTH: 'f/', LEMMA: 'for', NORM: 'for'}],
        'F/': [{ORTH: 'F/', LEMMA: 'for', NORM: 'for'}],
    })

    # Prefix patterns so w/ and f/ are split off the following token
    custom_prefixes = (
        r"[wW]/",
        r"[fF]/",
    )

    prefix_re = spacy.util.compile_prefix_regex(nlp.Defaults.prefixes + custom_prefixes)

    return Tokenizer(
        nlp.vocab,
        rules=rules,
        prefix_search=prefix_re.search,
    )

Here's the result.

>>> doc = nlp("This w/ that")
>>> doc[1]
w/
>>> doc[1].norm_
'with'
>>> doc = nlp("This w/that")
>>> doc[1]
w/
>>> doc[1].norm_
'w/'

In the case of "This w/that", the w/ is getting split off, but it doesn't have the custom rules applied (i.e., the NORM is w/ instead of with). What do I need to do to have custom rules applied to tokens split off by prefixes/infixes/suffixes?

Unfortunately there's no way to have prefixes and suffixes also analyzed as exceptions in spaCy v2. Tokenizer exceptions will be handled more generally in the upcoming spaCy v3 release in order to support cases like this, but I don't know when that release might be at this point.

I think the best you can do in spaCy v2 is to add a quick postprocessing component that assigns the lemmas/norms to the individual tokens if they match the orth pattern.
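As a sketch of what that could look like, here is a minimal postprocessing component, assuming spaCy v2's pipeline API (where nlp.add_pipe accepts a plain function). The SPECIAL_CASES table and the fix_norms name are illustrative choices, not part of the original answer, and this assumes the create_tokenizer from the question has already been installed on the pipeline.

import spacy

# orth -> (lemma, norm) for the tokens the custom prefixes split off
SPECIAL_CASES = {
    'w/': ('with', 'with'),
    'W/': ('with', 'with'),
    'f/': ('for', 'for'),
    'F/': ('for', 'for'),
}

def fix_norms(doc):
    # Assign the custom lemma/norm to any token whose text matches,
    # including tokens produced by prefix/infix/suffix splits
    for token in doc:
        if token.text in SPECIAL_CASES:
            token.lemma_, token.norm_ = SPECIAL_CASES[token.text]
    return doc

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = create_tokenizer(nlp)  # the tokenizer from the question
nlp.add_pipe(fix_norms, last=True)     # run after the built-in components

With this in place, doc[1].norm_ for "This w/that" should come back as 'with', since the component sets the norm directly regardless of how the token was split.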
