In spacy, how can I make sure a particular character is always considered a full token?

In spacy, I'd like characters like '€', '$', or '¥' to always be considered a token. However, it seems they are sometimes made part of a bigger token. For example, this is good (two tokens):

>>> len(nlp("100€"))
2

But the following is not what I want (I'd like to obtain two tokens in this case also):

>>> len(nlp("N€"))
1

How could I achieve that with spacy? By the way, don't get too focused on the currency example; I've had this kind of problem with other kinds of characters that have nothing to do with numbers or currencies. The problem is how to make sure a character is always treated as a full token and not glued to some other string in the sentence.

See here.

Spacy's tokenizer works by iterating over whitespace-separated substrings and looking for things like prefixes or suffixes to split off. You can add custom prefixes and suffixes as explained in the link above.

We can use that as follows:

import spacy

nlp = spacy.load('en_core_web_lg')

# With the default rules, '€' is not split off here.
doc = nlp("N€")
print([t for t in doc])
# [N€]

# Add '€' to the suffix rules. Wrapping in list() keeps this working
# whether Defaults.suffixes is a tuple or a list (this varies between
# spaCy versions).
suffixes = list(nlp.Defaults.suffixes) + ["€"]

# Recompile the suffix regex and plug it back into the tokenizer.
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

doc = nlp("N€")
print([t for t in doc])
# [N, €]
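
A suffix rule only splits the character off the end of a whitespace-separated chunk. If the same character can also show up at the start or in the middle of a chunk (say "€100" or "N€5"), you can register it as a prefix and an infix in the same way. The following is a minimal sketch, not part of the original answer: special_char is just an illustrative variable name, and re.escape is used so regex metacharacters such as '$' are handled safely.

import re
import spacy

nlp = spacy.load('en_core_web_lg')

special_char = "€"                  # illustrative: any character you want isolated
escaped = re.escape(special_char)   # escape regex metacharacters such as '$'

# Add the character to the prefix, suffix, and infix rule sets.
prefixes = list(nlp.Defaults.prefixes) + [escaped]
suffixes = list(nlp.Defaults.suffixes) + [escaped]
infixes = list(nlp.Defaults.infixes) + [escaped]

# Recompile each pattern and plug it back into the tokenizer.
nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(prefixes).search
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

print([t.text for t in nlp("€100")])  # ['€', '100']
print([t.text for t in nlp("N€5")])   # ['N', '€', '5']

Covering all three rule sets means the character is split off wherever it appears in a whitespace-separated substring, which is what "always treated as a full token" amounts to for spaCy's tokenizer.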
