Tokenizing based on a certain pattern with Python
I have to tokenize certain patterns from sentences like "abc ABC - - 12 V" and "ab abc 1,2W". Here both "12 V" and "1,2W" are values with units, so I want the first sentence tokenized as "abc", "ABC", "12 V", and the second as "ab", "abc", "1,2W". How can I do that? NLTK's word_tokenize is an option, but I cannot insert any pattern into it, or can I?

word_tokenize(test_word)
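For context, a plain whitespace split (a minimal stand-in here, not NLTK's actual tokenizer) already shows the problem: the value and its unit come out as separate tokens, and the hyphens survive as noise.

```python
sentence = "abc ABC - - 12 V"

# Naive whitespace split: "12" and "V" end up as separate tokens,
# and the hyphens are kept as tokens of their own.
tokens = sentence.split()
print(tokens)  # ['abc', 'ABC', '-', '-', '12', 'V']
```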
If your input is predictable, in the sense that you know which characters appear between your tokens (in this case I see a space and a hyphen), you can use a regex to extract what you want:
import re

def is_float(s):
    # Matches an optionally signed number with "." or "," as decimal separator.
    return re.match(r'^-?\d+(?:[.,]\d+)?$', s)

def extract_tokens(phrase, noise="-"):
    phrase_list = re.split(r"\s+", re.sub(noise, " ", phrase).strip())
    phrase_tokenized = []
    i, n = 0, len(phrase_list)
    while i < n:
        phrase_tokenized.append(phrase_list[i])
        # If the current token is a number and another token follows,
        # merge the two into one "value + unit" token.
        if (phrase_list[i].isdigit() or is_float(phrase_list[i])) and i < n - 1:
            phrase_tokenized[-1] += " " + phrase_list[i + 1]
            i += 1
        i += 1
    return phrase_tokenized
Sample test:
>>> extract_tokens("abc ABC - - 12 V")
['abc', 'ABC', '12 V']
>>> extract_tokens("ab abc 1,2W")
['ab', 'abc', '1,2W']
And to "insert a pattern" all you need to do is update the noise parameter according to what you want.
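As an alternative sketch (not part of the original answer), a single re.findall pattern can capture a number plus a trailing letter unit as one token; the pattern below is an assumption tuned to the two example sentences, not a general-purpose tokenizer.

```python
import re

# A number (with optional "." or "," decimal part) optionally followed by a
# letter unit, with at most one space in between; otherwise any plain word.
# Hyphens match neither branch, so they are skipped automatically.
TOKEN = re.compile(r'-?\d+(?:[.,]\d+)?\s?[A-Za-z]+|\w+')

def extract_tokens_findall(phrase):
    return TOKEN.findall(phrase)

print(extract_tokens_findall("abc ABC - - 12 V"))  # ['abc', 'ABC', '12 V']
print(extract_tokens_findall("ab abc 1,2W"))       # ['ab', 'abc', '1,2W']
```

Because the noise characters are simply never matched, this version has no separate noise parameter; to tolerate other separators you would extend the pattern instead.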