
Tokenizing based on certain pattern with Python

I have to tokenize certain patterns from sentences like abc ABC - - 12 V and ab abc 1,2W. Here both 12 V and 1,2W are values with units. So I want to tokenize the first as abc, ABC, 12 V, and the second as ab, abc, 1,2W. How can I do that? nltk's word_tokenize(test_word) is an option, but I cannot insert any pattern into it, or can I?

If your input is predictable, in the sense that you know which characters appear between your tokens (in this case I see a space and a hyphen), you can use a regex to extract what you want:

import re

def is_float(s):
    # Matches an optionally signed number with an optional '.' or ',' decimal part.
    return re.match(r'^-?\d+(?:[.,]\d+)?$', s)

def extract_tokens(phrase, noise="-"):
    # Replace noise characters with spaces, then split on whitespace.
    phrase_list = re.split(r"\s+", re.sub(noise, " ", phrase).strip())
    phrase_tokenized = []
    i, n = 0, len(phrase_list)
    while i < n:
        phrase_tokenized.append(phrase_list[i])
        # If this token is a number, merge the following token into it as its unit.
        if (phrase_list[i].isdigit() or is_float(phrase_list[i])) and i < n-1:
            phrase_tokenized[-1] += " " + phrase_list[i+1]
            i += 1
        i += 1
    return phrase_tokenized

Sample test:

>>> extract_tokens("abc ABC - - 12 V")
['abc', 'ABC', '12 V']
>>> extract_tokens("ab abc 1,2W")
['ab', 'abc', '1,2W']

And to "insert a pattern" all you need to do is update the noise parameter according to what you want.而要“插入模式”,您需要做的就是根据您的需要更新noise参数。
