
nltk custom tokenizer and tagger

Here is my requirement. I want to tokenize and tag a paragraph in such a way that it allows me to achieve the following things.

  • It should identify dates and times in the paragraph and tag them as DATE and TIME.
  • It should identify known phrases in the paragraph and tag them as CUSTOM.
  • The rest of the content should be tokenized by nltk's default word_tokenize and pos_tag functions.

For example, the following sentence

"They all like to go there on 5th November 2010, but I am not interested."

should be tokenized and tagged as follows, given that the custom phrase is "I am not interested":

[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'), 
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','), 
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]

Any suggestions would be useful.

The proper answer is to compile a large dataset tagged the way you want, then train a machine-learned chunker on it. If that's too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here:

import re

from nltk import pos_tag, word_tokenize

s = "They all like to go there on 5th November 2010, but I am not interested."

# The pattern must match every *prefix* of a date ("5th", "5th November",
# "5th November 2010"), so the month and year parts are optional;
# otherwise the phrase buffer below would be reset before a full date
# is ever accumulated.
DATE = re.compile(r'^[1-9][0-9]?(st|nd|rd|th)?( (January|...))?( [12][0-9][0-9][0-9])?$')

def custom_tagger(sentence):
    tagged = pos_tag(word_tokenize(sentence))
    phrase = []          # tokens of the date candidate being accumulated
    date_found = False   # True once the phrase has matched DATE at least once

    i = 0
    while i < len(tagged):
        (w, t) = tagged[i]
        phrase.append(w)
        in_date = DATE.match(' '.join(phrase))
        date_found |= bool(in_date)
        if date_found and not in_date:             # date ended before this token
            yield (' '.join(phrase[:-1]), 'DATE')
            phrase = []                            # current token gets reprocessed
            date_found = False
        elif date_found and i == len(tagged) - 1:  # date runs to end of sentence
            yield (' '.join(phrase), 'DATE')
            return
        else:
            i += 1
            if not in_date:
                yield (w, t)
                phrase = []

Todo: expand the DATE re, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens, and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that contain only an ordinal number.)
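For the CUSTOM-phrase part of that todo, one option is a pure post-processing pass over the (word, tag) pairs that merges any known phrase into a single CUSTOM token. A minimal sketch, where the function name `tag_custom_phrases` and the hard-coded tags are illustrative, not part of the answer above:

```python
def tag_custom_phrases(tagged, phrases):
    """Merge each known phrase (a tuple of tokens) found in `tagged`
    into a single ('joined words', 'CUSTOM') pair, longest match first."""
    # Try longer phrases before shorter ones so the longest match wins.
    phrases = sorted(phrases, key=len, reverse=True)
    out = []
    i = 0
    while i < len(tagged):
        for phrase in phrases:
            n = len(phrase)
            if tuple(w for w, _ in tagged[i:i + n]) == phrase:
                out.append((' '.join(phrase), 'CUSTOM'))
                i += n
                break
        else:  # no phrase starts here; keep the original (word, tag) pair
            out.append(tagged[i])
            i += 1
    return out

tagged = [('but', 'CC'), ('I', 'PRP'), ('am', 'VBP'), ('not', 'RB'),
          ('interested', 'JJ'), ('.', '.')]
print(tag_custom_phrases(tagged, [('I', 'am', 'not', 'interested')]))
# → [('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]
```

Because it works on tagged pairs rather than raw text, this pass composes with the date tagger above: run it on the output of `custom_tagger`.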

You should probably do chunking with the nltk.RegexpParser to achieve your objective. 您可能应该使用nltk.RegexpParser进行分块以实现您的目标。
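A sketch of what that chunking could look like for the date in the example. The tags are supplied by hand here (running `pos_tag` needs the tagger model downloaded), and the `<JJ><NNP><CD>` grammar is an illustrative guess at how the tagger would label "5th November 2010":

```python
import nltk

# Hand-tagged tokens; in practice these would come from pos_tag(word_tokenize(s)).
tagged = [('on', 'IN'), ('5th', 'JJ'), ('November', 'NNP'), ('2010', 'CD')]

grammar = r"DATE: {<JJ><NNP><CD>}"   # e.g. "5th November 2010"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Collect the words of every DATE chunk in the resulting parse tree.
for subtree in tree.subtrees(filter=lambda t: t.label() == 'DATE'):
    print(' '.join(word for word, tag in subtree.leaves()))
```

Unlike the regex-over-strings approach above, RegexpParser matches on POS tags, which already covers part of the todo about "matching POS tags as well as tokens".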

Reference: http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1
