nltk custom tokenizer and tagger
Here is my requirement. I want to tokenize and tag a paragraph in a way that lets me achieve the following.
For example, the following sentence
"They all like to go there on 5th November 2010, but I am not interested."
should be tokenized and tagged as follows when the custom phrase is "I am not interested":
[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'),
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','),
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]
Any suggestions would be useful.
The proper answer is to compile a large dataset tagged in the way you want, then train a machine-learned chunker on it. If that's too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here:
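import re

from nltk import pos_tag, word_tokenize

s = "They all like to go there on 5th November 2010, but I am not interested."

# "..." stands for the remaining month names; expand it before use.
# The month is optional so that a partial date such as "5th" still matches
# and the phrase can keep growing token by token.
DATE = re.compile(r'^[1-9][0-9]?(st|nd|rd|th)?( (January|...))?( [12][0-9][0-9][0-9])?$')

def custom_tagger(sentence):
    tagged = pos_tag(word_tokenize(sentence))
    phrase = []          # tokens of the candidate date seen so far
    date_found = False
    i = 0
    while i < len(tagged):
        (w, t) = tagged[i]
        phrase.append(w)
        in_date = DATE.match(' '.join(phrase))
        date_found |= bool(in_date)
        if date_found and not in_date:             # end of date found
            yield (' '.join(phrase[:-1]), 'DATE')
            phrase = []                            # re-examine tagged[i] next
            date_found = False
        elif date_found and i == len(tagged) - 1:  # date ends the sentence
            yield (' '.join(phrase), 'DATE')
            return
        else:
            i += 1
            if not in_date:                        # ordinary token
                yield (w, t)
                phrase = []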
Todo: expand the DATE re, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens, and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that only contain an ordinal number.)
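One possible way to cover the Todo items is sketched below. It is not the answer's own code: it swaps the incremental matching for a simpler "try the longest window first" loop over tokens that have already been POS-tagged. The full month list, the CUSTOM_PHRASES set, the MAX_LEN limit, the merge_phrases name and the tags given to the individual tokens are all assumptions made for illustration. Because the pattern requires a month, a lone ordinal such as 5th is never tagged as a date.

```python
import re

# Assumption: the full month list; the answer above leaves it elided.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE = re.compile(r"[1-9][0-9]?(st|nd|rd|th)? (%s)( [12][0-9]{3})?" % MONTHS)

# Hypothetical custom phrases, stored as tuples of tokens.
CUSTOM_PHRASES = {("I", "am", "not", "interested")}
MAX_LEN = 4  # longest span, in tokens, that is tried at each position


def merge_phrases(tagged):
    """Merge DATE and CUSTOM spans in a list of (word, tag) pairs."""
    result = []
    i = 0
    while i < len(tagged):
        # Try the longest window first so that the longest match wins.
        for n in range(min(MAX_LEN, len(tagged) - i), 0, -1):
            words = tuple(w for w, _ in tagged[i:i + n])
            text = " ".join(words)
            if words in CUSTOM_PHRASES:
                result.append((text, "CUSTOM"))
                break
            if DATE.fullmatch(text):
                result.append((text, "DATE"))
                break
        else:
            # No window matched: keep the token with its POS tag.
            result.append(tagged[i])
            n = 1
        i += n
    return result


# Hand-written tags standing in for pos_tag(word_tokenize(s)) output.
example = [("They", "PRP"), ("all", "VBP"), ("like", "IN"), ("to", "TO"),
           ("go", "VB"), ("there", "RB"), ("on", "IN"), ("5th", "CD"),
           ("November", "NNP"), ("2010", "CD"), (",", ","), ("but", "CC"),
           ("I", "PRP"), ("am", "VBP"), ("not", "RB"), ("interested", "JJ"),
           (".", ".")]
print(merge_phrases(example))
```

Trying every window at every position costs more than the incremental scan, but it keeps the longest-match logic in one place and lets dates and custom phrases share the same loop.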
You should probably do chunking with nltk.RegexpParser to achieve your objective.
Reference: http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1
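As a rough illustration of the chunking approach, the sketch below runs nltk.RegexpParser over a hand-tagged token list. The tags, the one-rule grammar and the flatten helper are assumptions made for this example; a real tagger may label "5th" or "November" differently, in which case the tag pattern would need adjusting.

```python
from nltk import RegexpParser, Tree

# Hand-written tags (an assumption); pos_tag may label these tokens differently.
tagged = [("go", "VB"), ("there", "RB"), ("on", "IN"),
          ("5th", "CD"), ("November", "NNP"), ("2010", "CD"), (",", ",")]

# Illustrative grammar: number, proper noun, number forms one DATE chunk.
parser = RegexpParser("DATE: {<CD><NNP><CD>}")
tree = parser.parse(tagged)


def flatten(tree):
    """Turn each chunk subtree into a single (phrase, label) pair."""
    pairs = []
    for node in tree:
        if isinstance(node, Tree):
            phrase = " ".join(word for word, _ in node.leaves())
            pairs.append((phrase, node.label()))
        else:
            pairs.append(node)  # token outside any chunk keeps its POS tag
    return pairs


print(flatten(tree))
```

The grammar works on POS tags only, so a fixed phrase such as "I am not interested" would still need a token-level match on top of it.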