How to use a regex backoff tagger in Python NLTK to override NN's?

I've been using a custom-trained NLTK POS tagger, and sometimes obvious verbs (ending in ING or ED) come back tagged as NN. How do I get the tagger to run every NN through an additional RegexpTagger just to find those extra verbs?

I've included some sample code for the secondary regex tagger.

from nltk.tag.sequential import RegexpTagger

rgt = RegexpTagger([
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # past tense verbs
])

Thanks

Here is a trigram tagger that backs off to a bigram tagger, which backs off to a unigram tagger, with a regex tagger as the final backoff. A token is therefore left to the regex rules only when none of the n-gram taggers can tag it. Hope this helps you build a regex tagger with your own rules.

import sys

import nltk
from nltk import ne_chunk
from nltk.corpus import brown
from nltk.tokenize import word_tokenize


def tri_gram():
    # Training data: tagged sentences from the Brown corpus, "news" category
    b_t_sents = brown.tagged_sents(categories='news')

    # Final backoff: regex rules, tried in order; the first match wins
    default_tagger = nltk.RegexpTagger([
        (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
        (r'(The|the|A|a|An|an)$', 'AT'),   # articles
        (r'.*able$', 'JJ'),                # adjectives
        (r'.*ness$', 'NN'),                # nouns formed from adjectives
        (r'.*ly$', 'RB'),                  # adverbs
        (r'.*s$', 'NNS'),                  # plural nouns
        (r'.*ing$', 'VBG'),                # gerunds
        (r'.*ed$', 'VBD'),                 # past tense verbs
        (r'.*', 'NN'),                     # nouns (default)
    ])

    # Backoff chain: trigram -> bigram -> unigram -> regex
    u_gram_tag = nltk.UnigramTagger(b_t_sents, backoff=default_tagger)
    b_gram_tag = nltk.BigramTagger(b_t_sents, backoff=u_gram_tag)
    t_gram_tag = nltk.TrigramTagger(b_t_sents, backoff=b_gram_tag)

    # Tag the text file named by the first command-line argument
    with open(sys.argv[1], 'r') as f_read:
        given_text = f_read.read()
    for text in nltk.sent_tokenize(given_text):
        words = word_tokenize(text)
        sent = t_gram_tag.tag(words)
        print(ne_chunk(sent))


tri_gram()
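
One caveat about the chain above: a backoff tagger is consulted only when the tagger in front of it returns no tag at all, so a token the trained tagger has already labelled NN never reaches the regex rules. To get what the question asks for, one option is a second pass over the tagged output. The following is a minimal sketch of that idea, not a built-in NLTK feature: `override_nouns` is a hypothetical helper, and `first_pass` is hand-written sample data standing in for the output of a trained tagger.

```python
from nltk.tag import RegexpTagger

# Secondary tagger from the question: suffix rules for verbs only.
verb_tagger = RegexpTagger([
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*ed$', 'VBD'),   # past tense verbs
])


def override_nouns(tagged_sent, regex_tagger, noun_tag='NN'):
    """Re-tag tokens labelled noun_tag wherever regex_tagger finds a match.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    result = []
    for word, tag in tagged_sent:
        if tag == noun_tag:
            # RegexpTagger ignores context, so tagging one word at a time is safe.
            new_tag = regex_tagger.tag([word])[0][1]
            if new_tag is not None:
                tag = new_tag
        result.append((word, tag))
    return result


# Hand-written stand-in for the output of a trained tagger (assumed data).
first_pass = [('the', 'AT'), ('dog', 'NN'), ('jumped', 'NN'), ('barking', 'NN')]
second_pass = override_nouns(first_pass, verb_tagger)
print(second_pass)
```

Because the rules look only at suffixes, genuine nouns such as "thing" or "bed" would also be retagged if the first pass labels them NN, so the patterns may need tightening for real data.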
