Training NLTK Brill tagger but using a txt file as an input

Hi everyone. I'm now doing my final-year project, named "Part-Of-Speech Tagger for Malay Language using Brill Tagger".

I want to ask how to train on tagged sentences that I have saved in a txt file. The input should be a txt file, which is then used to train the Brill tagger. After that, I will use another txt file as the test data. But I'm stuck on the training part. Can you help me?

Here is some of my code.

import nltk  
f = open('gayahidupsihat_tagged.txt')  
malay_tagged = f.read()   

def train_brill_tagger(train_data):
    # Modules for creating the templates.
    from nltk.tag import UnigramTagger
    from nltk.tag.brill import SymmetricProximateTokensTemplate, ProximateTokensTemplate
    from nltk.tag.brill import ProximateTagsRule, ProximateWordsRule
    # The brill tagger module in NLTK.
    from nltk.tag.brill import FastBrillTaggerTrainer
    unigram_tagger = UnigramTagger(train_data)
    templates = [SymmetricProximateTokensTemplate(ProximateTagsRule, (1,1)),
                 SymmetricProximateTokensTemplate(ProximateTagsRule, (2,2)),
                 SymmetricProximateTokensTemplate(ProximateTagsRule, (1,2)),
                 SymmetricProximateTokensTemplate(ProximateTagsRule, (1,3)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (1,1)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (2,2)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (1,2)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (1,3)),
                 ProximateTokensTemplate(ProximateTagsRule, (-1, -1), (1,1)),
                 ProximateTokensTemplate(ProximateWordsRule, (-1, -1), (1,1))]

    trainer = FastBrillTaggerTrainer(initial_tagger=unigram_tagger,
                                   templates=templates, trace=3,
                                   deterministic=True)
    brill_tagger = trainer.train(train_data, max_rules=10)
    print
    return brill_tagger    

malay_train = (malay_tagged[:10]) 
malay_test = (malay_tagged[10:15]) 
malay20 = malay_tagged[20]

mt = train_brill_tagger(malay_train)    
print mt.tag(malay20)

Actually, I want to train on a tagged paragraph and then test the tagger on another paragraph. After that, I will use tagged sentences to evaluate the Brill tagger's results.

Example:

I train on this (gayahidupsihat_train.txt); it is really all one line of input:

Gaya\NN hidup\NN sihat\VB boleh\MD lah\UH ditakrifkan\VBZ sebagai\DT
satu\CD amalan\VBZ kehidupan\NN yang\DT membawa\VBZ impak\NN positif\NN
kepada\TO diri\NN seseorang\NN ,\, keluarganya\NN dan\CC masyarakat\NN.
Antara\IN contoh\NN kehidupan\NN yang\DT sihat\VB ialah\DT individu\NN
tersebut\EX hidup\VB dengan\DT penuh\RB ceria\RB tanpa\NN mengalami\VBZ
sebarang\NN masalah\NN yang\DT boleh\MD menjejaskan\VBZ kehidupannya\NN
untuk\TO satu\CD tempoh\NN tertentu\EX pula\DT .\. Sudah\EX pasti\RB
dalam\DT kehidupan\NN era\NN moden\NN yang\DT begitu\DT banyak\RB
tekanan\VB ini\DT gaya\NN hidup\NN sihat\VB menjadi\VBZ satu\NUM
matlamat\NN yang\DT perlu\MD dicapai\VBZ segera\VB. Oleh\PDT itu\DT ,\,
terdapat\EX pelbagai\NN tindakan\VBZ yang\DT boleh\MD dilakukan\VBZ
untuk\TO mencapai\VBZ matlamat\NN ini\DT .\.

Then I want to test with this (gayahidupsihat_test.txt):

Tindakan\VBP awal\VB ialah\DT seseorang\NN itu\DT perlu\MD
mengamalkan\VBD satu\CD bentuk\NN pemakanan\NN yang\DT seimbang\NN
dalam\IN kehidupannya\VBZ .\.Dalam\IN keadaan\NN kehidupan\NN sebenar\JJ
,\, orang\NN ramai\JJ lebih\JJR suka\VB mengambil\VBZ makanan\NN yang\DT
bersifat\VBZ mudah\JJ seperti\DT mengamalkan\VBZ pengambilan\VBD makanan\NN
ringan\JJ ataupun\CC makanan\NN segera\NN .\. TidaK\DT kurang\JJR juga\DT
masyarakat\NN kita\PRP hari\NN ini\DT yang\DT lupa\VB kesan\NN pengambilan\VBZ
makanan\NN berlemak\JJR ataupun\CC makanan\NN yang\DT mempunyai\VBZ
kandungan\NN garam\NN ,\. gula\NN atau\DT sodium\FW glutamit\FW yang\DT
tinggi\JJ .\. Hal\IN ini\DT boleh\MD mendatangkan\VBZ pelbagai\NN penyakit\NN
kronik\JJ seperti\DT sakit\JJ jantung\NN ,\, darah\NN tinggi\JJ
ataupun\CC kencing\NN manis\JJ yang\DT juga\DT menjadi\MD punca\NN kematian\NN
tertinggi\JJS di\IN negara\NN kita\PRP .\. 

After that, I will use some tagged_words to try the tagger and evaluate it.

The English version shows output like this:

Training Brill tagger on 500 sentences...
Finding initial useful rules...
Found 10210 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  46  46   0   0  | TO -> IN if the tag of the following word is 'AT'
  18  20   2   0  | TO -> IN if the tag of words i+1...i+3 is 'CD'
  14  14   0   0  | IN -> IN-TL if the tag of the preceding word is
                  |   'NN-TL', and the tag of the following word is
                  |   'NN-TL'
  11  11   0   1  | TO -> IN if the tag of the following word is 'NNS'
  10  10   0   0  | TO -> IN if the tag of the following word is 'JJ'
   8   8   0   0  | , -> ,-HL if the tag of the preceding word is 'NP-
                  |   HL'
   7   7   0   1  | NN -> VB if the tag of the preceding word is 'MD'
   7  13   6   0  | NN -> VB if the tag of the preceding word is 'TO'
   7   7   0   0  | NP-TL -> NP if the tag of words i+1...i+2 is 'NNS'
   7   7   0   0  | VBN -> VBD if the tag of the preceding word is
                  |   'NP'

You need to parse your input files (both train and test) into the format that the NLTK toolchain recognizes: a file is a list (or sequence) of sentences, a sentence is a list of tagged words, and a tagged word is a tuple of two strings, (word, tag). In your code, malay_tagged is a simple string (i.e., a sequence of characters), so slices such as malay_tagged[:10] give you ten characters, not ten sentences.
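As a rough illustration of that target format, here is a minimal pure-Python sketch that turns backslash-separated text into a list of sentences of (word, tag) tuples. The function name `parse_tagged_text`, its `end_tags` parameter, and the rule "a sentence ends at a token tagged `.`" are assumptions made for this example, not part of NLTK, and the sample string is a shortened stand-in for the file in the question.

```python
def parse_tagged_text(text, sep="\\", end_tags=(".",)):
    """Split whitespace-separated word<sep>TAG tokens into sentences.

    Hypothetical helper: a sentence is assumed to end at any token
    whose tag is listed in end_tags.
    """
    sentences, current = [], []
    for token in text.split():
        # Split on the LAST separator so the word itself may contain one.
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator in this token: keep the word, leave the tag unknown.
            word, tag = token, None
        current.append((word, tag))
        if tag in end_tags:
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that were not closed by an end tag.
        sentences.append(current)
    return sentences


# Shortened, cleaned-up stand-in for the training text (raw string keeps the backslashes).
sample = r"Gaya\NN hidup\NN sihat\VB .\. Oleh\PDT itu\DT ,\, terdapat\EX .\."
sents = parse_tagged_text(sample)
print(len(sents), sents[0][:2])
```

Indexing then works at the sentence level: `sents[:1]` is a list holding the first sentence, which is the shape taggers expect for training data.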

It's not hard to do it yourself, but NLTK's nltk.corpus.reader.TaggedCorpusReader can parse your file for you. Just make sure to tell it that the word-tag separator in your file is a backslash ( "\\" ).
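A sketch of how this might look, assuming Python 3 and NLTK 3. The temporary directory, the file name `train.txt`, and the two-sentence sample are invented for the example. The training step uses `nltk.tag.brill_trainer.BrillTaggerTrainer` with the `brill.brill24()` template set, which is the NLTK 3 interface rather than the older `FastBrillTaggerTrainer` used in the question, so treat that part as one possible adaptation.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader
from nltk.tag import DefaultTagger, UnigramTagger, brill, brill_trainer

# Hypothetical stand-in for gayahidupsihat_train.txt, one sentence per line.
sample = (
    "Gaya\\NN hidup\\NN sihat\\VB .\\.\n"
    "Oleh\\PDT itu\\DT ,\\, terdapat\\EX pelbagai\\NN tindakan\\VBZ .\\.\n"
)

# Write the sample to a throwaway directory so the reader has a file to load.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, "train.txt"), "w", encoding="utf-8") as f:
    f.write(sample)

# sep="\\" is a single backslash: the word-tag separator used in the file.
reader = TaggedCorpusReader(corpus_dir, ["train.txt"], sep="\\")
train_sents = [list(sent) for sent in reader.tagged_sents()]

# Initial tagger: unigram lookup, falling back to "NN" for unseen words.
initial = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))

# Brill trainer with one of NLTK's predefined template sets.
trainer = brill_trainer.BrillTaggerTrainer(initial, brill.brill24(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=10)

# tag() takes a list of plain words, not a string.
tagged = brill_tagger.tag(["Gaya", "hidup", "sihat", "."])
print(tagged)
```

By default the reader treats each line as one sentence, so a file that is really all one line comes back as a single sentence. Putting one sentence per line, or passing a different `sent_tokenizer` to the reader, avoids that. The test file can be loaded with a second reader in the same way.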
