
Training data format for NLTK punkt

I would like to run NLTK Punkt to split sentences. There is no training model, so I am training a model separately, but I am not sure if the training data format I am using is correct.

My training data is one sentence per line. I wasn't able to find any documentation about this; only this thread ( https://groups.google.com/forum/#!topic/nltk-users/bxIEnmgeCSM ) sheds some light on the training data format.

What is the correct training data format for the NLTK Punkt sentence tokenizer?

Ah yes, the Punkt tokenizer is the magical unsupervised sentence boundary detection. And the authors' last names are pretty cool too, Kiss and Strunk (2006). The idea is to use NO annotation to train a sentence boundary detector, hence the input can be ANY sort of plaintext (as long as the encoding is consistent).

To train a new model, simply use:

import nltk.tokenize.punkt
import pickle
import codecs

# Train an unsupervised Punkt model on plain text (no annotation needed).
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
with codecs.open("someplain.txt", "r", "utf8") as fin:
    tokenizer.train(fin.read())

# Pickle the trained tokenizer so it can be reloaded later.
with open("someplain.pk", "wb") as out:
    pickle.dump(tokenizer, out)
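
Once that pickle is saved, a minimal sketch for reusing the trained model later (assuming the someplain.pk file produced above) would be:

import pickle

# Load the pickled Punkt tokenizer and split new text into sentences.
with open("someplain.pk", "rb") as fin:
    tokenizer = pickle.load(fin)

print(tokenizer.tokenize(u"Mr. Smith went to Washington. He arrived on Monday."))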

To achieve higher precision, and to allow you to stop training at any time and still save a proper pickle for your tokenizer, do look at this code snippet for training a German sentence tokenizer, https://github.com/alvations/DLTK/blob/master/dltk/tokenize/tokenizer.py :

import codecs
import pickle

from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer

def train_punktsent(trainfile, modelfile):
  """ Trains an unsupervised NLTK punkt sentence tokenizer. """
  punkt = PunktTrainer()
  try:
    # Incremental training; finalize=False so training can be interrupted
    # and finalized later.
    with codecs.open(trainfile, 'r', 'utf8') as fin:
      punkt.train(fin.read(), finalize=False, verbose=False)
  except KeyboardInterrupt:
    print('KeyboardInterrupt: Stopping the reading of the dump early!')
  ##HACK: Adds abbreviations from rb_tokenizer.
  abbrv_sent = " ".join([i.strip() for i in \
                         codecs.open('abbrev.lex','r','utf8').readlines()])
  abbrv_sent = "Start"+abbrv_sent+"End."
  punkt.train(abbrv_sent, finalize=False, verbose=False)
  # Finalize and output the trained model.
  punkt.finalize_training(verbose=True)
  model = PunktSentenceTokenizer(punkt.get_params())
  with open(modelfile, mode='wb') as fout:
    pickle.dump(model, fout, protocol=pickle.HIGHEST_PROTOCOL)
  return model
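
A minimal usage sketch of that function (the file names here are hypothetical, and abbrev.lex is expected to sit next to the training corpus):

# Hypothetical corpus and output paths.
model = train_punktsent('german_corpus.txt', 'german_punkt.pk')
print(model.tokenize(u"Dr. Müller kam am Montag an. Er blieb zwei Tage."))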

However, do note that the period detection is very sensitive to the Latin full stop, question mark and exclamation mark. If you're going to train a punkt tokenizer for other languages that don't use Latin orthography, you'll need to somehow hack the code to use the appropriate sentence boundary punctuation. If you're using NLTK's implementation of punkt, edit the sent_end_chars variable.
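
For example, here is a minimal sketch of overriding that punctuation by subclassing PunktLanguageVars (recent NLTK versions accept a lang_vars argument on the tokenizer; the Devanagari danda and the corpus file name are just illustrative assumptions):

import codecs
from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer

class DandaLanguageVars(PunktLanguageVars):
    # Add the Devanagari danda to the default Latin sentence-ending characters.
    sent_end_chars = ('.', '?', '!', u'\u0964')

tokenizer = PunktSentenceTokenizer(lang_vars=DandaLanguageVars())
with codecs.open("hindi_corpus.txt", "r", "utf8") as fin:  # hypothetical corpus
    tokenizer.train(fin.read())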

There are pre-trained models available other than the 'default' English tokenizer used by nltk.tokenize.sent_tokenize(). Here they are: https://github.com/evandrix/nltk_data/tree/master/tokenizers/punkt
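
A minimal sketch of using the pre-trained models that ship with the standard nltk_data distribution (assuming nltk.download('punkt') has been run), either through sent_tokenize() or by loading a pickle directly:

import nltk
from nltk.tokenize import sent_tokenize

# sent_tokenize() picks the pre-trained Punkt model for the given language.
print(sent_tokenize(u"Herr Dr. Müller kam an. Er blieb zwei Tage.", language='german'))

# Or load the pickled model directly from nltk_data.
german_punkt = nltk.data.load('tokenizers/punkt/german.pickle')
print(german_punkt.tokenize(u"Herr Dr. Müller kam an. Er blieb zwei Tage."))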

Edited

Note: the pre-trained models are currently not available because the nltk_data github repo listed above has been removed.
