
Training Data Set in NLTK Python

I am working on Python NLTK tagging, and my input text is not Hindi. In order to tag my input text, the tagger must first be trained.

My question is: how do I train the data?

I have this line of code, which was suggested to me here on Stack Overflow:

train_data = indian.tagged_sents('hindi.pos') 

But what about non-Hindi data input?

The short answer is: training a tagger requires a tagged corpus.

Part-of-speech tags must be assigned according to some existing model. Unfortunately, unlike some problems such as finding sentence boundaries, there is no way to choose them out of thin air. There are some experimental approaches that try to assign parts of speech using parallel texts and machine-translation alignment algorithms, but all real POS taggers must be trained on text that has already been tagged.
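To make "trained on text that has already been tagged" concrete, here is a minimal sketch in plain Python (no NLTK required; the tiny corpus and tag names are invented purely for illustration) of a unigram tagger that simply memorizes the most frequent tag seen for each word:

```python
from collections import Counter, defaultdict

# A tiny hand-tagged corpus: each sentence is a list of (word, tag) pairs.
# The words and tag labels here are invented for illustration only.
tagged_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

def train_unigram_tagger(sentences):
    """Count how often each word carries each tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    # For each word, keep only its most frequent tag.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words, default="UNK"):
    """Tag a list of words; words never seen in training get a default tag."""
    return [(w, model.get(w, default)) for w in words]

model = train_unigram_tagger(tagged_corpus)
print(tag(model, ["the", "dog", "sleeps"]))
# → [('the', 'DET'), ('dog', 'NOUN'), ('sleeps', 'VERB')]
```

This is the same idea NLTK's trainable taggers implement, just stripped to its core: without those `(word, tag)` pairs to count, there is nothing for the tagger to learn from.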

Evidently you don't have a tagged corpus for your unnamed language, so you'll need to find or create one if you want to build a tagger. Creating a tagged corpus is a major undertaking, since you'll need a lot of training material to get any sort of decent performance. There may be ways to "bootstrap" a tagged corpus (put together a poor-quality tagger that makes it easier to retag the results by hand), but all of that depends on your situation.
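One way the bootstrapping idea might look in practice is sketched below; this is a self-contained toy (the seed words, tags, and the trivial most-frequent-tag model are all invented for illustration), not a real annotation workflow:

```python
from collections import Counter, defaultdict

def train(sentences):
    """Train a trivial most-frequent-tag model on (word, tag) sentences."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def auto_tag(model, words):
    """Propose tags for raw text; unseen words are flagged for human review."""
    return [(w, model.get(w, "??")) for w in words]

# 1. Start from a small hand-tagged seed corpus.
seed = [[("maison", "NOUN"), ("grande", "ADJ")]]
model = train(seed)

# 2. Auto-tag new raw text with the rough model; "??" marks words the
#    model has never seen and cannot guess.
proposed = auto_tag(model, ["grande", "porte"])
print(proposed)  # → [('grande', 'ADJ'), ('porte', '??')]

# 3. A human corrects the proposals (simulated here), and the corrected
#    sentence is added back into the corpus before retraining.
corrected = [("grande", "ADJ"), ("porte", "NOUN")]
model = train(seed + [corrected])
```

Each round of this loop grows the corpus and improves the tagger, so the human spends more time confirming proposals and less time tagging from scratch.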

