
Using Mallet for Naive Bayes classification: How and where are Alphabets set up?

I am trying to use the MALLET machine-learning library in a project for word sense disambiguation. My feature vectors consist of a fixed-size window of x tokens to the left and right of the target token. The MALLET training instances are created like this:

// Create training list
Pipe pipe = new TokenSequenceLowercase();
InstanceList instanceList = new InstanceList(pipe);
Instance trainingInstance = new Instance(data, senseID, instanceID, text);
instanceList.add(trainingInstance);
...
// Training
ClassifierTrainer classifierTrainer = new NaiveBayesTrainer();
Classifier classifier = classifierTrainer.train(instanceList);

where

  • "data" is an ArrayList<String> with the feature tokens
  • "senseID" is the class label of the respective word sense
  • "instanceID" is just a String to identify the training instance
  • "text" is the original source text

I would have expected the dataAlphabet and targetAlphabet properties of the InstanceList to be built on the fly as training instances are added, but this is not the case. Consequently, my code fails in the last line above with a NullPointerException, since the targetAlphabet property of the Naive Bayes trainer is null.

Looking at the MALLET source code, I can see that the root cause of the Alphabets not being constructed is that my data and labels don't implement the AlphabetCarrying interface. Therefore, null is returned in the Instance class here:

public Alphabet getDataAlphabet() {
    if (data instanceof AlphabetCarrying)
        return ((AlphabetCarrying)data).getAlphabet();
    else
        return null;
}

I find this rather confusing, because the documentation says that data and labels can be of any object type. But the error above seems to indicate the opposite: that I need to construct a specific data / label class that implements AlphabetCarrying.

I feel like I am missing something important on the conceptual level regarding these Alphabets. Also, it is not clear to me whether the data alphabet should be derived from all the training instances or from just one. Can someone explain the error here?

Cheers,

Martin

Answering my own question here: The solution was to add some pipes, specifically a TokenSequence2FeatureSequence pipe to build the data alphabet and a Target2Label pipe to build the label alphabet. Also, the training instances need to be added using instanceList.addThruPipe(trainingInstance).
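A minimal sketch of the fix described above, assuming MALLET 2.x (package cc.mallet) is on the classpath. The class name, the helper toTokenSequence, and the sample tokens and sense labels are made up for illustration. The FeatureSequence2FeatureVector pipe is not mentioned above; it is added here on the assumption that NaiveBayesTrainer works on feature vectors rather than feature sequences. The plain strings are wrapped in a TokenSequence because the token pipes cast the instance data to that type.

```java
import cc.mallet.classify.Classifier;
import cc.mallet.classify.NaiveBayesTrainer;
import cc.mallet.pipe.FeatureSequence2FeatureVector;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.Target2Label;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.TokenSequenceLowercase;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;
import cc.mallet.types.Token;
import cc.mallet.types.TokenSequence;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WsdPipelineSketch {

    // Hypothetical helper: wrap plain strings in a TokenSequence,
    // the data type the token pipes operate on.
    static TokenSequence toTokenSequence(List<String> tokens) {
        TokenSequence sequence = new TokenSequence();
        for (String token : tokens) {
            sequence.add(new Token(token));
        }
        return sequence;
    }

    static InstanceList buildInstanceList() {
        ArrayList<Pipe> pipes = new ArrayList<>();
        pipes.add(new Target2Label());                  // builds the label (target) alphabet
        pipes.add(new TokenSequenceLowercase());
        pipes.add(new TokenSequence2FeatureSequence()); // builds the data alphabet
        pipes.add(new FeatureSequence2FeatureVector()); // assumption: the trainer wants vectors

        InstanceList instanceList = new InstanceList(new SerialPipes(pipes));

        // addThruPipe (not add) runs each instance through the pipes,
        // which is what populates the alphabets.
        instanceList.addThruPipe(new Instance(
                toTokenSequence(Arrays.asList("River", "bank", "water")),
                "bank_river", "inst-1", "made-up source text 1"));
        instanceList.addThruPipe(new Instance(
                toTokenSequence(Arrays.asList("money", "bank", "loan")),
                "bank_finance", "inst-2", "made-up source text 2"));
        return instanceList;
    }

    public static void main(String[] args) {
        InstanceList instanceList = buildInstanceList();
        Classifier classifier = new NaiveBayesTrainer().train(instanceList);

        System.out.println("data alphabet size:   " + instanceList.getDataAlphabet().size());
        System.out.println("target alphabet size: " + instanceList.getTargetAlphabet().size());
        System.out.println("trained classifier:   " + classifier.getClass().getSimpleName());
    }
}
```

With this setup both alphabets are non-null by the time train is called, because they belong to the pipes and are filled as instances pass through them.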

This is based on answers from the Mallet mailing list.
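On the conceptual question of whether the data alphabet comes from one instance or from all of them: as far as I understand it, an alphabet is a single growing mapping between entries and integer indices, shared by every instance that passes through the same pipe. The following is a toy illustration of that idea in plain Java. ToyAlphabet and its methods are hypothetical names and not MALLET's own class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for the idea behind an alphabet: one shared, growing
// mapping from entries to dense integer indices. Not MALLET's class.
class ToyAlphabet {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> entries = new ArrayList<>();

    // Return the index of an entry, assigning the next free index if it is new.
    int lookupIndex(String entry) {
        Integer index = indexOf.get(entry);
        if (index == null) {
            index = entries.size();
            indexOf.put(entry, index);
            entries.add(entry);
        }
        return index;
    }

    int size() {
        return entries.size();
    }

    // Encode one instance's tokens, growing the alphabet as needed.
    int[] encode(List<String> tokens) {
        int[] indices = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            indices[i] = lookupIndex(tokens.get(i));
        }
        return indices;
    }
}

class ToyAlphabetDemo {
    public static void main(String[] args) {
        ToyAlphabet dataAlphabet = new ToyAlphabet();

        // Both instances go through the same alphabet, so "bank" keeps index 1.
        int[] first = dataAlphabet.encode(Arrays.asList("river", "bank", "water"));
        int[] second = dataAlphabet.encode(Arrays.asList("money", "bank", "loan"));

        System.out.println(Arrays.toString(first));  // prints [0, 1, 2]
        System.out.println(Arrays.toString(second)); // prints [3, 1, 4]
        System.out.println(dataAlphabet.size());     // prints 5
    }
}
```

The point of the sketch is that feature indices only mean something relative to the one alphabet that produced them, which is why all training instances need to go through the same pipe.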
