CRF++ / Wapiti: including the category of the whole sentence as a feature

Question

How can I represent the category of a sentence, as predicted by Naive Bayes, as a feature in CRF++ or Wapiti?

For instance, if the sentence "Tumblr merges with Yahoo." is classified as Business, then where in the CRF training file can I indicate the label Business as a feature? And how should the template then be written?

Should the train file look like this?

Tumblr    business    ORG
merges    business    O
with     business    O
Yahoo    business    ORG

Or should the category only be included on the tokens labeled ORG? If so, how? And what about the template file?

Answer 1

Method 1: You can add business as a feature column exactly as you have shown, or simply write 1 instead of business. Similarly, for the category sports you can add another column whose value is 1 for words belonging to a sports sentence. You will also have to reference each of these columns in the template file:

# business column
U42:%x[0,1]
# sports column
U43:%x[0,2]
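Method 1 can be sketched in Python as follows. This is only an illustration, not part of CRF++ or Wapiti: the helper names `to_crf_rows` and `build_training_text` are hypothetical, and writing 0 for the categories a sentence does not belong to is an assumption (the answer only says to write 1 for the matching category). It relies on the standard CRF++ data layout: whitespace-separated columns, the label in the last column, and a blank line between sentences.

```python
def to_crf_rows(tokens, labels, category, categories):
    """Build one row per token: token, one 0/1 column per category, label.

    `category` is the sentence-level class (e.g. predicted by Naive Bayes);
    `categories` fixes the column order, so column 1 is categories[0], etc.
    """
    # Assumption: non-matching categories are written as "0".
    flags = ["1" if c == category else "0" for c in categories]
    return ["\t".join([tok, *flags, lab]) for tok, lab in zip(tokens, labels)]


def build_training_text(sentences, categories):
    """Join sentences into CRF++-style text, separated by blank lines."""
    blocks = [
        "\n".join(to_crf_rows(tokens, labels, category, categories))
        for tokens, labels, category in sentences
    ]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    categories = ["business", "sports"]
    sentences = [
        (["Tumblr", "merges", "with", "Yahoo"], ["ORG", "O", "O", "ORG"], "business"),
    ]
    print(build_training_text(sentences, categories), end="")
```

With `categories = ["business", "sports"]`, column 1 holds the business flag and column 2 the sports flag, which matches the `%x[0,1]` and `%x[0,2]` references in the template lines above.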

Method 2: Attaching the category only to the ORG tokens might not be a good idea, because the same ORG can appear in sentences of different categories.

Answer 2

As far as I know, your train file is the only way to include sentence-level annotation, unless you would consider adapting or implementing a CRF that takes sentence-level features into account.

If you have enough training data and a limited number of categories, this method would probably assign a low weight to the sentence categories: they would only help to distinguish named entities when those are ambiguous and the computed NE category probabilities are close to each other.

The best way would indeed be to train with and without this feature and see whether it improves NER! It should be an interesting experiment :)
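One way to set up that comparison is to keep a single training file (with the category columns present) and vary only the template, so that the two models differ in nothing but the sentence-category feature. The Python sketch below generates both templates. The function name `build_template`, the word-window features, and the feature IDs (`U00`, `U10`, ...) are illustrative assumptions, not something prescribed by the answers above.

```python
def build_template(category_columns, include_category=True):
    """Return CRF++-style template text, with or without category features.

    `category_columns` lists the column indices that hold the sentence
    category flags in the training file (e.g. [1, 2]).
    """
    lines = [
        "# word identity in a +/-1 token window (illustrative baseline)",
        "U00:%x[-1,0]",
        "U01:%x[0,0]",
        "U02:%x[1,0]",
    ]
    if include_category:
        lines.append("# sentence-category columns")
        for i, col in enumerate(category_columns):
            lines.append(f"U{10 + i:02d}:%x[0,{col}]")
    # A bare "B" line adds label-bigram (transition) features.
    lines.append("B")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("template_with_category", "w", encoding="utf-8") as f:
        f.write(build_template([1, 2], include_category=True))
    with open("template_baseline", "w", encoding="utf-8") as f:
        f.write(build_template([1, 2], include_category=False))
```

Each template can then be trained against the same data with `crf_learn <template> <train_file> <model>` and scored on the same held-out file with `crf_test -m <model> <test_file>`; comparing the NER scores of the two models shows whether the category columns help.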
