
CRF++/Wapiti: include category of entire sentence as feature

How can I represent the category of a sentence, as predicted by Naive Bayes, as a feature in CRF++ or Wapiti?

For instance, if the sentence "Tumblr merges with Yahoo." is classified as Business, where in the CRF training file can I indicate the label Business as a feature? And how should the template then be written?

Should the training file look like this?

Tumblr    business    ORG
merges    business    O
with     business    O
Yahoo    business    ORG

Or should the category only be included on the tokens labelled ORG? If so, how? And what would the template file look like?

Method 1: You can add business as a feature column exactly as you have shown, or you can simply write 1 instead of business. Similarly, for the category sports you can add another column whose value is 1 for words belonging to a sports sentence. You'll have to add a line for each column to the template file too:

U42:%x[0,1] #for business
U43:%x[0,2] #for sports
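A minimal sketch of how the two training-file layouts from Method 1 could be generated. The function names are made up for illustration, and writing 0 for the categories a sentence does not belong to is an assumption (the answer only specifies the value 1). With the category list ordered as business, sports, the one-hot layout puts those flags in columns 1 and 2, matching the two template lines above.

```python
from typing import List, Sequence


def rows_with_category(tokens: Sequence[str], labels: Sequence[str],
                       category: str) -> List[str]:
    """Single feature column: the category name repeated on every token."""
    return [f"{tok}\t{category}\t{lab}" for tok, lab in zip(tokens, labels)]


def rows_one_hot(tokens: Sequence[str], labels: Sequence[str],
                 category: str, categories: Sequence[str]) -> List[str]:
    """One column per category, in the order given by `categories`.

    Assumption: "1" marks the sentence's category, "0" all others.
    """
    flags = ["1" if c == category else "0" for c in categories]
    return ["\t".join([tok, *flags, lab]) for tok, lab in zip(tokens, labels)]


def sentence_block(rows: Sequence[str]) -> str:
    """Join the rows of one sentence; an empty line separates sentences."""
    return "\n".join(rows) + "\n\n"


tokens = ["Tumblr", "merges", "with", "Yahoo"]
labels = ["ORG", "O", "O", "ORG"]

single = rows_with_category(tokens, labels, "business")
one_hot = rows_one_hot(tokens, labels, "business", ["business", "sports"])
print(sentence_block(single), end="")
print(sentence_block(one_hot), end="")
```

The label stays in the last column in both layouts, so only the template needs to change when switching between them.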

Method 2: Including the category together with ORG might not be a good idea, because the same ORG can appear in different categories.

As far as I know, the training file is the only way to include sentence-level annotation, unless you'd consider adapting or implementing a CRF that takes sentence-level features into account.

If you have enough training data and a limited number of categories, this method would probably assign a low weight to the sentence categories: they would only help to distinguish named entities that are ambiguous and whose computed NE category probabilities are close.

The best way would indeed be to train with and without this feature and see whether it improves NER! It should be an interesting experiment :)
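One way to run that comparison is to score the tagger output of both models on the same held-out file. The sketch below assumes each output line holds whitespace-separated columns with the gold label second to last and the predicted label last; the `token_f1` helper is hypothetical, and it computes a simple token-level F1 rather than an entity-level score.

```python
from typing import Iterable


def token_f1(lines: Iterable[str], outside: str = "O") -> float:
    """Token-level F1 over all labels other than `outside`.

    Assumes gold label in the second-to-last column and the
    predicted label in the last column of each non-empty line.
    """
    tp = fp = fn = 0
    for line in lines:
        cols = line.split()
        if len(cols) < 2:
            continue  # blank separator between sentences
        gold, pred = cols[-2], cols[-1]
        if pred != outside and pred == gold:
            tp += 1
        else:
            if pred != outside:
                fp += 1  # predicted an entity label that is wrong
            if gold != outside:
                fn += 1  # missed or mislabelled a gold entity token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up output lines for illustration only, not real results.
example = [
    "Tumblr business ORG ORG",
    "merges business O O",
    "with business O ORG",
    "Yahoo business ORG O",
]
score = token_f1(example)
print(score)
```

Running the same function on the output of a model trained without the category column gives the second number to compare against.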


 