CRF++/Wapiti: include the category of the entire sentence as a feature
How can I represent the category of a sentence, as predicted by Naive Bayes, as a feature in CRF++ or Wapiti?
For instance, if the sentence "Tumblr merges with Yahoo." is classified as Business, where in the CRF training file can I indicate the label Business as a feature? And how should the template then be modeled?
Should the train file look like this?
Tumblr business ORG
merges business O
with business O
Yahoo business ORG
Or should the category be included only on tokens with the ORG label? If so, how? And what would the template file look like?
Method 1: You can add business as a feature column in the same way you have shown, or you can simply write 1 instead of business. Similarly, for the category sports you can add another column whose value is 1 for words belonging to a sports sentence (and some other value, such as 0, otherwise). You'll also have to add each column to the template file:
U42:%x[0,1] #for business
U43:%x[0,2] #for sports
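The one-column-per-category layout from Method 1 can be sketched in Python as follows. This is only an illustration, not part of CRF++ or Wapiti: the function name `to_crf_rows`, the `CATEGORIES` list, and the use of 0 for "sentence is not in this category" are assumptions made for the example. Both tools expect the label in the last column and a blank line between sentences.

```python
# Hypothetical category inventory; the column order must match the template
# (column 1 = business, column 2 = sports, as in U42/U43 above).
CATEGORIES = ["business", "sports"]


def to_crf_rows(tokens, labels, category, categories=CATEGORIES):
    """Build rows of the form: token, one 0/1 flag per category, label."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    # Every token in the sentence gets the same sentence-level flags.
    flags = ["1" if category == c else "0" for c in categories]
    return [" ".join([tok, *flags, lab]) for tok, lab in zip(tokens, labels)]


rows = to_crf_rows(
    ["Tumblr", "merges", "with", "Yahoo"],
    ["ORG", "O", "O", "ORG"],
    "business",  # sentence category, e.g. predicted by Naive Bayes
)
# Write the rows followed by an empty line to mark the sentence boundary.
print("\n".join(rows) + "\n")
```

Each sentence would be written out this way, so every row in the file has the same number of columns.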
Method 2: Including the category only with ORG might not be a good idea, because the same ORG can appear in different categories.
As far as I know, your train file is the only way to include sentence-level annotation, unless you'd consider adapting or implementing a CRF that takes sentence-level features into account.
If you have enough training data and a limited number of categories, this method would probably assign a low weight to the sentence categories: they would only be used to distinguish named entities when these are ambiguous and the computed NE category probabilities are close to each other.
The best way would indeed be to train with and without this feature and see whether it improves NER! It should be an interesting experiment :)
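To compare the two runs, one simple option is a token-level F1 score over the non-O labels, computed on the same held-out sentences for both models. The sketch below is a minimal illustration under that assumption; `token_f1` is a hypothetical helper, and the label sequences are toy inputs rather than real results.

```python
def token_f1(gold, pred, outside="O"):
    """Token-level F1 over labels other than the 'outside' label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # A true positive is a non-O token whose predicted label matches gold.
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != outside)
    pred_pos = sum(1 for p in pred if p != outside)
    gold_pos = sum(1 for g in gold if g != outside)
    if tp == 0:
        return 0.0
    precision = tp / pred_pos
    recall = tp / gold_pos
    return 2 * precision * recall / (precision + recall)


# Toy label sequences for illustration only (not measured output).
gold = ["ORG", "O", "O", "ORG"]
pred_without_category = ["ORG", "O", "O", "O"]
pred_with_category = ["ORG", "O", "O", "ORG"]

print(token_f1(gold, pred_without_category))
print(token_f1(gold, pred_with_category))
```

An entity-level score would be stricter for multi-token entities; the token-level version is used here only because it is short.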