
Can I use numerical features in a CRF model?

Original post: 2014-10-01 23:40:44 4 3 machine-learning/ nlp/ data-mining/ data-modeling/ crf

Is it possible, and is it a good idea, to add numerical features to CRF models? For example, the position in the sequence.

I'm using CRFsuite. It seems all the features are converted to strings, e.g. 'pos=0', 'pos=1', which then lose their numeric meaning (for example, the Euclidean distance between positions).

Or should I use them to train another model, e.g. an SVM, and then ensemble it with the CRF models?

3 answers

I figured out that CRFsuite does handle numerical features, at least according to this documentation:

{"string_key": float_weight, ...} dict where keys are observed features and values are their weights;

{"string_key": bool, ...} dict; True is converted to 1.0 weight, False - to 0.0;

{"string_key": "string_value", ...} dict; that's the same as {"string_key=string_value": 1.0, ...}

["string_key1", "string_key2", ...] list; that's the same as {"string_key1": 1.0, "string_key2": 1.0, ...}

{"string_prefix": {...}} dicts: nested dict is processed and "string_prefix" is prepended to each key.

{"string_prefix": [...]} dicts: nested list is processed and "string_prefix" is prepended to each key.

{"string_prefix": set([...])} dicts: nested list is processed and "string_prefix" is prepended to each key.

As long as:

I keep the input properly formatted;
I pass an actual float rather than a string containing a float;
I normalize it.
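
The three conditions above can be sketched in plain Python. This is only an illustration: the helper name `position_features` and the feature keys (`word`, `is_title`, `rel_pos`) are made up for this example, and scaling the position to [0, 1] is just one possible normalization.

```python
def position_features(tokens):
    """Build one feature dict per token, with the position as a float in [0, 1]."""
    denom = max(len(tokens) - 1, 1)  # avoid dividing by zero for one-token sequences
    features = []
    for i, tok in enumerate(tokens):
        features.append({
            "word": tok.lower(),        # string value: becomes an indicator "word=<value>"
            "is_title": tok.istitle(),  # bool value: True -> 1.0, False -> 0.0
            "rel_pos": i / denom,       # real float (not the string "0.5"), normalized
        })
    return features


feats = position_features(["The", "cat", "sat"])
```

Assuming the python-crfsuite bindings are what is being used, a list like `feats` would be passed as the item sequence to `Trainer.append` together with the matching label sequence.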

A CRF itself can use numerical features, and you should use them, but if your implementation converts them to strings (encoding them in binary form via one-hot encoding), they may become less useful. I suggest looking for a more "pure" CRF implementation that allows continuous variables.

A fun fact is that a CRF is, at its core, just a structured MaxEnt (logistic regression) model, which works in the continuous domain. The string encoding is actually a way to go from categorical values into the continuous domain, so your problem is really a result of "overdesigning" in CRFsuite, which forgot about the actual capabilities of the CRF model.
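
To make the "structured MaxEnt" point concrete, here is a minimal sketch of the per-item linear score only (transition features and normalization are left out). The weights are invented for illustration, and `score` is a hypothetical helper, not part of any library.

```python
def score(weights, features):
    """Linear score: sum of weight * feature value over the active features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# One-hot (string) encoding: every position is its own, unrelated feature.
one_hot_weights = {"pos=0": 0.7, "pos=1": -0.2}
s_seen = score(one_hot_weights, {"pos=1": 1.0})
s_unseen = score(one_hot_weights, {"pos=2": 1.0})  # no weight was learned for pos=2

# Numeric encoding: a single weight, scaled by the feature value.
numeric_weights = {"pos": 0.5}
s_numeric = [score(numeric_weights, {"pos": float(p)}) for p in range(3)]
```

With the one-hot form, a position never seen in training contributes nothing to the score, while the numeric form keeps the ordering of positions through a single shared weight.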

Just to clarify a bit the answer by Lishu (which is correct, but might confuse other readers as it confused me until I tried it). This:

{"string_key": float_weight, ...} dict where keys are observed features and values are their weights

could have been written as

{"feature_template_name": feature_value, ...} dict where keys are feature names and values are their values

i.e. with this you're not setting the weight the CRF assigns to this feature_template, but the value of this feature. I prefer to call them feature templates that have feature values, which is clearer than just "features". The CRF will then learn a weight associated with each of the possible feature_values for this feature_template.
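
The difference between a feature value and a model weight can be illustrated by applying the conversion rules quoted in the first answer by hand. The function `to_weighted` below is a hypothetical re-implementation of those rules for flat inputs only (nested prefixes are omitted); it is not the library's own code.

```python
def to_weighted(features):
    """Apply the quoted conversion rules to a flat feature dict or list."""
    if isinstance(features, (list, tuple, set)):
        return {key: 1.0 for key in features}
    out = {}
    for key, value in features.items():
        if isinstance(value, bool):  # test bool first: bool is a subclass of int
            out[key] = 1.0 if value else 0.0
        elif isinstance(value, (int, float)):
            out[key] = float(value)  # the number is the feature's value
        elif isinstance(value, str):
            out[f"{key}={value}"] = 1.0  # one indicator feature per distinct string
        else:
            raise TypeError(f"unsupported feature value for {key!r}")
    return out


as_number = to_weighted({"pos": 2.0})
as_string = to_weighted({"pos": "2"})
```

Here `as_number` keeps a single feature `pos` whose value is 2.0, while `as_string` creates a separate indicator feature `pos=2`; in both cases the weight itself is something the CRF learns during training.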