
Can I use numerical features in a CRF model?

Is it possible, and is it a good idea, to add numerical features to CRF models? For example, the position in the sequence.

I'm using CRFsuite. It seems all features are converted to strings, e.g. 'pos=0', 'pos=1', which then lose their numeric meaning (for example, the Euclidean distance between positions).

Or should I use them to train another model, e.g. an SVM, and then ensemble it with the CRF model?

I figured out that CRFsuite does handle numerical features, at least according to this documentation:

  • {"string_key": float_weight, ...} dict, where keys are observed features and values are their weights;
  • {"string_key": bool, ...} dict; True is converted to a weight of 1.0, False to 0.0;
  • {"string_key": "string_value", ...} dict; that's the same as {"string_key=string_value": 1.0, ...};
  • ["string_key1", "string_key2", ...] list; that's the same as {"string_key1": 1.0, "string_key2": 1.0, ...};
  • {"string_prefix": {...}} dicts: the nested dict is processed and "string_prefix" is prepended to each key;
  • {"string_prefix": [...]} dicts: the nested list is processed and "string_prefix" is prepended to each key;
  • {"string_prefix": set([...])} dicts: the nested list is processed and "string_prefix" is prepended to each key.

As long as:

  1. I keep the input properly formatted;
  2. I pass floats rather than string representations of floats;
  3. I normalize the values.
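A minimal sketch of those three conditions, assuming a hypothetical helper `token_features` and a toy sentence (neither comes from CRFsuite). The position is passed as a float scaled to [0, 1] instead of a string such as 'pos=3':

```python
def token_features(tokens, i):
    """Build one feature dict for token i (hypothetical helper, not part of CRFsuite)."""
    n = len(tokens)
    return {
        "bias": 1.0,
        "word.lower": tokens[i].lower(),      # string value -> categorical feature
        "word.isdigit": tokens[i].isdigit(),  # bool -> 1.0 / 0.0 per the list above
        # Float value normalized to [0, 1]; str(i) would instead become 'pos=<i>'.
        "pos": i / (n - 1) if n > 1 else 0.0,
    }


tokens = ["Call", "me", "at", "5"]  # toy input, made up for illustration
xseq = [token_features(tokens, i) for i in range(len(tokens))]
for feats in xseq:
    print(feats)

# With python-crfsuite installed, a sequence like this would typically be passed
# to a trainer together with its labels, e.g. trainer.append(xseq, yseq).
# That step is not run here.
```

Whether dividing by the sequence length is the right normalization depends on the task; it is only one plausible choice.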

A CRF itself can use numerical features, and you should use them. However, if your implementation converts them to strings (that is, encodes them in binary form via one-hot encoding), they may carry less information. I suggest looking for a more "pure" CRF implementation that allows continuous variables.

A fun fact: at its core a CRF is just a structured MaxEnt model (logistic regression), which works in the continuous domain. The string encoding is actually a way to map categorical values into that continuous domain, so your problem is really a result of "overdesign" in CRFsuite, which forgot about the actual capabilities of the CRF model.
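To make the contrast concrete, here is a small sketch of how a log-linear (MaxEnt-style) score treats the same position under both encodings. The weights are made up for illustration and the `score` function is hypothetical; it covers only the per-token feature sum, not transitions or normalization.

```python
def score(weights, features):
    """Log-linear score: sum of weight * feature value over the active features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# One-hot (string) encoding: every position is its own unrelated indicator feature.
onehot_weights = {"pos=0": 0.8, "pos=1": -0.2}  # made-up weights; 'pos=2' never seen
print(score(onehot_weights, {"pos=2": 1.0}))    # an unseen value contributes nothing

# Real-valued encoding: one shared weight, scaled by the feature value.
real_weights = {"pos": -0.5}                    # made-up weight
for p in (0.0, 0.5, 1.0):
    print(p, score(real_weights, {"pos": p}))   # contribution varies linearly with p
```

With the one-hot encoding the model knows nothing about how 'pos=1' relates to 'pos=2'; with the real-valued encoding nearby positions get nearby scores.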

Just to clarify Lishu's answer a bit (it is correct, but it might confuse other readers as it confused me until I tried it). This:

{"string_key": float_weight, ...} dict, where keys are observed features and values are their weights

could have been written as

{"feature_template_name": feature_value, ...} dict, where keys are feature names and values are their values

That is, with this you are not setting the weight the CRF assigns to this feature_template, but the value of the feature. I prefer to call them feature templates that have feature values, which is clearer than just "features". For string values, the CRF will then learn a weight associated with each of the possible feature_values of the feature_template; for a float value, as far as I can tell, the value scales the contribution of the weight learned for that key instead of creating a separate feature per value.
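As an illustration of the name/value distinction, the sketch below mimics the conversions listed in the documentation quoted earlier. `expand` is a hypothetical function, not CRFsuite's own code, and the ':' used to join a prefix to a nested key is an assumption.

```python
def expand(features, prefix=""):
    """Flatten a feature dict into {feature_name: float_value}."""
    flat = {}
    for key, value in features.items():
        name = prefix + key
        if isinstance(value, bool):          # check bool first: bool is a subclass of int
            flat[name] = 1.0 if value else 0.0
        elif isinstance(value, (int, float)):
            flat[name] = float(value)        # numeric: same name, the value is kept
        elif isinstance(value, str):
            flat[f"{name}={value}"] = 1.0    # string: one indicator per distinct value
        elif isinstance(value, dict):
            flat.update(expand(value, prefix=name + ":"))  # ':' is an assumption
        elif isinstance(value, (list, set, tuple)):
            for item in value:
                flat[f"{name}:{item}"] = 1.0
        else:
            raise TypeError(f"unsupported feature value for {name!r}: {value!r}")
    return flat


example = {"pos": 0.25, "word": "cat", "is_title": False, "prev": {"word": "the"}}
print(expand(example))
```

Note how "word" turns into a new feature name per value, while "pos" keeps a single name and carries its number along.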


