
Can I use numerical features in a CRF model?

Is it possible/good to add numerical features to CRF models, e.g. position in the sequence?

I'm using CRFsuite. It seems all the features will be converted to strings, e.g. 'pos=0', 'pos=1', which then lose their meaning in terms of Euclidean distance.

Or should I use them to train another model, e.g. an SVM, and then ensemble it with the CRF model?

I figured out that CRFsuite does handle numerical features, at least according to this documentation:

  • {"string_key": float_weight, ...} dict, where keys are observed features and values are their weights;
  • {"string_key": bool, ...} dict; True is converted to a 1.0 weight, False to 0.0;
  • {"string_key": "string_value", ...} dict; that's the same as {"string_key=string_value": 1.0, ...};
  • ["string_key1", "string_key2", ...] list; that's the same as {"string_key1": 1.0, "string_key2": 1.0, ...};
  • {"string_prefix": {...}} dicts: the nested dict is processed and "string_prefix" is prepended to each key;
  • {"string_prefix": [...]} dicts: the nested list is processed and "string_prefix" is prepended to each key;
  • {"string_prefix": set([...])} dicts: the nested set is processed and "string_prefix" is prepended to each key.

As long as:

  1. I keep the input properly formatted;
  2. I pass an actual float rather than the string representation of a float;
  3. I normalize it (see the sketch below).
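
For concreteness, here is a minimal sketch of mixing string and numerical features, assuming the sklearn-crfsuite wrapper around CRFsuite; the feature names and toy data are invented for illustration:

    # A minimal sketch, assuming sklearn-crfsuite (a thin wrapper around CRFsuite).
    # Feature names and the toy data below are invented for illustration.
    import sklearn_crfsuite

    def token_features(tokens, i):
        return {
            'word.lower': tokens[i].lower(),      # string value -> expanded to 'word.lower=<value>': 1.0
            'word.isupper': tokens[i].isupper(),  # bool -> converted to 1.0 / 0.0
            'pos': i / len(tokens),               # float -> kept as a real-valued feature, normalized to [0, 1)
        }

    sentences = [['John', 'lives', 'in', 'Berlin'], ['Mary', 'visited', 'Paris']]
    labels    = [['B-PER', 'O', 'O', 'B-LOC'], ['B-PER', 'O', 'B-LOC']]

    X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
    y = labels

    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
    crf.fit(X, y)
    print(crf.predict(X))

The key point is the type of the value: 'pos': 0.25 is treated as a continuous feature, while 'pos': '0.25' (a string) would be expanded to the binary feature 'pos=0.25'.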

A CRF itself can use numerical features, and you should use them, but if your implementation converts them to strings (i.e. encodes them in binary form via one-hot encoding) then they might be of reduced significance. I suggest looking for a more "pure" CRF implementation that allows continuous variables.

A fun fact is that a CRF at its core is just structured MaxEnt (logistic regression), which works in the continuous domain. The string encoding is actually a way to map categorical values into that continuous domain, so your problem is really a result of "overdesign" in CRFsuite, which overlooks the actual capabilities of the CRF model.
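
To make the one-hot point concrete, a tiny illustration with a hypothetical position feature:

    # Categorical encoding: each observed position expands into its own binary
    # indicator, so any notion of distance between positions is lost.
    categorical = {'pos=2': 1.0}   # what the string feature 'pos': '2' becomes internally

    # Continuous encoding: the model sees an ordered real value instead, so
    # position 2 lies between positions 1 and 3 rather than being an unrelated indicator.
    continuous = {'pos': 2.0}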

Just to clarify the answer by Lishu a bit (it is correct, but might confuse other readers as it did me until I tried it). This:

{"string_key": float_weight, ...} dict where keys are observed features and values are their weights

could have been written as

{"feature_template_name": feature_value, ...} dict where keys are feature names and values are their values

i.e. with this you're not setting the CRF weight corresponding to this feature_template, but the value of this feature. I prefer to call them feature templates that take feature values, so that everything is clearer than just "features". The CRF will then learn a weight associated with each of the possible feature_values of a string-valued feature_template; for a float-valued template it learns a single weight that is multiplied by the value.
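
A short sketch of that distinction, using a hypothetical 'pos' feature template:

    # The float you pass is the VALUE of the feature template, not a pre-set weight.
    x_i = {'pos': 0.75}   # template 'pos' takes the value 0.75 at this token

    # Training then learns a weight w('pos', label) for each label; this token's score
    # contribution is w('pos', label) * 0.75.  With a string value such as
    # {'pos': 'third'} (equivalent to {'pos=third': 1.0}) a separate weight is learned
    # for each observed value of the template.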
