
Weka Attribute selection - justifying different outcomes of different methods

I would like to use attribute selection on a numeric dataset. My goal is to find the best attributes, which I will later use in Linear Regression to predict numeric values.

For testing, I used the autoPrice.arff that I obtained from here (datasets-numeric.jar). Using ReliefFAttributeEval, I get the following outcome:

Ranked attributes:
 **0.05793   8 engine-size**
 **0.04976   5 width**
 0.0456    7 curb-weight
 0.04073  12 horsepower
 0.03787   2 normalized-losses
 0.03728   3 wheel-base
 0.0323   10 stroke
 0.03229   9 bore
 0.02801  13 peak-rpm
 0.02209  15 highway-mpg
 0.01555   6 height
 0.01488   4 length
 0.01356  11 compression-ratio
 0.01337  14 city-mpg
 0.00739   1 symboling

while using InfoGainAttributeEval (after applying a numeric-to-nominal filter) leaves me with the following results:

Ranked attributes:
6.8914   7 curb-weight
5.2409   4 length
5.228    2 normalized-losses
5.0422  12 horsepower
4.7762   6 height
4.6694   3 wheel-base
4.4347  10 stroke
4.3891   9 bore
**4.3388   8 engine-size**
**4.2756   5 width**
4.1509  15 highway-mpg
3.9387  14 city-mpg
3.9011  11 compression-ratio
3.4599  13 peak-rpm
2.2038   1 symboling

My question is: how can I justify the contradiction between the two results? If the two methods use different algorithms to achieve the same goal (revealing the relevance of an attribute to the class), why does one say that, e.g., engine-size is important while the other says it is not so important?
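For reference, here is a minimal sketch of how both rankings can be produced with Weka's Java API. It assumes the class attribute (price) is the last column of autoPrice.arff; the Explorer's "Select attributes" tab does the same thing interactively.

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.attributeSelection.ReliefFAttributeEval;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.NumericToNominal;

    public class RankAttributes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("autoPrice.arff");
            data.setClassIndex(data.numAttributes() - 1); // assumes price is the last attribute

            // ReliefF can be run directly on the numeric data.
            AttributeSelection relief = new AttributeSelection();
            relief.setEvaluator(new ReliefFAttributeEval());
            relief.setSearch(new Ranker());
            relief.SelectAttributes(data);
            System.out.println(relief.toResultsString());

            // InfoGain needs a nominal class, hence the numeric-to-nominal filter.
            NumericToNominal toNominal = new NumericToNominal();
            toNominal.setInputFormat(data);
            Instances nominalData = Filter.useFilter(data, toNominal);
            nominalData.setClassIndex(nominalData.numAttributes() - 1);

            AttributeSelection infoGain = new AttributeSelection();
            infoGain.setEvaluator(new InfoGainAttributeEval());
            infoGain.setSearch(new Ranker());
            infoGain.SelectAttributes(nominalData);
            System.out.println(infoGain.toResultsString());
        }
    }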

There is no reason to think that RELIEF and Information Gain (IG) should give identical results, since they measure different things.

IG looks at the difference in class entropy between not conditioning on the attribute and conditioning on it; hence, the attributes that are most informative with respect to the class variable will be ranked highest.
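Concretely (notation added here, not in the original answer), for a nominal class C and attribute A the information gain is

    \mathrm{IG}(C; A) = H(C) - H(C \mid A)
                      = -\sum_{c} p(c)\,\log_2 p(c) \;+\; \sum_{a} p(a) \sum_{c} p(c \mid a)\,\log_2 p(c \mid a)

so an attribute scores highly exactly when knowing its value reduces the uncertainty about the class.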

RELIEF, however, samples data instances and measures how well a feature discriminates between classes by comparing each sampled instance to its "nearby" instances. Note that RELIEF is a more heuristic (i.e., more stochastic) method, and the scores and ordering you get depend on several parameters (how many instances are sampled, how many neighbours are compared, the random seed), unlike IG.
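For instance, these are the ReliefFAttributeEval settings in question (a sketch using the property names from Weka's GUI/Javadoc; defaults may differ slightly between Weka versions). Two runs with different configurations need not agree on the exact ranking.

    import weka.attributeSelection.ReliefFAttributeEval;
    import weka.core.Utils;

    public class ReliefSettings {
        public static void main(String[] args) {
            ReliefFAttributeEval relief = new ReliefFAttributeEval();
            relief.setNumNeighbours(20);      // nearest hits/misses per sampled instance (default 10)
            relief.setSampleSize(100);        // number of instances sampled; -1 (default) uses them all
            relief.setSeed(42);               // random seed, relevant when only a subset is sampled
            relief.setWeightByDistance(true); // weight neighbour contributions by their distance

            // Print the resulting command-line options for this configuration.
            System.out.println(Utils.joinOptions(relief.getOptions()));
        }
    }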

So we would not expect algorithms optimizing different quantities to give identical results, especially when one of them is parameter-dependent. However, I'd say that your results are actually pretty similar: e.g., curb-weight and horsepower are close to the top in both rankings.
