I would like to use attribute selection for a numeric data-set. My goal is to find the best attributes that I will later use in Linear Regression to predict numeric values.
For testing, I used autoPrice.arff, which I obtained from here (datasets-numeric.jar). Using ReliefFAttributeEval, I get the following outcome:
Ranked attributes:
**0.05793 8 engine-size**
**0.04976 5 width**
0.0456 7 curb-weight
0.04073 12 horsepower
0.03787 2 normalized-losses
0.03728 3 wheel-base
0.0323 10 stroke
0.03229 9 bore
0.02801 13 peak-rpm
0.02209 15 highway-mpg
0.01555 6 height
0.01488 4 length
0.01356 11 compression-ratio
0.01337 14 city-mpg
0.00739 1 symboling
whereas InfoGainAttributeEval (after applying the numeric-to-nominal filter) gives me the following results:
Ranked attributes:
6.8914 7 curb-weight
5.2409 4 length
5.228 2 normalized-losses
5.0422 12 horsepower
4.7762 6 height
4.6694 3 wheel-base
4.4347 10 stroke
4.3891 9 bore
**4.3388 8 engine-size**
**4.2756 5 width**
4.1509 15 highway-mpg
3.9387 14 city-mpg
3.9011 11 compression-ratio
3.4599 13 peak-rpm
2.2038 1 symboling
My question is: how can I explain the contradiction between the two results? If the two methods use different algorithms to achieve the same goal (revealing the relevance of each attribute to the class), why does one say that, e.g., engine-size is important while the other says not so much?
There is no reason to think that RELIEF and Information Gain (IG) should give identical results, since they measure different things.
IG looks at the difference between two entropies: the entropy of the class variable on its own, and the entropy of the class variable after conditioning on the attribute. Hence, highly informative attributes (with respect to the class variable) will be the most highly ranked.
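To make the definition concrete, here is a minimal sketch of information gain for nominal attributes, IG = H(class) - H(class | attribute). It is not Weka's implementation; the function names `entropy` and `info_gain` and the toy data are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """H(class) minus H(class | attribute) for one nominal attribute."""
    groups = defaultdict(list)
    for a, y in zip(attr_values, labels):
        groups[a].append(y)
    n = len(labels)
    # Conditional entropy: entropy within each attribute value, weighted by its frequency.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Toy data (hypothetical): one attribute determines the class, the other tells us nothing.
labels = ["low", "low", "high", "high"]
perfect = ["a", "a", "b", "b"]
useless = ["a", "b", "a", "b"]

ig_perfect = info_gain(perfect, labels)
ig_useless = info_gain(useless, labels)
print(ig_perfect, ig_useless)
```

One consequence of this definition: an attribute that takes a distinct value on every row leaves each group with a single label, so its conditional entropy is zero and its IG equals the full class entropy. That may be worth keeping in mind when numeric columns are converted to nominal ones value by value.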
RELIEF, however, looks at random data instances and measures how well the feature discriminates between classes by comparing each sampled instance to "nearby" data instances. Note that RELIEF is a more heuristic (i.e., more stochastic) method, and the values and ordering you get depend on several parameters, unlike in IG.
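As a rough illustration of the idea, here is a sketch of the basic Relief scheme for a two-class problem with numeric features: sample an instance, find its nearest neighbour of the same class (the "hit") and of the other class (the "miss"), and reward features that differ on the miss but not on the hit. This is a simplification and not what ReliefFAttributeEval does internally; as far as I know, Weka's evaluator uses several neighbours and treats a numeric class differently. The function `relief`, its parameters, and the toy data are assumptions for this example.

```python
import random

def relief(X, y, n_samples, seed=0):
    """Basic Relief weights for numeric features and a nominal class (sketch)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    span = [(hi[j] - lo[j]) or 1.0 for j in range(d)]  # avoid division by zero

    def diff(j, a, b):
        # Per-feature difference, scaled to [0, 1] by the feature's range.
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(d))

    w = [0.0] * d
    for _ in range(n_samples):
        i = rng.randrange(n)  # the random sampling is one source of variability
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            w[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / n_samples
    return w

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.8, 0.2], [0.9, 0.8], [1.0, 0.4]]
y = [0, 0, 0, 1, 1, 1]

weights = relief(X, y, n_samples=4, seed=1)
print(weights)
```

Changing `n_samples` or `seed` changes which instances are drawn and therefore the exact weights, which is the parameter dependence mentioned above. The IG computation has no such knobs once the data is fixed.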
So we would not expect algorithms optimizing different quantities to give the same results, especially when one is parameter-dependent. However, I'd say that your results are actually pretty similar: e.g., curb-weight and horsepower are pretty close to the top in both methods.
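One way to put a number on how similar the two orderings are is a rank correlation. The sketch below computes Spearman's rho between the two rankings exactly as listed in the question; the helper `spearman_rho` is written for this example and assumes both lists contain the same attributes with no ties.

```python
# Attribute orderings copied from the two outputs in the question (best first).
relief_order = [
    "engine-size", "width", "curb-weight", "horsepower", "normalized-losses",
    "wheel-base", "stroke", "bore", "peak-rpm", "highway-mpg",
    "height", "length", "compression-ratio", "city-mpg", "symboling",
]
ig_order = [
    "curb-weight", "length", "normalized-losses", "horsepower", "height",
    "wheel-base", "stroke", "bore", "engine-size", "width",
    "highway-mpg", "city-mpg", "compression-ratio", "peak-rpm", "symboling",
]

def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two orderings of the same items (no ties)."""
    rank_b = {name: r for r, name in enumerate(order_b, start=1)}
    n = len(order_a)
    d2 = sum((r - rank_b[name]) ** 2 for r, name in enumerate(order_a, start=1))
    return 1 - 6 * d2 / (n * (n * n - 1))

rho = spearman_rho(relief_order, ig_order)
print(f"Spearman rho = {rho:.2f}")  # prints "Spearman rho = 0.46"
```

A value of about 0.46 indicates a moderate positive agreement: the two methods broadly agree on many attributes (horsepower, wheel-base, stroke, bore, symboling sit at nearly the same positions), while a few, such as engine-size, width and length, move a lot.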