
Converting numeric feature into categorical feature

I'm working on a problem to forecast future electronics store sales from historical data. One of the features I'm using is item price (float). I've found experimentally that adding it to my existing list of features degrades both fitting and validation accuracy (it increases the prediction RMSE) of my xgboost model. I suspect that the impact of price may be highly non-linear, with peaks at the prices of memory sticks, laptops, cell phones, etc.

Anyway, I got the following idea to cope with this: what if I convert the float item price to a categorical variable, with the ability to specify the mapping, e.g., ranges of values or deciles? Then I could mean-encode that categorical variable using the training target value item price.

Does this make sense? Could you give me a pointer to a Python "linear/decile histogrammer" that, for a list of float quantities, returns a parallel list indicating which bin/decile each float belongs to?

IMHO, you can use pandas' qcut, scikit-learn's KBinsDiscretizer, or pandas' cut.

Some examples:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(10), columns=['a'])
>>> df
          a
0  0.060278
1 -0.618677
2 -0.472467
3  1.539958
4 -0.181974
5  1.563588
6 -1.693140
7  1.868881
8  1.072179
9  0.575978

For qcut (quantile-based bins, so each bin holds roughly the same number of rows):

>>> df['cluster'] = pd.qcut(df.a, 5, labels=range(1, 6))
>>> df
          a cluster
0  0.060278       3
1 -0.618677       1
2 -0.472467       2
3  1.539958       4
4 -0.181974       2
5  1.563588       5
6 -1.693140       1
7  1.868881       5
8  1.072179       4
9  0.575978       3
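
When the bins feed a model, it is usually safer to learn the quantile edges on the training data only and reuse them on validation data. The following is a minimal sketch of that idea, not something from the original answer: `train_price` and `valid_price` are hypothetical series of made-up prices. `retbins=True` makes `qcut` return the edges, which `cut` can then apply to new data.

```python
import numpy as np
import pandas as pd

# Hypothetical price data, generated only for illustration.
rng = np.random.default_rng(0)
train_price = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=1000))
valid_price = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=200))

# Learn decile edges on the training prices only.
# labels=False returns integer bin codes (0..9) instead of intervals.
train_bin, edges = pd.qcut(train_price, q=10, labels=False, retbins=True)

# Open up the outer edges so validation prices outside the training
# range fall into the first or last bin instead of becoming NaN.
edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Apply the training edges to the validation prices.
valid_bin = pd.cut(valid_price, bins=edges, labels=False)
```

Using `q=10` gives the deciles mentioned in the question; `train_bin` and `valid_bin` are parallel to the input series.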

For KBinsDiscretizer:

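>>> from sklearn.preprocessing import KBinsDiscretizer
>>> # fit_transform expects a 2-D array; ravel() flattens the result back to 1-D
>>> df['cluster'] = (
...     KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
...     .fit_transform(df.a.values.reshape(-1, 1))
...     .ravel()
... )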
>>> df
          a  cluster
0  0.060278      1.0
1 -0.618677      0.0
2 -0.472467      0.0
3  1.539958      2.0
4 -0.181974      1.0
5  1.563588      2.0
6 -1.693140      0.0
7  1.868881      2.0
8  1.072179      2.0
9  0.575978      1.0

For cut (equal-width bins over the range of the data):

>>> df['cluster'] = pd.cut(df.a, 5, labels=range(1, 6))
>>> df
          a cluster
0  0.060278       3
1 -0.618677       2
2 -0.472467       2
3  1.539958       5
4 -0.181974       3
5  1.563588       5
6 -1.693140       1
7  1.868881       5
8  1.072179       4
9  0.575978       4
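
`cut` also accepts explicit bin edges, which covers the "ranges of values" case from the question, and the resulting bins can be mean-encoded with a group-by. This is a sketch under stated assumptions: the column names `item_price` and `target`, the edges, and the toy values are all hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical training data; column names and values are illustrative only.
train = pd.DataFrame({
    "item_price": [5.0, 12.0, 18.0, 250.0, 300.0, 900.0, 1100.0, 1500.0],
    "target":     [30.0, 20.0, 10.0, 6.0, 4.0, 3.0, 2.0, 1.0],
})

# Hand-picked price ranges: (0, 50], (50, 500], (500, inf].
edges = [0, 50, 500, float("inf")]
train["price_bin"] = pd.cut(train["item_price"], bins=edges, labels=False)

# Mean-encode: replace each bin by the mean training target within that bin.
bin_means = train.groupby("price_bin")["target"].mean()
train["price_bin_mean"] = train["price_bin"].map(bin_means)
```

Note that computing the means on the same rows the model is trained on leaks the target into the feature; a common remedy is to compute them out-of-fold.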

Finally, it may be useful to look at: What is the difference between pandas.qcut and pandas.cut?
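
In short, `cut` with an integer bin count splits the value range into equal-width intervals, while `qcut` places the edges at quantiles so the bins hold about the same number of rows. A small sketch on made-up, skewed values (the series `s` is an assumption for illustration) makes the difference visible:

```python
import pandas as pd

# Made-up, right-skewed values: most are small, one is very large.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 1000])

# Equal-width bins: the range 1..1000 is split into 4 equal intervals,
# so every small value lands in the first bin.
width_counts = pd.cut(s, 4, labels=False).value_counts().sort_index()

# Equal-frequency bins: the edges are quartiles, so the rows are spread evenly.
freq_counts = pd.qcut(s, 4, labels=False).value_counts().sort_index()

print(width_counts)
print(freq_counts)
```

For a skewed feature such as price, quantile bins tend to avoid the nearly empty bins that equal-width binning can produce.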
