Converting numeric feature into categorical feature

Question

I'm working on a problem to forecast future electronics store sales from historical data. One of the features I'm using is item price (float). I've found experimentally that adding it to my existing list of features degrades both fitting and validation accuracy (increases prediction RMSE) of my XGBoost model. I suspect that the impact of price may be highly non-linear, with peaks at the prices of memory sticks, laptops, cell phones, etc.

I had the following idea to cope with this: what if I convert the float item price to a categorical variable, with the ability to specify the mapping, e.g., ranges of values or deciles? Then I could mean-encode the binned item price using the training target value.

Does this make sense? Could you give me a pointer to a Python "linear/decile histogrammer" that, for a list of float quantities, returns a parallel list of which bin/decile each float belongs to?

Answer 1

IMHO, you can use qcut, KBinsDiscretizer, or cut.

Some examples:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(10), columns=['a'])
>>> df
          a
0  0.060278
1 -0.618677
2 -0.472467
3  1.539958
4 -0.181974
5  1.563588
6 -1.693140
7  1.868881
8  1.072179
9  0.575978

For qcut (equal-frequency bins),

>>> df['cluster'] = pd.qcut(df.a, 5, labels=range(1, 6))
>>> df
          a cluster
0  0.060278       3
1 -0.618677       1
2 -0.472467       2
3  1.539958       4
4 -0.181974       2
5  1.563588       5
6 -1.693140       1
7  1.868881       5
8  1.072179       4
9  0.575978       3

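For KBinsDiscretizer,

>>> from sklearn.preprocessing import KBinsDiscretizer
>>> df['cluster'] = (
...     KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
...     .fit_transform(df.a.values.reshape(-1, 1))
...     .ravel())  # flatten the (n, 1) result into a single column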
>>> df
          a  cluster
0  0.060278      1.0
1 -0.618677      0.0
2 -0.472467      0.0
3  1.539958      2.0
4 -0.181974      1.0
5  1.563588      2.0
6 -1.693140      0.0
7  1.868881      2.0
8  1.072179      2.0
9  0.575978      1.0
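
One reason to prefer KBinsDiscretizer for a forecasting setup is that it can learn the bin edges on the training prices and then reuse them on later data. A minimal sketch, assuming made-up price values (`train_prices` and `new_prices` are illustrative names, not from the question):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical prices, for illustration only.
train_prices = np.array(
    [5.0, 7.5, 12.0, 30.0, 250.0, 400.0, 650.0, 900.0, 950.0]
).reshape(-1, 1)
new_prices = np.array([1.0, 300.0, 5000.0]).reshape(-1, 1)

# Learn the quantile bin edges from the training prices only.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
binner.fit(train_prices)

# Reuse the same edges on unseen prices; values outside the
# training range are assigned to the first or last bin.
new_bins = binner.transform(new_prices).ravel().astype(int)
print(new_bins)
```

The learned edges are available in `binner.bin_edges_` if you want to inspect them.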

For cut (equal-width bins),

>>> df['cluster'] = pd.cut(df.a, 5, labels=range(1, 6))
>>> df
          a cluster
0  0.060278       3
1 -0.618677       2
2 -0.472467       2
3  1.539958       5
4 -0.181974       3
5  1.563588       5
6 -1.693140       1
7  1.868881       5
8  1.072179       4
9  0.575978       4
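
Since the question also asks about specifying the mapping as ranges of values, note that cut accepts an explicit list of bin edges instead of a bin count. A small sketch; the prices, edges, and label names below are invented for illustration:

```python
import pandas as pd

# Hypothetical item prices.
prices = pd.Series([4.99, 19.99, 89.0, 349.0, 799.0, 1499.0])

# Hand-picked price ranges; the last bin is open-ended.
edges = [0, 25, 100, 500, 1000, float("inf")]
labels = ["accessory", "small", "mid", "large", "premium"]

# right=False makes each bin closed on the left: [0, 25), [25, 100), ...
price_range = pd.cut(prices, bins=edges, labels=labels, right=False)
print(price_range.tolist())
```

This gives you full control over where the boundaries fall, which may suit prices that cluster around a few product types.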

Finally, it may be useful to look at: What is the difference between pandas.qcut and pandas.cut?
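
As for the mean-encoding idea in the question, one possible way to combine it with qcut is sketched below. The data and the column names (`item_price`, `sales`, `price_bin`) are assumptions made up for this example, and 5 bins are used only because the toy data is tiny.

```python
import numpy as np
import pandas as pd

# Hypothetical training data: item price and the sales target.
train = pd.DataFrame({
    "item_price": [5.0, 7.5, 12.0, 30.0, 250.0, 400.0, 650.0, 900.0, 950.0, 1200.0],
    "sales": [40, 35, 30, 22, 9, 7, 12, 15, 14, 3],
})

# Quantile bins on the training prices; retbins=True also returns the edges.
train["price_bin"], edges = pd.qcut(
    train["item_price"], q=5, labels=False, retbins=True
)

# Mean encoding: replace each bin by the mean target observed in that bin.
bin_means = train.groupby("price_bin")["sales"].mean()
train["price_bin_mean"] = train["price_bin"].map(bin_means)

# Reuse the training edges on new prices; open the outer edges so
# prices outside the training range still land in a bin.
edges[0], edges[-1] = -np.inf, np.inf
new_prices = pd.Series([6.0, 450.0, 5000.0])
new_bins = pd.cut(new_prices, bins=edges, labels=False)
new_encoded = new_bins.map(bin_means)
print(new_encoded.tolist())
```

Keep in mind that computing the bin means on the same rows you train on can leak the target into the feature, so computing them out-of-fold is usually safer.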

1 answer

solution1
2 ACCEPTED 2020-01-03 21:50:46
