
Converting a numeric feature into a categorical feature

I'm working on a problem to forecast future electronic store sales from historical data. One of the features I'm using is item price (float). I've found experimentally that adding it to my existing list of features degrades both the fitting and validation accuracy (increases prediction RMSE) of my xgboost model. I suspect that the impact of price may be highly non-linear, with peaks at the prices of memory sticks, laptops, cell phones, etc.

Anyway, I had the following idea to cope with this: what if I convert the float item price to a categorical variable, with the ability to specify the mapping, e.g., ranges of values or deciles? Then I could mean-encode that categorical variable using the training target value item price.

Does this make sense? Could you point me to a Python "linear/decile histogrammer" that, given a list of floats, returns a parallel list saying which bin/decile each float belongs to?

IMHO, you can use pandas.qcut, scikit-learn's KBinsDiscretizer, or pandas.cut.

Some examples:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(10), columns=['a'])
>>> df
          a
0  0.060278
1 -0.618677
2 -0.472467
3  1.539958
4 -0.181974
5  1.563588
6 -1.693140
7  1.868881
8  1.072179
9  0.575978

For qcut (equal-frequency bins):

>>> df['cluster'] = pd.qcut(df.a, 5, labels=range(1, 6))
>>> df
          a cluster
0  0.060278       3
1 -0.618677       1
2 -0.472467       2
3  1.539958       4
4 -0.181974       2
5  1.563588       5
6 -1.693140       1
7  1.868881       5
8  1.072179       4
9  0.575978       3
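
If the decile edges should be learned on the training data and then reused on validation data, something along these lines may work. It is only a sketch: the price values and the names train_price and valid_price are made up for illustration, and widening the outer edges to infinity is one possible way to handle prices outside the training range.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train_price = pd.Series(rng.uniform(1, 2000, size=1000))  # made-up training prices
valid_price = pd.Series(rng.uniform(1, 2000, size=200))   # made-up validation prices

# Learn the decile edges from the training data only.
# labels=False returns the integer bin index (0..9) for each price.
train_decile, edges = pd.qcut(train_price, 10, labels=False, retbins=True)

# Replace the outer edges with +/- infinity so that unseen prices outside
# the training range still fall into the first or last bin.
edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Apply the same edges to the validation prices.
valid_decile = pd.cut(valid_price, bins=edges, labels=False)
```

Both train_decile and valid_decile are Series parallel to the input, which is the "which decile does each float belong to" list asked for in the question.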

For KBinsDiscretizer:

>>> from sklearn.preprocessing import KBinsDiscretizer
>>> df['cluster'] = (
...     KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
...     .fit_transform(df.a.values.reshape(-1, 1))
...     .ravel())
>>> df
          a  cluster
0  0.060278      1.0
1 -0.618677      0.0
2 -0.472467      0.0
3  1.539958      2.0
4 -0.181974      1.0
5  1.563588      2.0
6 -1.693140      0.0
7  1.868881      2.0
8  1.072179      2.0
9  0.575978      1.0

For cut (equal-width bins):

>>> df['cluster'] = pd.cut(df.a, 5, labels=range(1, 6))
>>> df
          a cluster
0  0.060278       3
1 -0.618677       2
2 -0.472467       2
3  1.539958       5
4 -0.181974       3
5  1.563588       5
6 -1.693140       1
7  1.868881       5
8  1.072179       4
9  0.575978       4
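
Since the question also asks about specifying the mapping by ranges of values, note that pd.cut accepts an explicit list of bin edges instead of a bin count. A minimal sketch, where the prices, edges and labels are invented purely for illustration:

```python
import numpy as np
import pandas as pd

prices = pd.Series([4.99, 19.0, 250.0, 899.0, 1500.0])  # made-up prices

# Hand-picked edges; each bin is (left, right], so 10.0 would land in '<10'.
edges = [0, 10, 100, 500, 1000, np.inf]
labels = ['<10', '10-100', '100-500', '500-1000', '1000+']

price_band = pd.cut(prices, bins=edges, labels=labels)
print(price_band.tolist())
```

The result is a categorical Series, so it can be passed to an encoder or converted to integer codes with price_band.cat.codes.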

Finally, it may be useful to look at: What is the difference between pandas.qcut and pandas.cut?
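
As for mean-encoding the binned price, one way it could look is sketched below. The column names item_price and target and the random data are assumptions for illustration, not taken from the asker's dataset. Also note that encoding training rows with a mean that includes their own target leaks information, so in practice an out-of-fold scheme is usually preferred; this sketch shows only the basic mechanics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train = pd.DataFrame({
    'item_price': rng.uniform(1, 2000, size=500),       # made-up prices
    'target': rng.poisson(3, size=500).astype(float),   # made-up sales target
})

# Bin the price into deciles (integer codes 0..9).
train['price_bin'] = pd.qcut(train['item_price'], 10, labels=False)

# Mean of the target within each price bin, computed on training data.
bin_means = train.groupby('price_bin')['target'].mean()

# Replace each bin code by the mean target of that bin.
train['price_bin_mean'] = train['price_bin'].map(bin_means)
```

Validation rows would be binned with the training edges and mapped through the same bin_means, falling back to the overall training mean for any bin that has no training rows.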


 