Do I leave features with numeric categories as they are or create dummy variables?

I'm working with a dataset that combines numeric features with features that are categories but are encoded as integers. For example, if it were a horse race:

horse_id   race_date    track_no        race_number    barrier_number  won_race  
1          2016-10-01   100             1              4               1
2          2016-10-01   100             1              3               0
1          2016-10-15   200             3              5               0
...

So, if I'm creating a model of a horse's probability of winning a race, and using features like race_number (there can be several races on the same track on the same day, which should affect track conditions) and barrier_number (a horse might prefer the inside barriers or the outside ones, etc.), should I leave those features as they are, or create dummy variables indicating 1 (presence) and 0 (absence) of each value on each row?

This is a trivial example, but these columns could have a large number of possible values, and creating dummy variables would increase the dimensionality of the features a lot. Is that a trade-off one has to make, or will leaving a single column do?

Edit: Also, if I leave the columns as they are and convert them to the category dtype in pandas, is that good practice? Will existing ML libraries like Scikit-learn handle that correctly?

For the features described (race_number, barrier_number), I believe it's perfectly fine to leave them as is. However, for the example above, I would encode the track_no feature.

This is because there is no ordinal relation between the individual track_no values: track 200 is not "twice" track 100, so the integers are only labels.

I would turn the above example into:

horse_id   race_date    track_100      track_200        race_number    barrier_number  won_race  
1          2016-10-01   1              0                1              4               1
2          2016-10-01   1              0                1              3               0
1          2016-10-15   0              1                3              5               0
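
As a rough sketch of how that encoding could be produced, assuming the data is loaded into a pandas DataFrame (the name `df` and the inline sample rows are only for illustration), `pd.get_dummies` can expand track_no while leaving race_number and barrier_number untouched:

```python
import pandas as pd

# Sample rows copied from the question; `df` is a hypothetical name.
df = pd.DataFrame({
    "horse_id": [1, 2, 1],
    "race_date": ["2016-10-01", "2016-10-01", "2016-10-15"],
    "track_no": [100, 100, 200],
    "race_number": [1, 1, 3],
    "barrier_number": [4, 3, 5],
    "won_race": [1, 0, 0],
})

# One-hot encode only track_no; the other integer columns stay numeric.
# prefix="track" yields column names track_100 and track_200.
encoded = pd.get_dummies(df, columns=["track_no"], prefix="track", dtype=int)
print(encoded)
```

Note that `get_dummies` appends the new track_* columns at the end of the frame rather than in the position shown in the table above, so reorder them if the column order matters.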

I hope that helps!
