Do I leave features with numeric categories as they are or create dummy variables?
I'm working with a dataset that combines numeric features with categorical features that are encoded as integers. For example, if it were a horse race:
horse_id  race_date   track_no  race_number  barrier_number  won_race
1         2016-10-01  100       1            4               1
2         2016-10-01  100       1            3               0
1         2016-10-15  200       3            5               0
...
So, if I'm building a model of a horse's probability of winning a race and using features like race_number (there can be several races on the same track on the same day, which should affect track conditions) and barrier_number (a horse might prefer the inside barriers or the outside ones, etc.), should I leave those features as they are, or create dummy variables indicating 1 (presence) and 0 (absence) of each value on each row?
This is a trivial example, but these columns could have a large number of possible values, and creating dummy variables would increase the dimensionality of the feature set considerably. Is that a tradeoff one has to make, or will leaving a single column do?
Edit: Also, if I leave the columns as they are and convert them to the category dtype in pandas, is that good practice? Will existing ML libraries like Scikit-learn handle that correctly?
For the features described (race_number, barrier_number), I believe it's perfectly fine to leave them as is. However, for the example above, I would encode the track_no feature.
This is because the individual track_no values have no numeric or ordinal relationship to one another; they are just labels.
I would turn the above example into:
horse_id  race_date   track_100  track_200  race_number  barrier_number  won_race
1         2016-10-01  1          0          1            4               1
2         2016-10-01  1          0          1            3               0
1         2016-10-15  0          1          3            5               0
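As a rough sketch of how that table could be produced, here is one way to do it with pandas, assuming the data is loaded into a DataFrame (the name `df` and the hand-typed rows are just for illustration). It uses `pd.get_dummies` on track_no only and leaves race_number and barrier_number untouched:

```python
import pandas as pd

# The three example rows from the question (typed in by hand for illustration).
df = pd.DataFrame({
    "horse_id": [1, 2, 1],
    "race_date": ["2016-10-01", "2016-10-01", "2016-10-15"],
    "track_no": [100, 100, 200],
    "race_number": [1, 1, 3],
    "barrier_number": [4, 3, 5],
    "won_race": [1, 0, 0],
})

# One-hot encode only track_no; the other integer columns stay numeric.
# prefix="track" gives columns named track_100 and track_200.
encoded = pd.get_dummies(df, columns=["track_no"], prefix="track", dtype=int)

print(encoded)
```

Note that `get_dummies` appends the new track_100 and track_200 columns at the end of the frame, so the column order differs from the table above, but the values are the same.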
I hope that helps!