
Input for scikit-learn random forest

Original post 2014-01-27 09:15:36 8 2 python/ scikit-learn

I am trying to predict the outcome of tennis matches, just as a fun side project. I'm using a random forest regressor to do this. One of the features is the ranking of the player before a specific match. For many matches I don't have a ranking (I only have the top 200 ranked players). The question is: is it better to put in a value that is not an integer, such as the string "NoRank", or an integer that is beyond the range 1-200? Considering the learning algorithm, I'm inclined to use the value 201, but I would like to hear your opinions on this. Thanks!

2 answers

Unfortunately, scikit-learn random forests do not support missing values. If you think that unranked players are likely to perform worse, on average, than players ranked 200, then imputing a rank of 201 makes sense.
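A minimal sketch of that imputation, assuming missing ranks are stored as `np.nan`. The rank and target values are made up for illustration, and the sentinel `UNRANKED = 201` encodes the assumption that unranked players sit just below rank 200.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up ranks for illustration; np.nan marks players outside the top 200.
rank = np.array([3.0, 57.0, np.nan, 200.0, np.nan, 12.0])
# Made-up regression target (some numeric match outcome).
y = np.array([0.9, 0.6, 0.2, 0.4, 0.1, 0.8])

UNRANKED = 201  # assumption: unranked players are treated as just below rank 200

# Replace every missing rank with the sentinel value.
rank_filled = np.where(np.isnan(rank), UNRANKED, rank)

X = rank_filled.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)
predictions = model.predict(X)
```

Because tree splits are thresholds on the feature, the forest can still separate the 201 group from ranked players if that helps the fit.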

Note: all scikit-learn models expect homogeneous numerical input features, not string labels or other Python objects. If you have string labels as features, you first need to find the right feature extraction strategy depending on the meaning of your string features (e.g. categorical variable identifiers, or free text to be extracted as a bag of words).
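For the categorical case, one common option is one-hot encoding. The sketch below uses scikit-learn's `OneHotEncoder` on a hypothetical `surface` feature (court surface), which is not part of the original question and only serves as an example of a string-valued column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature: one string label per match.
surface = np.array([["clay"], ["grass"], ["hard"], ["clay"]])

# Each distinct label becomes its own 0/1 column; labels unseen at fit time
# are encoded as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
surface_encoded = encoder.fit_transform(surface).toarray()

# Column order follows the sorted categories: clay, grass, hard.
print(encoder.categories_)
print(surface_encoded)
```

The resulting numeric columns can be stacked next to the other features before fitting the forest.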

I would be careful with just assigning 201 (or any other value) to the unranked players. Random forests split on thresholds and do not depend on the scale of the feature (see "Do I need to normalize (or scale) data for randomForest (R package)?"), which means a split may group 200 together with 201, or it may not. You are basically faking data that you do not have.

I would add another column, "haverank", and use 0/1 for it: 0 for players without a rank, 1 for players with a rank.

Call it "highrank" if that name sounds better. You can also add another column named "veryhighrank" and give the value 1 to all players ranked 1-50, and so on.
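The indicator columns could be built roughly like this. The ranks are made up, missing ranks are assumed to be stored as `np.nan`, and the placeholder `0` in the rank column is an arbitrary choice of this sketch (the answer does not say what to put there); the "haverank" column is what tells the model the rank is absent.

```python
import numpy as np

# Made-up ranks for illustration; np.nan marks unranked players.
rank = np.array([3.0, 57.0, np.nan, 200.0, np.nan])

# "haverank": 1 if the player has a rank, 0 otherwise.
have_rank = (~np.isnan(rank)).astype(int)

# "veryhighrank": 1 for ranks 1-50. Comparisons with NaN evaluate to False,
# so unranked players get 0 here.
very_high_rank = (rank <= 50).astype(int)

# The rank column still needs a numeric value for scikit-learn;
# 0 is an arbitrary placeholder chosen for this sketch.
rank_filled = np.where(np.isnan(rank), 0, rank)

# Final feature matrix: one row per player, three columns.
X = np.column_stack([rank_filled, have_rank, very_high_rank])
print(X)
```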