Regression trees or Random Forest regressor with categorical inputs
I have been trying to use categorical inputs in a regression tree (or Random Forest Regressor), but sklearn keeps returning errors and asking for numerical inputs.
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

MODEL = RandomForestRegressor(n_estimators=100)
MODEL.fit([('a',1,2),('b',2,3),('a',3,2),('b',1,3)], [1,2.5,3,4])  # does not work: ValueError
MODEL.fit([(1,1,2),(2,2,3),(1,3,2),(2,1,3)], [1,2.5,3,4])  # works

MODEL = DecisionTreeRegressor()
MODEL.fit([('a',1,2),('b',2,3),('a',3,2),('b',1,3)], [1,2.5,3,4])  # does not work: ValueError
MODEL.fit([(1,1,2),(2,2,3),(1,3,2),(2,1,3)], [1,2.5,3,4])  # works
To my understanding, categorical inputs should be usable in these methods without any conversion (e.g. WOE substitution).
Has anyone else had this difficulty?
Thanks!
scikit-learn has no dedicated representation for categorical variables (aka factors in R). One possible solution is to encode the strings as ints using LabelEncoder:
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor

X = np.asarray([('a',1,2),('b',2,3),('a',3,2),('c',1,3)])
y = np.asarray([1,2.5,3,4])

# transform the 1st column to integer codes
X[:, 0] = LabelEncoder().fit_transform(X[:, 0])
# np.asarray produced a string array, so cast everything back to float
X = X.astype(float)

regressor = RandomForestRegressor(n_estimators=150, min_samples_split=2)
regressor.fit(X, y)
print(X)
print(regressor.predict(X))
Output:
[[ 0. 1. 2.]
[ 1. 2. 3.]
[ 0. 3. 2.]
[ 2. 1. 3.]]
[ 1.61333333 2.13666667 2.53333333 2.95333333]
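For reference, the integer codes that LabelEncoder assigns can be inspected and reversed after fitting; a minimal standalone sketch (independent of the data above):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['a', 'b', 'a', 'c'])
print(codes)        # one integer code per input value
print(le.classes_)  # categories in sorted order; the index is the code
print(le.inverse_transform(codes))  # recovers the original labels
```

Note that the codes follow the sorted order of the categories, so 'a' becomes 0, 'b' becomes 1, and 'c' becomes 2.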
But remember that this is a slight hack if a and b are independent categories, and it only works with tree-based estimators. Why? Because b is not really bigger than a. The correct way would be to use OneHotEncoder after the LabelEncoder, or pd.get_dummies, yielding separate one-hot encoded columns for X[:, 0], one per category.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X = np.asarray([('a',1,2),('b',2,3),('a',3,2),('c',1,3)])
y = np.asarray([1,2.5,3,4])

# one-hot encode the 1st column: one binary column per category
X_0 = pd.get_dummies(X[:, 0]).values.astype(float)
X = np.column_stack([X_0, X[:, 1:].astype(float)])

regressor = RandomForestRegressor(n_estimators=150, min_samples_split=2)
regressor.fit(X, y)
print(X)
print(regressor.predict(X))
You must dummy-code by hand in Python. I would suggest using pandas.get_dummies() for one-hot encoding. For boosted trees I have had success using factorize() to achieve ordinal encoding.
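A minimal sketch of the factorize() approach: pandas assigns integer codes in order of first appearance (unlike LabelEncoder's sorted order) and also returns the distinct categories.

```python
import pandas as pd

# codes: one integer per value; uniques: categories in order of first appearance
codes, uniques = pd.factorize(['b', 'a', 'b', 'c'])
print(codes)
print(uniques)
```

Here 'b' appears first and gets code 0, 'a' gets code 1, and 'c' gets code 2.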
There is also a whole package for this sort of thing here.
For a more detailed explanation, look at this Data Science Stack Exchange post.