
Regression trees or Random Forest regressor with categorical inputs

I have been trying to use categorical inputs in a regression tree (or Random Forest Regressor), but sklearn keeps returning errors and asking for numerical inputs.

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

MODEL = RandomForestRegressor(n_estimators=100)
MODEL.fit([('a',1,2),('b',2,3),('a',3,2),('b',1,3)], [1,2.5,3,4]) # does not work
MODEL.fit([(1,1,2),(2,2,3),(1,3,2),(2,1,3)], [1,2.5,3,4]) # works

MODEL = DecisionTreeRegressor()
MODEL.fit([('a',1,2),('b',2,3),('a',3,2),('b',1,3)], [1,2.5,3,4]) # does not work
MODEL.fit([(1,1,2),(2,2,3),(1,3,2),(2,1,3)], [1,2.5,3,4]) # works

To my understanding, categorical inputs should be possible in these methods without any conversion (e.g. WOE substitution).

Has anyone else had this difficulty?

Thanks!

scikit-learn has no dedicated representation for categorical variables (a.k.a. factors in R). One possible solution is to encode the strings as ints using LabelEncoder:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor

X = np.asarray([('a',1,2),('b',2,3),('a',3,2),('c',1,3)])
y = np.asarray([1,2.5,3,4])

# transform 1st column to numbers
# (note: X has string dtype, so cast the whole array to float afterwards)
X[:, 0] = LabelEncoder().fit_transform(X[:, 0])
X = X.astype(float)

regressor = RandomForestRegressor(n_estimators=150, min_samples_split=2)
regressor.fit(X, y)
print(X)
print(regressor.predict(X))

Output:

[[ 0.  1.  2.]
 [ 1.  2.  3.]
 [ 0.  3.  2.]
 [ 2.  1.  3.]]
[ 1.61333333  2.13666667  2.53333333  2.95333333]

But remember that this is a slight hack if a and b are independent categories, and it only works with tree-based estimators. Why? Because b is not really bigger than a. The correct way would be to use OneHotEncoder after the LabelEncoder, or pd.get_dummies, yielding two separate one-hot encoded columns for X[:, 0].

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X = np.asarray([('a',1,2),('b',2,3),('a',3,2),('c',1,3)])
y = np.asarray([1,2.5,3,4])

# one-hot encode the 1st column, then stack it back with the numeric columns
# (cast to float: recent pandas returns bool dummies, and X has string dtype)
X_0 = pd.get_dummies(X[:, 0]).values.astype(float)
X = np.column_stack([X_0, X[:, 1:].astype(float)])

regressor = RandomForestRegressor(n_estimators=150, min_samples_split=2)
regressor.fit(X, y)
print(X)
print(regressor.predict(X))

You must dummy code by hand in Python. I would suggest using pandas.get_dummies() for one-hot encoding. For boosted trees I have had success using factorize() to achieve ordinal encoding.
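As a quick illustration of the factorize() approach mentioned above (a sketch, not part of the original answer):

```python
import numpy as np
import pandas as pd

# pd.factorize maps each distinct value to an integer code (ordinal encoding)
# and also returns the array of unique values, in order of first appearance.
codes, uniques = pd.factorize(np.array(['a', 'b', 'a', 'c']))
print(codes)          # [0 1 0 2]
print(list(uniques))  # ['a', 'b', 'c']
```

As noted in the first answer, ordinal codes impose an artificial ordering, but tree-based models can still split out individual categories, which is why this tends to work well for boosted trees.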

There is also a whole package for this sort of thing here.

For a more detailed explanation look in this Data Science Stack Exchange post.
