
Training different regressors with sklearn

I have a list of Xs and their output values Ys. Using the following code, I am able to train the following regressors:

  • Linear Regressor
  • Isotonic Regressor
  • Bayesian Ridge Regressor
  • Gradient Boosting Regressor

The code:

import io  # used by get_meteor_scores / get_sts_scores
import re  # used by get_meteor_scores

import numpy as np

from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.isotonic import IsotonicRegression
from sklearn import ensemble
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcess


import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection


def get_meteor_scores(infile):
    with io.open(infile, 'r') as fin:
        meteor_scores = [float(i.strip().split()[-1]) for 
                               i in re.findall(r'Segment [0-9].* score\:.*\n', 
                                               fin.read())]
        return meteor_scores

def get_sts_scores(infile):
    with io.open(infile, 'r') as fin:
        sts_scores = [float(i) for i in fin]
        return sts_scores

Xs = 'meteor.output.train'
Ys = 'score.train'
# Gets scores from https://raw.githubusercontent.com/alvations/USAAR-SemEval-2015/master/task02-USAAR-SHEFFIELD/x.meteor.train
meteor_scores = np.array(get_meteor_scores(Xs))
# Gets scores from https://raw.githubusercontent.com/alvations/USAAR-SemEval-2015/master/task02-USAAR-SHEFFIELD/score.train
sts_scores = np.array(get_sts_scores(Ys))

x = meteor_scores
y = sts_scores
n = len(sts_scores)

# Linear Regression
lr = LinearRegression()
lr.fit(x[:, np.newaxis], y)

# Bayesian Ridge Regression
br = BayesianRidge(compute_score=True)
br.fit(x[:, np.newaxis], y)

# Isotonic Regression
ir = IsotonicRegression()
y_ = ir.fit_transform(x, y)

# Gradient Boosting Regression
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 1,
          'learning_rate': 0.01, 'loss': 'ls'}
gbr = ensemble.GradientBoostingRegressor(**params)
gbr.fit(x[:, np.newaxis], y)

But how do I train regressors for Support Vector Regression, Gaussian Process and Decision Tree Regressor?


When I tried the following to train Support Vector Regressors, I got an error:

from sklearn.svm import SVR
# Support Vector Regressions
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr_lin = SVR(kernel='linear', C=1e3)
svr_poly = SVR(kernel='poly', C=1e3, degree=2)
y_rbf = svr_rbf.fit(x, y)
y_lin = svr_lin.fit(x, y)
y_poly = svr_poly.fit(x, y)

[out]:

Traceback (most recent call last):
  File "/home/alvas/git/USAAR-SemEval-2015/task02-somethingLiddat/carolling.py", line 47, in <module>
    y_rbf = svr_rbf.fit(x, y)
  File "/home/alvas/.local/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
    (X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 10597.

The same happens when I try Gaussian Process:

from sklearn.gaussian_process import GaussianProcess
# Gaussian Process
gp = GaussianProcess(corr='squared_exponential', theta0=1e-1,
                     thetaL=1e-3, thetaU=1,
                     random_start=100)
gp.fit(x, y)

[out]:

Traceback (most recent call last):
  File "/home/alvas/git/USAAR-SemEval-2015/task02-somethingLiddat/carolling.py", line 57, in <module>
    gp.fit(x, y)
  File "/home/alvas/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gaussian_process.py", line 271, in fit
    X, y = check_arrays(X, y)
  File "/home/alvas/.local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 254, in check_arrays
    % (size, n_samples))
ValueError: Found array with dim 10597. Expected 1

When running gp.fit(x[:, np.newaxis], y) instead, I get this error:

Traceback (most recent call last):
  File "/home/alvas/git/USAAR-SemEval-2015/task02-somethingLiddat/carolling.py", line 95, in <module>
    gp.fit(x[:,np.newaxis], y) 
  File "/home/alvas/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gaussian_process.py", line 301, in fit
    raise Exception("Multiple input features cannot have the same"
Exception: Multiple input features cannot have the same target value.

When I tried Decision Tree Regressor:

from sklearn.tree import DecisionTreeRegressor
# Decision Tree Regression
dtr2 = DecisionTreeRegressor(max_depth=2)
dtr5 = DecisionTreeRegressor(max_depth=5)
dtr2.fit(x,y)
dtr5.fit(x,y)

[out]:

Traceback (most recent call last):
  File "/home/alvas/git/USAAR-SemEval-2015/task02-somethingLiddat/carolling.py", line 47, in <module>
    dtr2.fit(x,y)
  File "/home/alvas/.local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 140, in fit
    n_samples, self.n_features_ = X.shape
ValueError: need more than 1 value to unpack

All of these regressors require a two-dimensional X array of shape (n_samples, n_features), but your x is a 1D array. So the only requirement is to convert x into a 2D array for these regressors to work. This can be achieved using x[:, np.newaxis].
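To make the shape difference concrete, here is a small sketch that uses only NumPy. The array `x` is made-up sample data standing in for the scores in the question, and `x.reshape(-1, 1)` is shown as an equivalent spelling of `x[:, np.newaxis]`.

```python
import numpy as np

# Hypothetical stand-in for the 1D array of scores from the question.
x = np.arange(10, dtype=float)

# A 1D array has shape (n_samples,); the estimators expect (n_samples, n_features).
x_2d = x[:, np.newaxis]      # add a feature axis
x_2d_alt = x.reshape(-1, 1)  # equivalent spelling

print(x.shape)     # (10,)
print(x_2d.shape)  # (10, 1)
```

Both forms give one row per sample and a single feature column, which is the layout the fit calls need.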

Demo:

>>> from sklearn.svm import SVR
>>> # Support Vector Regressions
... svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
>>> svr_lin = SVR(kernel='linear', C=1e3)
>>> svr_poly = SVR(kernel='poly', C=1e3, degree=2)
>>> x=np.arange(10)
>>> y=np.arange(10)
>>> y_rbf = svr_rbf.fit(x[:,np.newaxis], y)  
>>> y_lin = svr_lin.fit(x[:,np.newaxis], y)
>>> svr_poly = svr_poly.fit(x[:,np.newaxis], y)
>>> from sklearn.gaussian_process import GaussianProcess
>>> # Gaussian Process
... gp = GaussianProcess(corr='squared_exponential', theta0=1e-1,
...                      thetaL=1e-3, thetaU=1,
...                      random_start=100)
>>> gp.fit(x[:, np.newaxis], y)
GaussianProcess(beta0=None,
        corr=<function squared_exponential at 0x7f46f3ebcf50>,
        normalize=True, nugget=array(2.220446049250313e-15),
        optimizer='fmin_cobyla', random_start=100,
        random_state=<mtrand.RandomState object at 0x7f4702d97150>,
        regr=<function constant at 0x7f46f3ebc8c0>, storage_mode='full',
        theta0=array([[ 0.1]]), thetaL=array([[ 0.001]]),
        thetaU=array([[1]]), verbose=False)
>>> from sklearn.tree import DecisionTreeRegressor
>>> # Decision Tree Regression
... dtr2 = DecisionTreeRegressor(max_depth=2)
>>> dtr5 = DecisionTreeRegressor(max_depth=5)
>>> dtr2.fit(x[:,np.newaxis],y)
DecisionTreeRegressor(compute_importances=None, criterion='mse', max_depth=2,
           max_features=None, min_density=None, min_samples_leaf=1,
           min_samples_split=2, random_state=None, splitter='best')
>>> dtr5.fit(x[:,np.newaxis],y)
DecisionTreeRegressor(compute_importances=None, criterion='mse', max_depth=5,
           max_features=None, min_density=None, min_samples_leaf=1,
           min_samples_split=2, random_state=None, splitter='best')
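The same pattern can be written as a script that reshapes once and fits several regressors in a loop. This is only a sketch: it assumes Python 3 and a scikit-learn installation that provides `LinearRegression`, `SVR` and `DecisionTreeRegressor`, the toy `x` and `y` are invented, and the dictionary keys are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Made-up toy data standing in for the question's scores.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

X = x.reshape(-1, 1)  # shape (10, 1): one feature per sample

# Illustrative names; the estimators mirror those used in the demo.
regressors = {
    "linear": LinearRegression(),
    "svr_rbf": SVR(kernel="rbf", C=1e3, gamma=0.1),
    "tree_depth2": DecisionTreeRegressor(max_depth=2),
}

predictions = {}
for name, model in regressors.items():
    model.fit(X, y)                        # X is 2D, y stays 1D
    predictions[name] = model.predict(X)   # 1D array, one value per sample
    print(name, predictions[name].shape)
```

Reshaping once up front avoids repeating `x[:, np.newaxis]` in every call.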

Preprocessing for GaussianProcess:

xu = np.unique(x)  # get unique x values
idx = [np.where(x==x1)[0][0] for x1 in xu]  # get corresponding indices for unique x values
gp.fit(xu[:,np.newaxis], y[idx])  # y[idx] selects y values corresponding to unique x values
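The same de-duplication can be written with `np.unique(..., return_index=True)`, which returns the index of the first occurrence of each unique value. Keeping only the first y discards the other targets recorded for that x. Averaging them is one possible alternative; it is shown here as an assumption about what might be preferable, not something the answer above proposes. The arrays are made up for illustration.

```python
import numpy as np

# Made-up data with repeated x values (hypothetical, not the asker's dataset).
x = np.array([0.2, 0.5, 0.2, 0.9, 0.5])
y = np.array([1.0, 2.0, 3.0, 4.0, 6.0])

# Option 1: keep the first y seen for each unique x.
xu, first_idx = np.unique(x, return_index=True)
y_first = y[first_idx]

# Option 2: average all y values that share the same x.
xu, inverse = np.unique(x, return_inverse=True)
y_mean = np.bincount(inverse, weights=y) / np.bincount(inverse)

print(xu)       # [0.2 0.5 0.9]
print(y_first)  # [1. 2. 4.]
print(y_mean)   # [2. 4. 4.]
```

Either result can then be passed on as `xu[:, np.newaxis]` together with the matching y array.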

Multiple input features cannot have the same target value.

This means that at least one data point is repeated in your input data, and the Gaussian process does not allow the same input to be listed twice. Unfortunately, your dataset is no longer available, so I cannot check this, but that is what I think is happening.
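Since the original dataset cannot be inspected, one way to check this hypothesis on your own data is to count how often each x value occurs. The sketch below uses only NumPy, and the array is invented for illustration.

```python
import numpy as np

# Made-up inputs (hypothetical); 0.5 appears twice.
x = np.array([0.1, 0.5, 0.7, 0.5, 0.9])

values, counts = np.unique(x, return_counts=True)
duplicated = values[counts > 1]  # x values that occur more than once

print(len(values) < len(x))  # True when at least one x value repeats
print(duplicated)            # [0.5]
```

If `duplicated` is non-empty, the inputs contain repeats and would trigger the error quoted above.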
