Multiple Linear Regression using ScikitLearn, different approaches give different answers

This is probably as valid on the stats Stack Exchange as here (it could be the stats or the Python that I'm unsure about).

Suppose I have two independent variables, X and Y, that explain some of the variance of Z.

    from sklearn.linear_model import LinearRegression
    import numpy as np
    from scipy.stats import pearsonr,linregress

    Z = np.array([1,3,5,6,7,8,9,7,10,9])

    X  = np.array([2,5,3,1,6,4,7,8,6,7])
    Y  = np.array([3,2,6,4,6,1,2,5,6,10])

I want to regress out the variability in X and Y from Z. There are two approaches that I know of:

The first is to regress out X from Z (form a linear regression of Z on X and find the residual), then repeat for Y on that residual, such that:

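    # Step 1: regress Z on X and keep the residual
    regr = linregress(X, Z)
    resi_1 = Z - (X*regr[0] + regr[1])  # residual = y - (m*x + c)

    # Step 2: regress that residual on Y and keep what is left
    regr = linregress(Y, resi_1)
    resi_2 = resi_1 - (Y*regr[0] + regr[1])  # residual = y - (m*x + c)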

Here resi_2 is the remainder of Z after X and Y have been sequentially regressed out.

The alternative is to create a multiple linear regression model in which X and Y jointly predict Z:

    regr = LinearRegression()
    Model = regr.fit(np.array((X,Y)).swapaxes(0,1),Z)

    pred = Model.predict(np.array((X,Y)).swapaxes(0,1))
    resi_3 = Z - pred

The residual from the sequential approach, resi_2, and the residual from the multiple linear regression, resi_3, are very similar (correlation = 0.97) but not equivalent. The two residuals are plotted below: [image: plot of the two residuals]

Any thoughts would be great (I'm not a statistician, so this could be a gap in my understanding rather than a Python problem!). Note that if, in the first approach, I regress out Y first and then X, I get different residuals.
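One way to probe the mismatch, offered as a sketch rather than a definitive answer: by the Frisch-Waugh-Lovell theorem, sequential residualisation reproduces the joint fit only if Y is itself residualised on X before the second regression. The helper `residual` and the variable names below are illustrative and not part of the original post.

```python
import numpy as np
from scipy.stats import linregress
from sklearn.linear_model import LinearRegression

Z = np.array([1, 3, 5, 6, 7, 8, 9, 7, 10, 9])
X = np.array([2, 5, 3, 1, 6, 4, 7, 8, 6, 7])
Y = np.array([3, 2, 6, 4, 6, 1, 2, 5, 6, 10])


def residual(x, y):
    """Residual of a simple linear regression of y on x (hypothetical helper)."""
    fit = linregress(x, y)
    return y - (fit.slope * x + fit.intercept)


# Plain sequential approach from the question: X out of Z, then raw Y out of the rest
resi_seq = residual(Y, residual(X, Z))

# Frisch-Waugh-Lovell variant: remove X from both Z and Y before the second step
z_given_x = residual(X, Z)
y_given_x = residual(X, Y)
resi_fwl = residual(y_given_x, z_given_x)

# Joint multiple regression of Z on X and Y
XY = np.column_stack((X, Y))
resi_joint = Z - LinearRegression().fit(XY, Z).predict(XY)

print("FWL matches joint fit:       ", np.allclose(resi_fwl, resi_joint))
print("plain sequential matches fit:", np.allclose(resi_seq, resi_joint))
```

Since X and Y are correlated in this sample, the plain sequential residual still carries a component along X, which is also consistent with the order of the two steps changing the result.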

Here is an example 3D graphical surface fitter using your data and scipy's curve_fit() routine, with scatter, surface, and contour plots. You should be able to click-drag the 3D plots to rotate them in 3-space and see that the data does not appear to lie on any sort of smooth surface, so the flat plane model used here, "z = (a * x) + (b * y) + c", is pretty much no better or worse than any other model for this data.

fitted prameters [ 0.65963199  0.18537117  2.43363301]
RMSE: 2.11487214206
R-squared: 0.383078044516
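Because the plane is linear in a, b and c, the fitted parameters above can be cross-checked without an iterative solver. The following is a minimal sketch using `numpy.linalg.lstsq` on a design matrix with a column of ones for the intercept; the variable names are illustrative.

```python
import numpy as np

x = np.array([2.0, 5.0, 3.0, 1.0, 6.0, 4.0, 7.0, 8.0, 6.0, 7.0])
y = np.array([3.0, 2.0, 6.0, 4.0, 6.0, 1.0, 2.0, 5.0, 6.0, 10.0])
z = np.array([1.0, 3.0, 5.0, 6.0, 7.0, 8.0, 9.0, 7.0, 10.0, 9.0])

# Design matrix: one column each for a and b, plus a column of ones for c
A = np.column_stack((x, y, np.ones_like(x)))

# Ordinary least squares solution of A @ [a, b, c] ~= z
coeffs, _, _, _ = np.linalg.lstsq(A, z, rcond=None)
a, b, c = coeffs

print('a, b, c:', a, b, c)
```

The result should agree with the curve_fit() values to within solver tolerance, since both minimise the same sum of squared errors.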

[images: scatter plot, surface plot, contour plot]

import numpy, scipy, scipy.optimize
import matplotlib
from mpl_toolkits.mplot3d import  Axes3D
from matplotlib import cm # to colormap 3D surfaces from blue to red
import matplotlib.pyplot as plt

graphWidth = 800 # units are pixels
graphHeight = 600 # units are pixels

# 3D contour plot lines
numberOfContourLines = 16


def SurfacePlot(func, data, fittedParameters):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)

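    # create the 3D axes via add_subplot; recent matplotlib versions do not attach a bare Axes3D(f) to the figure
    axes = f.add_subplot(111, projection='3d')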

    x_data = data[0]
    y_data = data[1]
    z_data = data[2]

    xModel = numpy.linspace(min(x_data), max(x_data), 20)
    yModel = numpy.linspace(min(y_data), max(y_data), 20)
    X, Y = numpy.meshgrid(xModel, yModel)

    Z = func(numpy.array([X, Y]), *fittedParameters)

    axes.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.coolwarm, linewidth=1, antialiased=True)

    axes.scatter(x_data, y_data, z_data) # show data along with plotted surface

    axes.set_title('Surface Plot (click-drag with mouse)') # add a title for surface plot
    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label
    axes.set_zlabel('Z Data') # Z axis data label

    plt.show()
    plt.close('all') # clean up after using pyplot or else there can be memory and process problems


def ContourPlot(func, data, fittedParameters):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    x_data = data[0]
    y_data = data[1]
    z_data = data[2]

    xModel = numpy.linspace(min(x_data), max(x_data), 20)
    yModel = numpy.linspace(min(y_data), max(y_data), 20)
    X, Y = numpy.meshgrid(xModel, yModel)

    Z = func(numpy.array([X, Y]), *fittedParameters)

    axes.plot(x_data, y_data, 'o')

    axes.set_title('Contour Plot') # add a title for contour plot
    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label

    CS = matplotlib.pyplot.contour(X, Y, Z, numberOfContourLines, colors='k')
    matplotlib.pyplot.clabel(CS, inline=1, fontsize=10) # labels for contours

    plt.show()
    plt.close('all') # clean up after using pyplot or else there can be memory and process problems


def ScatterPlot(data):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)

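    # create the 3D axes via add_subplot; recent matplotlib versions do not attach a bare Axes3D(f) to the figure
    axes = f.add_subplot(111, projection='3d')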
    x_data = data[0]
    y_data = data[1]
    z_data = data[2]

    axes.scatter(x_data, y_data, z_data)

    axes.set_title('Scatter Plot (click-drag with mouse)')
    axes.set_xlabel('X Data')
    axes.set_ylabel('Y Data')
    axes.set_zlabel('Z Data')

    plt.show()
    plt.close('all') # clean up after using pyplot or else there can be memory and process problems


def func(data, a, b, c): # example flat surface
    x = data[0]
    y = data[1]
    return (a * x) + (b * y) + c


if __name__ == "__main__":

    xData = numpy.array([2.0, 5.0, 3.0, 1.0, 6.0, 4.0, 7.0, 8.0, 6.0, 7.0])
    yData = numpy.array([3.0, 2.0, 6.0, 4.0, 6.0, 1.0, 2.0, 5.0, 6.0, 10.0])
    zData = numpy.array([1.0, 3.0, 5.0, 6.0, 7.0, 8.0, 9.0, 7.0, 10.0, 9.0])

    data = [xData, yData, zData]

    initialParameters = [1.0, 1.0, 1.0] # these are the same as scipy default values in this example

    # fit the surface with scipy's curve_fit(), a non-linear least-squares solver (this plane model is linear in a, b, c)
    fittedParameters, pcov = scipy.optimize.curve_fit(func, [xData, yData], zData, p0 = initialParameters)

    ScatterPlot(data)
    SurfacePlot(func, data, fittedParameters)
    ContourPlot(func, data, fittedParameters)

    print('fitted prameters', fittedParameters)

    modelPredictions = func(data, *fittedParameters) 

    absError = modelPredictions - zData

    SE = numpy.square(absError) # squared errors
    MSE = numpy.mean(SE) # mean squared errors
    RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
    Rsquared = 1.0 - (numpy.var(absError) / numpy.var(zData))
    print('RMSE:', RMSE)
    print('R-squared:', Rsquared)
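The error metrics computed by hand above can also be cross-checked with scikit-learn. This is a sketch assuming scikit-learn is installed; it refits the plane with `LinearRegression` instead of reusing `fittedParameters`, so that it stands alone, and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

xData = np.array([2.0, 5.0, 3.0, 1.0, 6.0, 4.0, 7.0, 8.0, 6.0, 7.0])
yData = np.array([3.0, 2.0, 6.0, 4.0, 6.0, 1.0, 2.0, 5.0, 6.0, 10.0])
zData = np.array([1.0, 3.0, 5.0, 6.0, 7.0, 8.0, 9.0, 7.0, 10.0, 9.0])

# One row per observation, one column per predictor
features = np.column_stack((xData, yData))

model = LinearRegression().fit(features, zData)
predictions = model.predict(features)

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(zData, predictions))
r_squared = r2_score(zData, predictions)

print('RMSE:', rmse)
print('R-squared:', r_squared)
```

`r2_score` uses 1 - SS_res/SS_tot, which coincides with the variance-based formula in the script here because least-squares residuals from a model with an intercept have zero mean.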
