
Why is Gradient Boosting not working with Linear Regression?

Please help me understand why the Gradient Boosting technique is not working. Does GB use a Decision Tree regressor internally? (Confusion, please clarify.) I am trying ensemble techniques to get the best score for the current dataset. There also seems to be an issue with Recursive Feature Elimination (RFE): intuitively, the correlation matrix and scikit-learn's RFE should yield similar feature importances, but they are not doing so. Please help me understand why.

from IPython.display import clear_output
from io import StringIO
import pandas as pd
import requests
import numpy as np
import matplotlib.pyplot as plt

url = 'https://raw.githubusercontent.com/saqibmujtaba/Machine-Learning/DataFiles/50_Startups.csv'

s=requests.get(url).text
dataset=pd.read_csv(StringIO(s))

The correlation matrix clearly suggests that R&D Spend has the highest significance for predicting Profit (the label), followed by Marketing Spend.

from pandas.plotting import scatter_matrix  # pandas.tools.plotting in older pandas versions
scatter_matrix(dataset)
plt.show()

[Cross-correlation scatter-plot matrix of the dataset]
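To back the visual impression with numbers, a minimal check of the pairwise correlations against Profit (a sketch, assuming the same `dataset` loaded above; State is categorical and is excluded from the numeric correlation matrix):

corr = dataset.select_dtypes(include=[np.number]).corr()
print(corr['Profit'].sort_values(ascending=False))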

# Independent variables (all columns except the Profit label)
X = dataset.iloc[:, :-1].values

# Dependent variable (Profit)
Y = dataset.iloc[:, 4].values

Applying Label Encoding

from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])

Clearly, the LabelEncoder is working.

Output

[[165349.2 136897.8 471784.1 2L]
[162597.7 151377.59 443898.53 0L]
[153441.51 101145.55 407934.54 1L]
[144372.41 118671.85 383199.62 2L]
[142107.34 91391.77 366168.42 1L]]

Trying One Hot Encoding:

from sklearn.preprocessing import OneHotEncoder

# categorical_features exists only in older scikit-learn versions; newer versions use ColumnTransformer instead
onehotencoder = OneHotEncoder(categorical_features=[3])
X = onehotencoder.fit_transform(X).toarray()
np.set_printoptions(formatter={'float': '{: 0.0f}'.format})
print(X[0:5, :])

Output

[[ 0  0  1  165349  136898  471784]
[ 1  0  0  162598  151378  443899]
[ 0  1  0  153442  101146  407935]
[ 0  0  1  144372  118672  383200]
[ 0  1  0  142107  91392  366168]]

Avoiding the dummy variable trap

X = X[:, 1:]
np.set_printoptions(formatter={'float': '{: 0.0f}'.format})
print(X[0:5,:])

Output

[[ 0  1  165349  136898  471784]
[ 0  0  162598  151378  443899]
[ 1  0  153442  101146  407935]
[ 0  1  144372  118672  383200]
[ 1  0  142107  91392  366168]]

Firstly, even if R&D Spend is correctly ranked first, shouldn't it be followed by Marketing Spend? Also, why is the Profit feature part of the selection when I have clearly passed Y as the label in the linear regression fit? Am I missing something?

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

# Feature extraction: rank all features, i.e. continue the elimination until only one is left
rfe = RFE(estimator=lr, n_features_to_select=1)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
# Boolean array indicating whether an attribute was selected by RFE
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

names = dataset.columns.values
print(names)
print("Features sorted by their rank:")
print(sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names)))

Output

Num Features: 1
Selected Features: [ True False False False False]
Feature Ranking: [1 2 3 4 5]
['R&D Spend' 'Administration' 'Marketing Spend' 'State' 'Profit']
Features sorted by their rank:
[(1, 'R&D Spend'), (2, 'Administration'), (3, 'Marketing Spend'), (4, 'State'), (5, 'Profit')]

I tried this out on the Boston dataset and it seems to work there. Has scaling caused an issue here? Can you please help me understand what kind of scaling should be applied, and how I would determine that in future tasks?

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler().fit(X)
rescaledX = sc_X.fit_transform(X)

# Transform Y based on the X fitting
rescaledY = sc_X.transform(Y)

# Using KFold 

from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, random_state=1)

Choosing Boosting Model and Cross Validation

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(n_estimators=100, random_state=1)

results = cross_val_score(model, rescaledX, rescaledY, cv=kfold)
print(results)

[-5.28213131 -2.73927962 -7.55241606 -2.5951924 -2.51933385]

I am not able to understand what this result is giving. I thought it should give the average score of my model. Please correct me.
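For reference, `cross_val_score` returns one score per fold (for a regressor the default metric is its R² score), not an average; averaging has to be done explicitly. A minimal sketch, assuming the `results` array above:

# One R^2 score per fold was returned above; summarise across folds explicitly
print("Mean score: %.3f" % results.mean())
print("Std of scores: %.3f" % results.std())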

When gradient boosting is done with linear regression as the base model, it is nothing more than another linear model fitted on top of the existing linear model. This can intuitively be understood as adding something to the already-found coefficients, and if the linear regression has already found the best coefficients, it will be of no use.
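A minimal sketch of that intuition (hand-rolling one boosting step with a linear base learner on hypothetical synthetic data, not the asker's code): the model fitted on the residuals of an ordinary least-squares fit learns coefficients of essentially zero.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic linear data (for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = 4.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Stage 1: ordinary linear regression
stage1 = LinearRegression().fit(X_demo, y_demo)
residuals = y_demo - stage1.predict(X_demo)

# Stage 2: "boosting" a second linear model on the residuals
stage2 = LinearRegression().fit(X_demo, residuals)
print(stage2.coef_, stage2.intercept_)  # all approximately 0 -> nothing left to learn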

There are two advantages of boosting methods with linear regression: the first is being able to regularise the values of the coefficients, which helps in the case of overfitting; the second is when the data has some complex non-linear shape, where boosting methods help the model evolve slowly with the data.
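As a rough illustration of the second point (synthetic data and standard scikit-learn defaults, not part of the original question), tree-based gradient boosting can capture a non-linear relationship that a single linear regression cannot:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
X_demo = np.sort(rng.uniform(-3, 3, size=(300, 1)), axis=0)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=300)

lin = LinearRegression().fit(X_demo, y_demo)
gbr = GradientBoostingRegressor(n_estimators=200, random_state=1).fit(X_demo, y_demo)

# R^2 on the training data: the linear model plateaus, the boosted trees follow the curve
print("Linear R^2:  %.3f" % lin.score(X_demo, y_demo))
print("Boosted R^2: %.3f" % gbr.score(X_demo, y_demo))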

One more aspect of your question: if you are looking for ensembling methods for linear regression that make use of many models at a time, you can look at regularised regression using packages like glmnet. You can also predict using many different models and average their predictions.
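A minimal sketch of that averaging idea in Python (glmnet is an R package; here Ridge and Lasso from scikit-learn stand in as regularised linear models, an assumption rather than the exact suggestion above):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Assumes rescaledX and Y from the question are available
models = [LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)]
predictions = []
for m in models:
    m.fit(rescaledX, Y)
    predictions.append(m.predict(rescaledX))

# Simple ensemble: average the predictions of the individual models
ensemble_pred = np.mean(predictions, axis=0)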

It is just that linear regression isn't an appropriate base learner for gradient boosting.

GB works this way: a model is fitted to the data, then the next model is built on the residuals of the previous model. But the residuals of a linear model usually cannot be fitted by another linear model.

Also, even if you build a lot of subsequent linear models, they can still be represented as a single linear model (by adding up all the intercepts and coefficients).
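A short sketch of that collapse (synthetic data and a hand-rolled boosting loop with a shrinkage factor, not scikit-learn's GradientBoostingRegressor, which only supports tree base learners): summing the stage-wise coefficients and intercepts reproduces a single linear model.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(2)
X_demo = rng.rand(200, 2)
y_demo = 3.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=0.05, size=200)

learning_rate = 0.5
coef_sum, intercept_sum = np.zeros(2), 0.0
residual = y_demo.copy()
for _ in range(20):
    # Each stage is a linear fit to the current residuals, added with shrinkage
    stage = LinearRegression().fit(X_demo, residual)
    coef_sum += learning_rate * stage.coef_
    intercept_sum += learning_rate * stage.intercept_
    residual -= learning_rate * stage.predict(X_demo)

direct = LinearRegression().fit(X_demo, y_demo)
print(coef_sum, intercept_sum)          # approximately the same as ...
print(direct.coef_, direct.intercept_)  # ... a single linear regression fit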
