
Python Crazy High Coefficients With Linear Regression

So I fit a linear regression model on this dataset: https://www.kaggle.com/shree1992/housedata

After cleaning the data and building my model, I got some crazy high coefficients that I didn't expect.

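I googled the problem and, based on what I found, tried ridge regression. It did fix the crazy coefficients, but the score and MAE were almost identical (plain linear regression was actually slightly better on both MAE and score), which suggests the cause is not overfitting. So why am I getting these huge coefficients, and how should I interpret them? Thanks in advance. My coefficients and code are below.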

Coefficients:

sqft_living  :: -20531660933516.066
floors  :: -46157.99116169465
bedrooms  :: -35148.64994889144
yr_built  :: -110.275390625
sqft_lot  :: -0.01336842838432517
yr_renovated  :: 13901.669921875
bathrooms  :: 22068.444163259817
condition  :: 28854.36132510344
view  :: 54609.32181396632
waterfront  :: 619987.8770517551
statezip_WA 98070  :: 51720518.26940918
statezip_WA 98023  :: 51733793.98413086
statezip_WA 98198  :: 51745527.19320679
statezip_WA 98092  :: 51753612.19506836
statezip_WA 98003  :: 51768969.80859375
statezip_WA 98057  :: 51774754.2020874
statezip_WA 98032  :: 51777293.54980469
statezip_WA 98188  :: 51780926.42871094
statezip_WA 98022  :: 51785464.6875
statezip_WA 98042  :: 51788032.485961914
statezip_WA 98001  :: 51798657.185058594
statezip_WA 98030  :: 51800982.91894531
statezip_WA 98002  :: 51807063.37084961
statezip_WA 98038  :: 51818086.75805664
statezip_WA 98058  :: 51818726.060058594
statezip_WA 98031  :: 51820966.17700195
statezip_WA 98055  :: 51836975.10852051
statezip_WA 98178  :: 51839662.78881836
statezip_WA 98059  :: 51845304.94116211
statezip_WA 98019  :: 51849298.035583496
statezip_WA 98065  :: 51858962.752441406
statezip_WA 98014  :: 51862571.193847656
statezip_WA 98148  :: 51872288.3659668
statezip_WA 98166  :: 51878712.109375
statezip_WA 98056  :: 51890492.997558594
statezip_WA 98045  :: 51890671.47558594
statezip_WA 98168  :: 51909556.58944702
statezip_WA 98146  :: 51923932.966308594
statezip_WA 98011  :: 51925708.75717163
statezip_WA 98028  :: 51930531.6730957
statezip_WA 98155  :: 51933038.31750488
statezip_WA 98024  :: 51933207.13555908
statezip_WA 98108  :: 51935337.22363281
statezip_WA 98077  :: 51937928.41999817
statezip_WA 98072  :: 51939094.63574219
statezip_WA 98106  :: 51946079.88293457
statezip_WA 98027  :: 51954189.55102539
statezip_WA 98133  :: 51968441.83276367
statezip_WA 98118  :: 51972078.98779297
statezip_WA 98074  :: 51972640.670410156
statezip_WA 98125  :: 51985392.0078125
statezip_WA 98034  :: 51989931.86279297
statezip_WA 98053  :: 51994949.201171875
statezip_WA 98075  :: 51996895.56713867
statezip_WA 98126  :: 52003476.768066406
statezip_WA 98008  :: 52019588.31152344
statezip_WA 98029  :: 52033227.60961914
statezip_WA 98177  :: 52044918.458618164
statezip_WA 98136  :: 52054739.052734375
statezip_WA 98052  :: 52055053.704589844
statezip_WA 98006  :: 52077050.865234375
statezip_WA 98007  :: 52084987.728515625
statezip_WA 98144  :: 52104137.84765625
statezip_WA 98116  :: 52123261.3046875
statezip_WA 98033  :: 52128846.232666016
statezip_WA 98115  :: 52137801.478027344
statezip_WA 98117  :: 52140383.259521484
statezip_WA 98005  :: 52147522.69140625
statezip_WA 98122  :: 52159159.841552734
statezip_WA 98103  :: 52160013.99584961
statezip_WA 98107  :: 52176913.24609375
statezip_WA 98199  :: 52218928.334228516
statezip_WA 98102  :: 52277970.43017578
statezip_WA 98040  :: 52319189.98120117
statezip_WA 98119  :: 52323874.4597168
statezip_WA 98105  :: 52360431.115722656
statezip_WA 98109  :: 52381532.43066406
statezip_WA 98112  :: 52410056.1015625
statezip_WA 98004  :: 52665837.48083496
statezip_WA 98039  :: 52891510.521728516
sqft_basement  :: 20531660933682.504
sqft_above  :: 20531660933785.93

Code:

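import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# `houses` is the DataFrame loaded from the Kaggle dataset linked above.
# Drop implausible rows and the columns that are not used as features.
houses_preprocessed = houses[
    (houses.price < 1.2 * 10**7) &
    (houses.bedrooms > 0) &
    (houses.bedrooms <= 6) &
    (houses.bathrooms > 0) &
    (houses.price > 8000)
].drop(columns=['country', 'date', 'street', 'city'])

# Turn yr_renovated into a 0/1 "was renovated" flag.
houses_preprocessed.loc[houses_preprocessed['yr_renovated'] < 1, 'yr_renovated'] = 0
houses_preprocessed.loc[houses_preprocessed['yr_renovated'] > 1, 'yr_renovated'] = 1

# Keep only the zip codes that occur more than 10 times.
zip_counts = houses_preprocessed['statezip'].value_counts()
houses_preprocessed = houses_preprocessed[
    houses_preprocessed['statezip'].isin(zip_counts.index[zip_counts > 10])
]

X = houses_preprocessed.drop(columns=['price'])
y = houses_preprocessed['price']

# One-hot encode the remaining categorical column (statezip).
X = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

reg = LinearRegression()
reg.fit(X_train, y_train)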

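What you are running into is multicollinearity. If two or more of your predictors are highly correlated, the regression model effectively needs only one of them, and the coefficients of the others are no longer pinned down by the data, so they can end up with arbitrary, meaningless values. If you look at the data: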

X = houses_preprocessed.drop(columns=['price'])
y = houses_preprocessed['price']

import seaborn as sns

sns.clustermap(X.select_dtypes("number").corr(method="spearman"),figsize=(6, 6))

(Figure: clustered heatmap of the Spearman correlations between the numeric features.)

These three variables are highly correlated:

sns.pairplot(X[['bathrooms','sqft_above','sqft_living']])

(Figure: pair plot of bathrooms, sqft_above and sqft_living.)
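The coefficients in the question for sqft_living, sqft_basement and sqft_above are all about 2.05e13 in magnitude, with opposite signs, which is the pattern you would expect if one column is an exact linear combination of the others. Here is a minimal sketch of that effect on synthetic data. It assumes that sqft_living is exactly sqft_above + sqft_basement (worth checking on the real data), and every number in it is made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the housing features (values are made up).
X = pd.DataFrame({
    "sqft_above": rng.integers(500, 3000, size=n).astype(float),
    "sqft_basement": rng.integers(0, 1500, size=n).astype(float),
})
# Assumption: living area is exactly the sum of the other two columns.
X["sqft_living"] = X["sqft_above"] + X["sqft_basement"]

# Three columns, but fewer independent directions than columns.
rank = np.linalg.matrix_rank(X.to_numpy())
print("rank:", rank, "columns:", X.shape[1])

# Adding any multiple of this direction to the coefficients leaves
# every prediction unchanged, so least squares cannot pick one answer.
null_direction = np.array([1.0, 1.0, -1.0])  # (above, basement, living)
coef_small = np.array([100.0, 50.0, 150.0])  # hypothetical coefficients
coef_huge = coef_small + 1e9 * null_direction

pred_small = X.to_numpy() @ coef_small
pred_huge = X.to_numpy() @ coef_huge
print("same predictions:", np.allclose(pred_small, pred_huge))
```

On the real frame, a comparison such as `(X.sqft_living == X.sqft_above + X.sqft_basement).mean()` would show how often the identity holds.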

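So we keep only one of these three variables (sqft_living). Finally, because you one-hot encoded statezip, you can't also fit an intercept: the statezip dummy columns sum to 1 in every row, so together they reproduce the intercept column exactly: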

X = pd.get_dummies(X.drop(columns=['bathrooms','sqft_above']))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
reg = LinearRegression(fit_intercept=False)
reg.fit(X_train, y_train)
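The intercept issue can be seen on a toy example. The sketch below is not part of the original answer: it uses a made-up six-row frame to show that a full one-hot encoding plus a column of ones is rank deficient, and that `drop_first=True` in `pd.get_dummies` is an alternative that keeps the intercept by treating the first level as the baseline.

```python
import numpy as np
import pandas as pd

# Toy data; the zip codes are only labels for illustration.
df = pd.DataFrame({
    "statezip": ["WA 98004", "WA 98039", "WA 98004",
                 "WA 98070", "WA 98039", "WA 98070"]
})

full = pd.get_dummies(df, dtype=float)                      # one column per zip code
reduced = pd.get_dummies(df, drop_first=True, dtype=float)  # first level is the baseline

# Each row of the full encoding sums to 1, which is exactly
# what an intercept column looks like.
row_sums = full.sum(axis=1)

ones = np.ones((len(df), 1))
rank_full = np.linalg.matrix_rank(np.hstack([ones, full.to_numpy()]))
rank_reduced = np.linalg.matrix_rank(np.hstack([ones, reduced.to_numpy()]))

print("full:   ", rank_full, "of", full.shape[1] + 1, "columns")
print("reduced:", rank_reduced, "of", reduced.shape[1] + 1, "columns")
```

With `fit_intercept=False` and the full encoding, each statezip coefficient acts as that zip code's own baseline. With `drop_first=True` and an intercept, the coefficients are instead differences from the dropped zip code.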

Check R²:

reg.score(X_test,y_test)
0.7621069304476887

Given the range of the y values, the coefficients now look reasonable:

res = pd.DataFrame({'coef':reg.coef_},index=X.columns)
res.reindex(res.coef.abs().sort_values().index)


                              coef
sqft_lot                 -0.023554
yr_built                 54.699771
sqft_basement          -100.401752
sqft_living             278.836773
statezip_WA 98006       565.521930
...                            ...
statezip_WA 98023   -342256.082284
statezip_WA 98070   -353819.063160
statezip_WA 98004    589945.748620
waterfront           621313.209967
statezip_WA 98039    816056.566554
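This also fits what the question observed with ridge regression: the coefficients became sane while the score barely moved. The sketch below uses synthetic data, not the housing set, and the `alpha=1.0` value and feature names are arbitrary choices. It compares ridge on a redundant feature set with ordinary least squares after dropping the redundant column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
above = rng.normal(size=n)
basement = rng.normal(size=n)
living = above + basement  # exactly redundant third feature
y = 3.0 * above + 5.0 * basement + rng.normal(scale=0.5, size=n)

X_all = np.column_stack([above, basement, living])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y, test_size=0.33, random_state=42
)

# Ridge on all three columns: the penalty favours a small-norm
# solution among the many coefficient vectors that fit equally well.
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)

# Ordinary least squares after dropping the redundant column.
ols = LinearRegression().fit(X_tr[:, :2], y_tr)

score_ridge = ridge.score(X_te, y_te)
score_ols = ols.score(X_te[:, :2], y_te)

print("ridge coefficients:", ridge.coef_)
print("OLS coefficients:  ", ols.coef_)
print("R2 ridge vs OLS:   ", score_ridge, score_ols)
```

Both models describe the same fitted relationship, so the test score is a poor indicator of whether the individual coefficients are interpretable.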
