PCA to select features for Linear regression in Pipeline
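I have a dataset with some numeric and some categorical variables. I tried preprocessing the categorical variables with pandas get_dummies so that I could scale the data with StandardScaler. However, some columns (mostly categorical ones) also have missing values, so I use an imputer in the pipeline, yet it still raises the error
"Input contains NaN, infinity or a value too large for dtype('float64')".
My data preprocessing code is shown below.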
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
y = df["SalePrice"]
X = df.drop(["SalePrice", "PoolQC"], axis=1)
X_dummies = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X_dummies, y, test_size=0.30, random_state=0)
pipeline = Pipeline([("imputer", SimpleImputer(missing_values='NaN', strategy='most_frequent')), ('scaling', StandardScaler()), ('pca', PCA(n_components=157, whiten=True)), ('regr', LinearRegression())])
pipeline.fit(X_train, y_train)  # raises the error quoted above
The dataset df has the following columns.
df.info()
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
Am I on the right track? How should I handle this error? Can I implement this inside the pipeline somehow?
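Yes, you can. Once the pipeline is fitted, the PCA step is accessible by its step name, either as pipeline['pca'] or as pipeline.named_steps['pca'] (the name is case-sensitive, so it must match the 'pca' used when building the pipeline).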
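import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('https://raw.githubusercontent.com/eric-bunch/boston_housing/master/boston.csv')
y = df['MDEV'].astype(float) * 1000
X = df.drop(["MDEV"], axis=1)
X_dummies = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X_dummies, y, test_size=0.30, random_state=0)
pipeline = Pipeline([("imputer", SimpleImputer(strategy='most_frequent')), ('scaling', StandardScaler()), ('pca', PCA(n_components=13, whiten=True)), ('regr', LinearRegression())])  # imputer uses the default missing_values (np.nan)
pipeline.fit(X_train, y_train)
pca = pipeline.named_steps['pca']  # the fitted PCA step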
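As for the error in the question, a likely cause (depending on the scikit-learn version) is the missing_values='NaN' argument. That is the literal string 'NaN', not np.nan, so the imputer does not treat real NaN values as missing and rejects them instead. Leaving missing_values at its default, as in the pipeline above, avoids this. Below is a minimal sketch of the idea on made-up data. The names toy, features, target and model, and the column names, are hypothetical stand-ins, not the housing dataset from the question.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the housing DataFrame.
rng = np.random.default_rng(0)
frontage = rng.normal(70.0, 10.0, size=60)
frontage[::7] = np.nan  # a numeric column with gaps
toy = pd.DataFrame({
    "area": rng.normal(1500.0, 300.0, size=60),
    "frontage": frontage,
    "zone": rng.choice(["RL", "RM", "FV"], size=60),
})
target = 100.0 * toy["area"] + rng.normal(0.0, 1000.0, size=60)

# get_dummies only expands the categorical column; the NaNs in the
# numeric column are still there and must be filled by the imputer.
features = pd.get_dummies(toy, dtype=float)

model = Pipeline([
    # missing_values is left at its default (np.nan), not the string 'NaN'.
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("scaling", StandardScaler()),
    ("pca", PCA(n_components=3, whiten=True)),
    ("regr", LinearRegression()),
])
model.fit(features, target)

# Inspect the fitted PCA step by its name in the pipeline.
pca = model.named_steps["pca"]
explained = pca.explained_variance_ratio_
print(explained.cumsum())  # cumulative share of variance kept
```

The cumulative explained variance is one way to judge how many components are worth keeping. Note that PCA builds new combined features rather than picking a subset of the original columns.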