简体   繁体   English

用于管道中线性回归的 PCA 到 select 功能

[英]PCA to select features for Linear regression in Pipeline

I have a dataset with some numeric and categorical variables.我有一个包含一些数字和分类变量的数据集。 I tried to preprocess categorical variables with pandas dummies in order to scale the data with StandardScaler, however, some columns also have missing values (mostly categorical) so I used imputer in the pipeline though it still generates the error我尝试使用 pandas 假人预处理分类变量,以便使用 StandardScaler 缩放数据,但是,有些列也有缺失值(主要是分类),所以我在管道中使用 imputer 虽然它仍然会产生错误

"Input contains NaN, infinity or a value too large for dtype('float64')".

My code to preprocess the data is shown below.我的数据预处理代码如下所示。

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

y = df["SalePrice"]
X = df.drop(["SalePrice", "PoolQC"], axis = 1)
X_dummies = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(X_dummies, y, test_size=0.30, random_state=0)

pipeline = Pipeline([("imputer", SimpleImputer(missing_values='NaN', strategy='most_frequent')), ('scaling', StandardScaler()),('pca', PCA(n_components=157, whiten=True)), ('regr',LinearRegression())])
pipeline.fit(X_train, y_train)

The dataset df has the following columns.数据集 df 具有以下列。

df.info()


Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     1452 non-null   object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

Am I on the right way or not?我走对了吗? How can I deal with this error?我该如何处理这个错误? Can I implement it somehow in the pipeline?我可以在管道中以某种方式实现它吗?

yes, and you can access the pca by pipeline['PCA']是的,您可以通过 pipeline['PCA'] 访问 pca

   df=pd.read_csv('https://raw.githubusercontent.com/eric-bunch/boston_housing/master/boston.csv')


y = df['MDEV'].astype(float)*1000

X = df.drop(["MDEV"], axis = 1)
X_dummies = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(X_dummies, y, test_size=0.30, random_state=0)

pipeline = Pipeline([("imputer", SimpleImputer(strategy='most_frequent')), ('scaling', StandardScaler()),('pca', PCA(n_components=13,whiten=True)), ('regr',LinearRegression())])
pipeline.fit(X_train, y_train)

pca = pipeline.named_steps['pca']

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM