
How to pass the single columnar dependent variable to train the linear regression model?

I am new to machine learning and recently started a course on the simple linear regression model.

I have a dataset where, except for the id column (integer type), all the columns are of string datatype. I have loaded it into a pandas dataframe and selected columns from it by index, as shown below.

The pandas dataframe has 32 independent-variable columns, and the 33rd column is the dependent variable, which just says YES or NO. Using all the independent variables (columns 0 to 31), I am trying to find out whether I can predict the values in column 32, which is my dependent variable.

data = psyco.read_into_pandas()
X = data.iloc[:, 1:33].values
Y = data.iloc[:, 32].values

# Add missing values
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent', add_indicator=True)

# Fit the rows and columns into the imputer
imputer.fit(X[:, 1:33])

# Transform the data.
X[:, 1:33] = imputer.transform(X[:, 1:33])

# One hot encoding
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

# Label Encoder
le = LabelEncoder()
Y = le.fit_transform(Y)

# Split data into train and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

Before passing X_train and Y_train to the model, I printed Y_train and can see that it contains an array of integers, as shown in the image below.

[Image: printed Y_train, an array of integers]

But when I pass X_train and Y_train to my LinearRegression(), I get an error that says:

ValueError: could not convert string to float: 'yes'

Full error:

Traceback (most recent call last):
  File "/Some/Path/mltask.py", line 52, in task_2
    lr.fit(X_train, Y_train)
  File "/Some/Path/venv/lib/python3.9/site-packages/sklearn/linear_model/_base.py", line 684, in fit
    X, y = self._validate_data(
  File "/Some/Path/venv/lib/python3.9/site-packages/sklearn/base.py", line 596, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/Some/Path/venv/lib/python3.9/site-packages/sklearn/utils/validation.py", line 1074, in check_X_y
    X = check_array(
  File "/Some/Path/venv/lib/python3.9/site-packages/sklearn/utils/validation.py", line 856, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
ValueError: could not convert string to float: 'yes'

What I don't understand is that when I print Y_train I see integers in the array, but the regression says it can't convert a string to float.

Could anyone let me know whether I missed a step in between, and how I can correct my mistake? Any help is massively appreciated.

I think your untransformed data might be in X_train, not Y_train.
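One way to check where the strings are is to scan each column of the feature array for str values. The sketch below is only an illustration: `columns_with_strings` is a hypothetical helper, and the small `X_train` and `Y_train` arrays are made-up stand-ins for the ones in the question. It assumes the features are a dense 2-D array.

```python
import numpy as np


def columns_with_strings(arr):
    """Return the positions of columns in a 2-D array that hold any str value."""
    arr = np.asarray(arr, dtype=object)
    return [
        j for j in range(arr.shape[1])
        if any(isinstance(value, str) for value in arr[:, j])
    ]


# Made-up stand-ins for the arrays in the question (assumption, not real data).
X_train = np.array([[1.0, 0.0, "yes"], [0.0, 1.0, "no"]], dtype=object)
Y_train = np.array([1, 0])

string_cols = columns_with_strings(X_train)
print("columns of X_train that still contain strings:", string_cols)
print("dtype of Y_train:", Y_train.dtype)
```

If the list printed for X_train is not empty while Y_train has an integer dtype, the string in the error message is coming from the features rather than from the target.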

Explanation: You split your data the following way:

X = data.iloc[:, 1:33].values
Y = data.iloc[:, 32].values

This means that Y is included in X. In Python, indices start at 0 and the stop of a slice is exclusive, so 1:33 covers columns 1 to 32. If your dependent variable is the last column (index 32), you want to split like this instead:

X = data.iloc[:, 0:32].values
Y = data.iloc[:, 32]

Otherwise you will be trying to predict Y with Y itself in the feature set, which you would not need linear regression to achieve.
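The overlap is easy to see on a smaller frame. The following is a minimal sketch with a made-up four-column DataFrame (the column names are assumptions for illustration) whose target sits at position 3, so `1:4` plays the role of `1:33` and `0:3` plays the role of `0:32`.

```python
import pandas as pd

# Made-up frame: an id, two string features, and a yes/no target in the last column.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "f1": ["a", "b", "a"],
    "f2": ["x", "x", "y"],
    "target": ["yes", "no", "yes"],
})

# Same pattern as the original split: the slice runs through the target's position.
leaky = df.iloc[:, 1:4]
# Same pattern as the corrected split: the slice stops just before the target.
clean = df.iloc[:, 0:3]
y = df.iloc[:, 3]

print(list(leaky.columns))  # the target is among the features
print(list(clean.columns))  # the target is excluded
```

Printing the column names of a slice like this is a cheap sanity check before calling `.values`, because the names are lost once the frame becomes a NumPy array.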

In the rest of your code, the OneHotEncoder only receives positions 1 to 31 of X, so one column of X (position 0, which with your original slice is a string column) bypasses the encoder through remainder='passthrough'. That is how raw "yes"/"no" strings end up in X_train.
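Putting the pieces together, here is a rough end-to-end sketch of the corrected flow on made-up data. The frame, its column names, and the use of `sparse_threshold=0` to force a dense result are assumptions for illustration; the split, the encoders, and LinearRegression follow the question. The positions are computed from the frame's width instead of being hard-coded.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up data: an integer id, two string features, and a yes/no target last.
data = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "colour": ["red", "blue", "red", "green", "blue", "red"],
    "size": ["S", "M", "L", "S", "M", "L"],
    "target": ["yes", "no", "yes", "no", "no", "yes"],
})
n_cols = data.shape[1]

X = data.iloc[:, 0:n_cols - 1].values  # every column except the target
Y = data.iloc[:, n_cols - 1].values    # the target only

# One-hot encode every string column of X (positions 1 .. n_cols - 2).
# Position 0 is the integer id, so passing it through is safe.
categorical = list(range(1, n_cols - 1))
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), categorical)],
    remainder="passthrough",
    sparse_threshold=0,  # assumption: ask for a dense array
)
# Converting to float fails loudly here if any string slipped through.
X_encoded = np.asarray(ct.fit_transform(X), dtype=float)

# LabelEncoder sorts the classes, so 'no' becomes 0 and 'yes' becomes 1.
Y_encoded = LabelEncoder().fit_transform(Y)

lr = LinearRegression()
lr.fit(X_encoded, Y_encoded)
print(X_encoded.shape, Y_encoded)
```

The explicit conversion to float before fitting is a deliberate choice in this sketch: it surfaces a leftover string column at the preprocessing step instead of inside `lr.fit`.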
