How to pass a single-column dependent variable to train a Linear Regression model?

Question

I am new to machine learning and recently started a course on the Simple Linear Regression model.

I have a dataset where, except for a column id (integer type), all the columns are of String datatype. I have loaded it into a pandas DataFrame and selected columns from it by index, as shown below.

The pandas DataFrame has 32 independent-variable columns, and the 33rd column is the dependent variable column, which just says YES or NO. Using all the independent variables (columns 0 to 31), I am trying to find out whether I can predict the values in column 32, which is my dependent variable.

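import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

data = psyco.read_into_pandas()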
X = data.iloc[:, 1:33].values
Y = data.iloc[:, 32].values

# Impute missing values with the most frequent value of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent', add_indicator=True)

# Fit the imputer on the selected columns
imputer.fit(X[:, 1:33])

# Transform the data.
X[:, 1:33] = imputer.transform(X[:, 1:33])

# One hot encoding
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

# Label Encoder
le = LabelEncoder()
Y = le.fit_transform(Y)

# Split data into train and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

Before passing X_train and Y_train to the model, I printed Y_train and I can see that it contains an array of integers, as shown in the image below.

But when I pass X_train and Y_train to my LinearRegression(), I get an error that says:

ValueError: could not convert string to float: 'yes'

Full error:

Traceback (most recent call last):
  File "/Some/Path/mltask.py", line 52, in task_2
    lr.fit(X_train, Y_train)
  File "/Some/Path/venv/lib/python3.9/site-packages/sklearn/linear_model/_base.py", line 684, in fit
    X, y = self._validate_data(
  File "/Some/Path/venv/lib/python3.9/site-packages/sklearn/base.py", line 596, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/Some/Path/venv/lib/python3.9/site-packages/sklearn/utils/validation.py", line 1074, in check_X_y
    X = check_array(
  File "/Some/Path/venv/lib/python3.9/site-packages/sklearn/utils/validation.py", line 856, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
ValueError: could not convert string to float: 'yes'

What I don't understand is that when I print Y_train I see integers in the array, yet the regression says it can't convert a string to float.

Could anyone let me know if I missed a step in between, and how I can correct my mistake? Any help is massively appreciated.

Answer 1

I think your untransformed data might be in X_train, not Y_train.

Explanation: You split your data the following way:
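One quick way to check this is to look for leftover strings in the feature matrix itself. The snippet below is only a sketch: `columns_with_strings` is a hypothetical helper (not part of scikit-learn), the small `X_train` is made-up data, and it assumes the features are a dense 2-D array rather than a sparse matrix.

```python
import numpy as np


def columns_with_strings(X):
    """Return the positions of columns that still contain at least one str value.

    Hypothetical helper for illustration; assumes X is a dense 2-D array-like.
    """
    X = np.asarray(X, dtype=object)
    return [
        j for j in range(X.shape[1])
        if any(isinstance(value, str) for value in X[:, j])
    ]


# Made-up stand-in for X_train: two encoded columns and one raw string column.
X_train = np.array([[1.0, 0.0, "yes"], [0.0, 1.0, "no"]], dtype=object)

bad = columns_with_strings(X_train)
print(bad)  # [2]
```

If a check like this returns a non-empty list for the real X_train, those positions are the columns that never went through an encoder.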

X = data.iloc[:, 1:33].values
Y = data.iloc[:, 32].values

This means that Y is included in X. In Python, indices start at 0 and the end of a slice is excluded, so 1:33 selects columns 1 to 32. If your dependent variable is the last column, you want to split like this instead:

X = data.iloc[:, 0:32].values
Y = data.iloc[:, 32]

Otherwise you are trying to predict Y while Y is itself in the feature set, and you would not need Linear Regression for that.
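The effect of the two slices can be checked on a stand-in frame without the real data. This is a sketch under assumptions: the column names (`id`, `feature_1` ... `feature_31`, `target`) and the zero values are made up; only the 33-column layout follows the question.

```python
import pandas as pd

# Hypothetical stand-in for the real data: 33 columns at positions 0..32,
# with the dependent variable in the last position (32).
columns = ["id"] + [f"feature_{i}" for i in range(1, 32)] + ["target"]
data = pd.DataFrame([[0] * 33], columns=columns)

as_in_question = data.iloc[:, 1:33]  # positions 1..32 (end of slice excluded)
as_in_answer = data.iloc[:, 0:32]    # positions 0..31

print("target" in as_in_question.columns)  # True: Y leaks into X
print("target" in as_in_answer.columns)    # False
print("id" in as_in_question.columns)      # False: the id column is dropped
```

Both slices return 32 columns, which is why the shifted slice is easy to miss when only looking at the shape.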

In the rest of your code, the OneHotEncoder is applied only to columns 1 to 31 of X, so with the shifted slice one string column of X is never passed through the encoder. This leaves raw strings such as "yes" and "no" in X_train, and that is what LinearRegression fails to convert to float.
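Putting the pieces together, a corrected version of the preprocessing could look roughly like the sketch below. Several things here are assumptions rather than part of the question: the toy DataFrame and its column names are made up, the imputer is placed inside a pipeline, `add_indicator` is left out so the column count stays unchanged, and `sparse_threshold=0` is set so that ColumnTransformer returns a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy stand-in for the real dataset (made-up values): an integer id, two string
# features with one missing entry, and a yes/no target in the last column.
data = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "colour": ["red", "blue", np.nan, "red", "blue", "red"],
    "size": ["S", "M", "M", "L", "S", "M"],
    "target": ["yes", "no", "no", "yes", "no", "yes"],
})

n_features = data.shape[1] - 1
X = data.iloc[:, 0:n_features].values  # every column except the last one
Y = data.iloc[:, n_features].values    # only the last column

# Impute, then one-hot encode, every string column (positions 1..n_features-1).
string_cols = list(range(1, n_features))
encode_strings = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
ct = ColumnTransformer(
    transformers=[("encoder", encode_strings, string_cols)],
    remainder="passthrough",  # the integer id column is passed through as-is
    sparse_threshold=0,       # always return a dense array
)
X = ct.fit_transform(X).astype(float)

# Encode the yes/no target as integers (classes are sorted, so no=0 and yes=1).
Y = LabelEncoder().fit_transform(Y)

lr = LinearRegression().fit(X, Y)
print(X.shape, X.dtype)
```

With the target no longer inside X and every string column encoded, the fit call receives a purely numeric array. The id column is kept as a feature only because the slice 0:32 above includes it; whether it belongs in the model is a separate choice.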

Answer 1
0 2022-06-03 12:59:59
