Linear regression in Python using sklearn

Question

I am trying to do linear regression in Python.

For example:

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

x = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
y = [[7], [9], [13], [17.5], [18]]
model = LinearRegression()
model.fit(x, y)
x_test = [[8, 2]]

Now my example_data looks like this:

inches city  Pizza_Price
  5       A        10
  6       B        12

inches is clearly a number, but city is not.

How can I convert a categorical parameter such as city into numbers for the calculation?

Answer 1

Code:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
cities = ['A','B','C','B'] # for example

le.fit(cities)
cities = le.transform(cities)
print(cities)

Output:

[0 1 2 1]

LabelEncoder documentation
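To see which integer was assigned to which city, and to get the city names back later, the fitted encoder can be inspected. This is a minimal sketch using the same example list as above; the variable names `mapping` and `restored` are only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

cities = ['A', 'B', 'C', 'B']  # same example list as above

le = LabelEncoder()
codes = le.fit_transform(cities)

# classes_ holds the sorted unique labels; a label's position is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
restored = [str(label) for label in le.inverse_transform(codes)]
print(restored)
```

Keeping the fitted `le` object around matters: new data has to be transformed with the same encoder so that each city keeps the same code.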

Answer 2

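You will want to use pd.get_dummies() to convert each city into binary (0/1) columns. A label encoder assigns a single integer to each value of the variable, which makes the regression formula hard to interpret and may even bias it, because the model treats those integers as ordered magnitudes. Remember to drop one of the dummy variables to avoid multicollinearity.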
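As a rough sketch of this suggestion: the rows below are made up for illustration in the same format as example_data, and `drop_first=True` is one way to drop a dummy column (here it drops city A, which then acts as the baseline).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative rows in the same format as example_data (values are assumptions)
df = pd.DataFrame({
    'inches': [5, 6, 7, 8, 9, 10, 11],
    'city': ['A', 'B', 'C', 'D', 'B', 'C', 'D'],
    'Pizza_Price': [10, 12, 15, 11, 12, 17, 16],
})

# One 0/1 column per city; drop_first=True removes city_A to avoid multicollinearity
X = pd.get_dummies(df[['inches', 'city']], columns=['city'],
                   drop_first=True, dtype=int)
y = df['Pizza_Price']

model = LinearRegression().fit(X, y)

# Each city_* coefficient is a price offset relative to the dropped city A
for name, coef in zip(X.columns, model.coef_):
    print(name, round(float(coef), 3))
```

With this encoding each city gets its own coefficient, so the model no longer assumes that city D is "three times" city B, as it implicitly would with integer labels.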

Answer 3

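From the data shown in example_data, it looks like you are working with a Pandas DataFrame. So I will suggest another possible approach to your question.

Here is some data I generated in the same format as yours, but with extra rows: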

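# Imports used throughout this answer
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

d = [
    ['inches','city','Pizza_Price'],
    [5,'A',10],
    [6,'B',12],
    [7,'C',15],
    [8,'D',11],
    [9,'B',12],
    [10,'C',17],
    [11,'D',16]
    ]
df = pd.DataFrame(d[1:], columns=d[0])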

print(df)
   inches city  Pizza_Price
0       5    A           10
1       6    B           12
2       7    C           15
3       8    D           11
4       9    B           12
5      10    C           17
6      11    D           16

As @Wen-Ben suggested, the city column can be converted to integers using LabelEncoder (as shown in this SO post).

# Replace each city name with its integer label (A -> 0, B -> 1, ...)
df['city'] = LabelEncoder().fit_transform(df['city'])

print(df)
   inches  city  Pizza_Price
0       5     0           10
1       6     1           12
2       7     2           15
3       8     3           11
4       9     1           12
5      10     2           17
6      11     3           16

Step 1. Perform a train-test split to get the training and testing data X_train, y_train, etc.

features = ['inches', 'city']
target = 'Pizza_Price'

X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)

# (OPTIONAL) Check number of rows in X and y of each split
print(len(X_train), len(y_train))
print(len(X_test), len(y_test))
4 4
3 3

Step 2. (Optional) Add a column to your source DataFrame (example_data) to show which rows were used for training and which for testing

df['Type'] = 'Test'
df.loc[X_train.index, 'Type'] = 'Train'

Step 3. Instantiate a LinearRegression model and train it on the training dataset - see this link in the sklearn docs

model = LinearRegression()
model.fit(X_train, y_train)

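Step 4. Now make out-of-sample predictions on the test data and (optionally) append the predicted values as a separate column to example_data

Rows used in the training dataset will have no prediction, so they will be assigned NaN
Rows used in the test dataset will have a prediction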

df['Predicted_Pizza_Price'] = np.nan
df.loc[X_test.index, 'Predicted_Pizza_Price'] = model.predict(X_test)

print(df)
   inches  city  Pizza_Price   Type  Predicted_Pizza_Price
0       5     0           10   Test                   11.0
1       6     1           12   Test                   11.8
2       7     2           15  Train                    NaN
3       8     3           11  Train                    NaN
4       9     1           12  Train                    NaN
5      10     2           17   Test                   14.0
6      11     3           16  Train                    NaN

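Step 5. Generate model evaluation metrics (see point 15 onwards here)

We will generate a Pandas DataFrame that shows both (a) the model evaluation metrics and (b) the model attributes: the linear regression coefficients and the intercept
We will first build a Python dictionary holding all of these values and then convert it to a Pandas DataFrame

Create an empty dictionary to hold the model attributes (coefficients, intercept) and the evaluation metrics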

dict_summary = {}

Append the coefficients and the intercept to the dictionary

for m, feature in enumerate(features):
    dict_summary['Coefficient ({})'.format(feature)] = model.coef_[m]
dict_summary['Intercept'] = model.intercept_

Append the evaluation metrics to the dictionary

y_test = df.loc[X_test.index, 'Pizza_Price'].values
y_pred = df.loc[X_test.index, 'Predicted_Pizza_Price'].values
mse = metrics.mean_squared_error(y_test, y_pred)
dict_summary['Mean Absolute Error (MAE)'] = metrics.mean_absolute_error(y_test, y_pred)
dict_summary['Mean Squared Error (MSE)'] = mse
dict_summary['Root Mean Squared Error (RMSE)'] = np.sqrt(mse)

Convert the dictionary into a summary DataFrame showing the regression model attributes and the evaluation metrics

df_metrics = pd.DataFrame.from_dict(dict_summary, orient='index', columns=['value'])
df_metrics.index.name = 'metric'
df_metrics.reset_index(drop=False, inplace=True)

Model evaluation DataFrame output

print(df_metrics)
                           metric     value
0            Coefficient (inches)  0.466667
1              Coefficient (city)  0.333333
2                       Intercept  8.666667
3       Mean Absolute Error (MAE)  1.400000
4        Mean Squared Error (MSE)  3.346667
5  Root Mean Squared Error (RMSE)  1.829390
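As a sanity check, the numbers in this table can be reproduced by hand. The sketch below assumes the rounded coefficients and intercept printed above and the three test rows (index 0, 1 and 5) from Step 4; it applies the linear formula directly and then computes the three error metrics with plain NumPy instead of sklearn.

```python
import numpy as np

# Coefficients and intercept copied from the summary table (rounded to 6 decimals)
intercept = 8.666667
coef_inches = 0.466667
coef_city = 0.333333

# The three test rows (index 0, 1, 5): inches, integer city code, actual price
inches = np.array([5, 6, 10])
city = np.array([0, 1, 2])
actual = np.array([10, 12, 17])

# Linear regression prediction: intercept + sum(coefficient * feature)
predicted = intercept + coef_inches * inches + coef_city * city

errors = actual - predicted
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

print(np.round(predicted, 1))
print(round(float(mae), 2), round(float(mse), 2), round(float(rmse), 2))
```

Note how the city term enters as 0.333333 times the integer label, so city C (code 2) adds exactly twice what city B (code 1) adds. That is the ordering assumption that label encoding builds into the model.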

With this approach, since your results end up in two Pandas DataFrames, you can use Pandas tools to visualize the results of the regression analysis.
