Linear regression with Python using sklearn

I am trying to do linear regression with Python.

For example:

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

x = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
y = [[7], [9], [13], [17.5], [18]]
model = LinearRegression()
model.fit(x, y)
x_test = [[8, 2]]
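The snippet above defines x_test but never uses it. As a minimal sketch of the missing step (same data and names as above, nothing else assumed), the fitted model can be asked for a prediction like this:

```python
from sklearn.linear_model import LinearRegression

# Same training data as in the snippet above: two features per row, one target.
x = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
y = [[7], [9], [13], [17.5], [18]]

model = LinearRegression()
model.fit(x, y)

# Predict the target for one unseen row. The result has shape (1, 1)
# because y was passed as a list of single-element lists.
x_test = [[8, 2]]
prediction = model.predict(x_test)
print(prediction)
```

Every value passed to fit and predict has to be numeric, which is why a text column needs encoding first.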

Now my example_data looks like this:

inches city  Pizza_Price
  5       A        10
  6       B        12

inches is clearly a number, but city is not.

How can I convert a city to a number for calculation?

More generally, how do I turn categorical parameters like city into numbers for calculation?

Code:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
cities = ['A','B','C','B'] # for example

le.fit(cities)
cities = le.transform(cities)
print(cities)

Output:

[0 1 2 1]

Label Encoder explanation
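To see which integer stands for which city, the fitted encoder can be inspected. This is a small sketch using the same example list; the names codes, mapping and restored are only illustrative:

```python
from sklearn.preprocessing import LabelEncoder

cities = ['A', 'B', 'C', 'B']

le = LabelEncoder()
codes = le.fit_transform(cities)

# classes_ holds the sorted unique labels; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels.
restored = [str(label) for label in le.inverse_transform(codes)]
print(restored)
```

Keeping the fitted encoder around matters because new data has to be transformed with the same mapping.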

You'll want to use pd.get_dummies() to convert each city into a binary (0/1) column. LabelEncoder assigns an integer to each category, which implies an ordering between cities that does not exist; this makes the regression formula hard to interpret and can skew it. Remember to drop one of the dummy variables to avoid multicollinearity.
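A minimal sketch of that suggestion, assuming a DataFrame shaped like the question's example_data (the rows below are made up for illustration):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up sample rows in the same layout as example_data.
df = pd.DataFrame({
    'inches': [5, 6, 7, 8, 9, 10, 11],
    'city': ['A', 'B', 'C', 'D', 'B', 'C', 'D'],
    'Pizza_Price': [10, 12, 15, 11, 12, 17, 16],
})

# One 0/1 column per city. drop_first=True drops city_A, so city A
# becomes the baseline that the other city coefficients are measured against.
X = pd.get_dummies(df[['inches', 'city']], columns=['city'],
                   drop_first=True, dtype=int)
y = df['Pizza_Price']

model = LinearRegression().fit(X, y)
print(list(X.columns))
print(dict(zip(X.columns, model.coef_)))
```

Each city coefficient can then be read as the price difference relative to city A at the same size, which is not possible with a single integer-coded column.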

From the data shown in example_data, it looks like you are working with a Pandas DataFrame. So, I would suggest another possible approach to answering your question.

Here is some data I generated in the same format as yours, but with extra rows:

import pandas as pd

# First row holds the column names, the remaining rows hold the data
d = [
    ['inches','city','Pizza_Price'],
    [5,'A',10],
    [6,'B',12],
    [7,'C',15],
    [8,'D',11],
    [9,'B',12],
    [10,'C',17],
    [11,'D',16]
    ]
df = pd.DataFrame(d[1:], columns=d[0])

print(df)
   inches city  Pizza_Price
0       5    A           10
1       6    B           12
2       7    C           15
3       8    D           11
4       9    B           12
5      10    C           17
6      11    D           16

Following @Wen-Ben's suggestion, the city column can be converted into integers using LabelEncoder (as shown in this SO post):

from sklearn.preprocessing import LabelEncoder

# Replace each city label with its integer code (A -> 0, B -> 1, ...)
df['city'] = LabelEncoder().fit_transform(df['city'])

print(df)
   inches  city  Pizza_Price
0       5     0           10
1       6     1           12
2       7     2           15
3       8     3           11
4       9     1           12
5      10     2           17
6      11     3           16

Step 1. Perform the train-test split to get the training and testing data X_train, y_train, etc.

from sklearn.model_selection import train_test_split

features = ['inches', 'city']
target = 'Pizza_Price'

X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)

# (OPTIONAL) Check number of rows in X and y of each split
print(len(X_train), len(y_train))
print(len(X_test), len(y_test))
4 4
3 3

Step 2. (Optional) Append a column to your source DataFrame (example_data) that shows which rows are used in training and testing

df['Type'] = 'Test'
df.loc[X_train.index, 'Type'] = 'Train'

Step 3. Instantiate the LinearRegression model and train the model using the training dataset - see this link from sklearn docs

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

Step 4. Now, make out-of-sample predictions on the testing data and (optionally) append the predicted values as a separate column to the example_data

  • the rows used in the training dataset will have no prediction, so they will be assigned NaN
  • the rows used in the testing dataset will have a prediction

import numpy as np

df['Predicted_Pizza_Price'] = np.nan
df.loc[X_test.index, 'Predicted_Pizza_Price'] = model.predict(X_test)

print(df)
   inches  city  Pizza_Price   Type  Predicted_Pizza_Price
0       5     0           10   Test                   11.0
1       6     1           12   Test                   11.8
2       7     2           15  Train                    NaN
3       8     3           11  Train                    NaN
4       9     1           12  Train                    NaN
5      10     2           17   Test                   14.0
6      11     3           16  Train                    NaN
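Each predicted value is just the intercept plus the coefficient-weighted feature values. With the fitted values reported in Step 5, row 0 works out to roughly 8.666667 + 0.466667 * 5 + 0.333333 * 0 = 11.0. The sketch below rebuilds the same data (with city already label-encoded) and checks that the manual formula agrees with model.predict; the name manual is only illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'inches': [5, 6, 7, 8, 9, 10, 11],
    'city': [0, 1, 2, 3, 1, 2, 3],  # label-encoded A, B, C, D
    'Pizza_Price': [10, 12, 15, 11, 12, 17, 16],
})
X = df[['inches', 'city']]
y = df['Pizza_Price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# A linear model predicts: intercept + sum(coefficient * feature value).
manual = model.intercept_ + X_test.to_numpy() @ model.coef_
print(manual)
print(model.predict(X_test))
```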

Step 5. Generate model evaluation metrics (see point number 15 from here)

  • we will generate a Pandas DataFrame showing both the (a) model evaluation metrics and (b) model properties - the Linear Regression coefficients and the intercept
  • we will first generate a Python dictionary that contains all these values and then convert the dictionary to a Pandas DataFrame

Create a blank dictionary to hold the model properties (coefficient, intercept) and evaluation metrics

dict_summary = {}

Append the coefficients and the intercept to the dictionary

for m, feature in enumerate(features):
    dict_summary['Coefficient ({})'.format(feature)] = model.coef_[m]
dict_summary['Intercept'] = model.intercept_

Append the evaluation metrics to the dictionary

from sklearn import metrics

# Actual and predicted prices for the test rows only
y_test = df.loc[X_test.index, 'Pizza_Price'].values
y_pred = df.loc[X_test.index, 'Predicted_Pizza_Price'].values
dict_summary['Mean Absolute Error (MAE)'] = metrics.mean_absolute_error(y_test, y_pred)
dict_summary['Mean Squared Error (MSE)'] = metrics.mean_squared_error(y_test, y_pred)
dict_summary['Root Mean Squared Error (RMSE)'] = np.sqrt(
    metrics.mean_squared_error(y_test, y_pred)
)

Convert the dictionary into a summary DataFrame showing the regression model properties and evaluation metrics

df_metrics = pd.DataFrame.from_dict(dict_summary, orient='index', columns=['value'])
df_metrics.index.name = 'metric'
df_metrics.reset_index(drop=False, inplace=True)

Output of model evaluation DataFrame

print(df_metrics)
                           metric     value
0            Coefficient (inches)  0.466667
1              Coefficient (city)  0.333333
2                       Intercept  8.666667
3       Mean Absolute Error (MAE)  1.400000
4        Mean Squared Error (MSE)  3.346667
5  Root Mean Squared Error (RMSE)  1.829390
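The three error metrics can be checked by hand from the test rows. This is a small sketch using plain NumPy, with the actual and predicted prices copied from the printed DataFrame in Step 4; it reproduces the MAE, MSE and RMSE values in the table above:

```python
import numpy as np

# Actual and predicted prices of the three test rows (indices 0, 1 and 5).
y_test = np.array([10, 12, 17])
y_pred = np.array([11.0, 11.8, 14.0])

errors = y_test - y_pred
mae = np.mean(np.abs(errors))   # mean of |error|
mse = np.mean(errors ** 2)      # mean of error squared
rmse = np.sqrt(mse)             # square root of the MSE
print(round(mae, 6), round(mse, 6), round(rmse, 6))
```

With only three test rows, one large miss (17 actual vs. 14.0 predicted) dominates the MSE, so these numbers say little about how the model would do on more data.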

Using this approach, since the results are held in two Pandas DataFrames, Pandas tools can be used to visualize the results of the regression analysis.
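As one possible illustration, the actual and predicted prices of the test rows can be drawn with DataFrame.plot. This is only a sketch: it assumes matplotlib is installed, the values are copied from the printed DataFrame in Step 4, and the names df_test and pizza_predictions.png are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd

# Test rows copied from the printed DataFrame in Step 4.
df_test = pd.DataFrame({
    'inches': [5, 6, 10],
    'Pizza_Price': [10, 12, 17],
    'Predicted_Pizza_Price': [11.0, 11.8, 14.0],
})

# Actual vs. predicted price per pizza size, drawn on one set of axes.
ax = df_test.plot(x='inches',
                  y=['Pizza_Price', 'Predicted_Pizza_Price'],
                  marker='o')
ax.set_ylabel('price')
ax.figure.savefig('pizza_predictions.png')  # hypothetical output file name
```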
