Linear regression in Python using sklearn

Question

I am trying to do linear regression in Python.

For example:

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

x = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
y = [[7], [9], [13], [17.5], [18]]
model = LinearRegression()
model.fit(x, y)
x_test = [[8, 2]]

Now my example_data looks like this:

inches city  Pizza_Price
  5       A        10
  6       B        12

inches is clearly a number, but city is not.

How can I convert a categorical parameter such as city into numbers for the calculation?

Answer 1

Code:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
cities = ['A','B','C','B'] # for example

le.fit(cities)
cities = le.transform(cities)
print(cities)

Output:

[0 1 2 1]

See the LabelEncoder documentation.
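As a small follow-up sketch (not part of the original answer), the fitted encoder keeps its label-to-integer mapping in `classes_`, and `inverse_transform` turns the integers back into city labels. The city list is the same made-up example as above.

```python
from sklearn.preprocessing import LabelEncoder

cities = ['A', 'B', 'C', 'B']  # same illustrative labels as above

le = LabelEncoder()
codes = le.fit_transform(cities)  # fit and transform in one step

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
restored = le.inverse_transform(codes).tolist()
print(restored)
```

The mapping is useful when reading predictions back, since the model itself only ever sees the integer codes.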

Answer 2

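You will want to use pd.get_dummies() to convert each city into a binary (0/1) column. A label encoder assigns an integer to each category, which imposes an arbitrary order on the cities, makes the regression formula hard to interpret, and may even bias the result. Remember to drop one of the dummy variables to avoid multicollinearity.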
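A minimal sketch of this approach, assuming data laid out like the question's example_data. The rows, the variable names, and the new pizza being priced are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative rows in the same layout as the question's example_data
df = pd.DataFrame({
    'inches': [5, 6, 7, 8, 9, 10, 11],
    'city': ['A', 'B', 'C', 'D', 'B', 'C', 'D'],
    'Pizza_Price': [10, 12, 15, 11, 12, 17, 16],
})

# One 0/1 column per city; drop_first=True drops the first category ('A'),
# which becomes the baseline and avoids perfectly collinear dummy columns
X = pd.get_dummies(df[['inches', 'city']], columns=['city'],
                   drop_first=True, dtype=int)
y = df['Pizza_Price']

model = LinearRegression().fit(X, y)

# A new row must be given exactly the same dummy columns as the training data
new = pd.DataFrame({'inches': [8], 'city': ['B']})
new_X = (pd.get_dummies(new, columns=['city'], dtype=int)
           .reindex(columns=X.columns, fill_value=0))
prediction = model.predict(new_X)

print(list(X.columns))
print(prediction)
```

Each city coefficient can then be read as the price difference relative to the dropped baseline city, which is what makes this encoding easier to interpret than integer labels.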

Answer 3

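From the data shown in example_data, it looks like you are working with a Pandas DataFrame. So here is another possible way to answer your question.

Here is some data I generated in the same format as yours, but with extra rows: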

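import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

d = [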
    ['inches','city','Pizza_Price'],
    [5,'A',10],
    [6,'B',12],
    [7,'C',15],
    [8,'D',11],
    [9,'B',12],
    [10,'C',17],
    [11,'D',16]
    ]
df = pd.DataFrame(d[1:], columns=d[0])

print(df)
   inches city  Pizza_Price
0       5    A           10
1       6    B           12
2       7    C           15
3       8    D           11
4       9    B           12
5      10    C           17
6      11    D           16

As @Wen-Ben suggested, the city column can be converted to integers using LabelEncoder (as shown in this SO post).

# Replace each city label with its integer code
df['city'] = LabelEncoder().fit_transform(df['city'])

print(df)
   inches  city  Pizza_Price
0       5     0           10
1       6     1           12
2       7     2           15
3       8     3           11
4       9     1           12
5      10     2           17
6      11     3           16

Step 1. Perform a train-test split to get the training and testing data X_train, y_train, etc.

features = ['inches', 'city']
target = 'Pizza_Price'

X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)

# (OPTIONAL) Check number of rows in X and y of each split
print(len(X_train), len(y_train))
print(len(X_test), len(y_test))
4 4
3 3

Step 2. (Optional) Add a column to the source DataFrame (example_data) showing which rows were used for training and which for testing

df['Type'] = 'Test'
df.loc[X_train.index, 'Type'] = 'Train'

Step 3. Instantiate the LinearRegression model and train it on the training dataset (see this link in the sklearn docs)

model = LinearRegression()
model.fit(X_train, y_train)

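Step 4. Now make out-of-sample predictions on the test data and (optionally) append the predicted values to example_data as a separate column

Rows used in the training dataset have no prediction, so they are assigned NaN
Rows used in the testing dataset have a prediction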

df['Predicted_Pizza_Price'] = np.nan
df.loc[X_test.index, 'Predicted_Pizza_Price'] = model.predict(X_test)

print(df)
   inches  city  Pizza_Price   Type  Predicted_Pizza_Price
0       5     0           10   Test                   11.0
1       6     1           12   Test                   11.8
2       7     2           15  Train                    NaN
3       8     3           11  Train                    NaN
4       9     1           12  Train                    NaN
5      10     2           17   Test                   14.0
6      11     3           16  Train                    NaN

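Step 5. Generate model evaluation metrics (see point 15, from here)

We will generate a Pandas DataFrame showing both (a) the model evaluation metrics and (b) the model attributes: the linear regression coefficients and the intercept
We will first build a Python dictionary containing all of these values and then convert it to a Pandas DataFrame

Create an empty dictionary to hold the model attributes (coefficients, intercept) and the evaluation metrics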

dict_summary = {}

Append the coefficients and the intercept to the dictionary

for m,feature in enumerate(features):
    dict_summary['Coefficient ({})' .format(feature)] = model.coef_[m]
dict_summary['Intercept'] = model.intercept_

Append the evaluation metrics to the dictionary

y_test = df.loc[X_test.index, 'Pizza_Price'].values
y_pred = df.loc[X_test.index, 'Predicted_Pizza_Price'].values
dict_summary['Mean Absolute Error (MAE)'] = metrics.mean_absolute_error(
                                                                y_test, y_pred)
dict_summary['Mean Squared Error (MSE)'] = metrics.mean_squared_error(
                                                                y_test, y_pred)
dict_summary['Root Mean Squared Error (RMSE)'] = np.sqrt(
                                    metrics.mean_squared_error(y_test, y_pred)
                                                        )

Convert the dictionary into a summary DataFrame showing the regression model attributes and evaluation metrics

df_metrics = pd.DataFrame.from_dict(dict_summary, orient='index', columns=['value'])
df_metrics.index.name = 'metric'
df_metrics.reset_index(drop=False, inplace=True)

Model evaluation DataFrame output

print(df_metrics)
                           metric     value
0            Coefficient (inches)  0.466667
1              Coefficient (city)  0.333333
2                       Intercept  8.666667
3       Mean Absolute Error (MAE)  1.400000
4        Mean Squared Error (MSE)  3.346667
5  Root Mean Squared Error (RMSE)  1.829390
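As a sanity check, the three error metrics can be recomputed by hand from the test rows (0, 1 and 5) in the prediction table printed earlier. This is only a sketch; the names `y_true`, `y_hat` and `errors` are illustrative and not part of the answer's code.

```python
import numpy as np

# Actual and predicted prices of the three test rows (0, 1 and 5)
y_true = np.array([10.0, 12.0, 17.0])
y_hat = np.array([11.0, 11.8, 14.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # (1 + 0.2 + 3) / 3
mse = np.mean(errors ** 2)      # (1 + 0.04 + 9) / 3
rmse = np.sqrt(mse)             # square root of the MSE

print(round(mae, 6), round(mse, 6), round(rmse, 6))
```

The results agree with the MAE, MSE and RMSE rows of the summary table, up to rounding.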

With this approach, since the results are held in 2 Pandas DataFrames, you can use Pandas tools to visualize the results of the regression analysis.
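One possible way to do that, sketched here under the assumption that matplotlib is installed (pandas uses it for plotting). The `results` frame simply copies the three test rows from the prediction table, and its name is illustrative.

```python
import pandas as pd

# Test rows copied from the prediction table (rows 0, 1 and 5)
results = pd.DataFrame({
    'inches': [5, 6, 10],
    'Pizza_Price': [10, 12, 17],
    'Predicted_Pizza_Price': [11.0, 11.8, 14.0],
})

# Actual vs. predicted price against pizza size, drawn on the same axes
ax = results.plot.scatter(x='inches', y='Pizza_Price',
                          color='C0', label='Actual')
results.plot.scatter(x='inches', y='Predicted_Pizza_Price',
                     color='C1', label='Predicted', ax=ax)
ax.set_ylabel('Pizza_Price')  # the second plot call overwrote the y label
```

In a notebook the figure is rendered automatically; in a script, `ax.figure.savefig(...)` writes it to a file of your choice.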

3 answers

Answer 1  0  accepted  2019-04-24 02:14:30
Answer 2  0  2019-04-24 02:26:42
Answer 3  0  2019-04-24 15:12:27
