
Linear regression with Python using sklearn

I am trying to do linear regression with Python. For example:

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

x = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
y = [[7], [9], [13], [17.5], [18]]
model = LinearRegression()
model.fit(x, y)
x_test = [[8, 2]]
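
For completeness (not part of the original snippet), the fitted model could then be applied to the test input - a minimal sketch, which works once all features are numeric:

print(model.predict(x_test))  # predicted y for the sample [[8, 2]]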

Now my example_data looks like:

inches  city  Pizza_Price
     5     A           10
     6     B           12

inches is already a number, but city is not.

How can I convert a categorical parameter like city into a number for the calculation?

Code:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
cities = ['A','B','C','B'] # for example

le.fit(cities)
cities = le.transform(cities)
print(cities)

Output:

[0 1 2 1]
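
If the encoded integers later need to be mapped back to the original city labels, LabelEncoder also provides inverse_transform - a minimal sketch reusing the fitted encoder above:

print(le.inverse_transform(cities))  # recovers the original labels: ['A' 'B' 'C' 'B']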

Label Encoder explanation

You'll want to use pd.get_dummies() to convert each city into binary indicator columns. LabelEncoder will assign an integer value to the variable, which will make interpreting the regression formula difficult and possibly skewed. Remember to drop one of the dummy variables to avoid multicollinearity.
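
For example, a minimal sketch on the example_data columns (the DataFrame construction here is assumed, not taken from the question):

import pandas as pd

df = pd.DataFrame({'inches': [5, 6],
                   'city': ['A', 'B'],
                   'Pizza_Price': [10, 12]})

# One-hot encode city; drop_first=True drops one dummy column to avoid multicollinearity
df_encoded = pd.get_dummies(df, columns=['city'], drop_first=True)
print(df_encoded)  # columns: inches, Pizza_Price, city_B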

From the data shown in example_data, it looks like you are working with a Pandas DataFrame, so I would suggest another possible approach to answering your question.

Here is some data I generated in the same format as yours, but with extra rows:

# Imports used throughout the steps below
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

d = [
    ['inches','city','Pizza_Price'],
    [5,'A',10],
    [6,'B',12],
    [7,'C',15],
    [8,'D',11],
    [9,'B',12],
    [10,'C',17],
    [11,'D',16]
    ]
df = pd.DataFrame(d[1:], columns=d[0])

print(df)
   inches city  Pizza_Price
0       5    A           10
1       6    B           12
2       7    C           15
3       8    D           11
4       9    B           12
5      10    C           17
6      11    D           16

The conversion of the city column into integers can be done using LabelEncoder (as shown in this SO post), per @Wen-Ben's suggestion:

df['city'] = pd.DataFrame(columns=['city'],
                        data=LabelEncoder().fit_transform(
                            df['city'].values.flatten())
                        )

print(df)
   inches  city  Pizza_Price
0       5     0           10
1       6     1           12
2       7     2           15
3       8     3           11
4       9     1           12
5      10     2           17
6      11     3           16
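
Note that a more compact equivalent is to assign the encoded array back to the column directly, since LabelEncoder().fit_transform() accepts the Series as-is:

df['city'] = LabelEncoder().fit_transform(df['city'])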

Step 1. Perform the train-test split to get the training and testing data X_train, y_train, etc.

features = ['inches', 'city']
target = 'Pizza_Price'

X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)

# (OPTIONAL) Check number of rows in X and y of each split
print(len(X_train), len(y_train))
print(len(X_test), len(y_test))
4 4
3 3

Step 2. (Optional) Append a column to your source DataFrame (example_data) that shows which rows are used in training and testing

df['Type'] = 'Test'
df.loc[X_train.index, 'Type'] = 'Train'

Step 3. Instantiate the LinearRegression model and train it on the training dataset - see this link from the sklearn docs

model = LinearRegression()
model.fit(X_train, y_train)

Step 4. Now, make out-of-sample predictions on the testing data and (optionally) append the predicted values as a separate column to the example_data

  • the rows used in the training dataset will have no prediction so they will be assigned NaN
  • the rows used in the testing dataset will have a prediction
df['Predicted_Pizza_Price'] = np.nan
df.loc[X_test.index, 'Predicted_Pizza_Price'] = model.predict(X_test)

print(df)
   inches  city  Pizza_Price   Type  Predicted_Pizza_Price
0       5     0           10   Test                   11.0
1       6     1           12   Test                   11.8
2       7     2           15  Train                    NaN
3       8     3           11  Train                    NaN
4       9     1           12  Train                    NaN
5      10     2           17   Test                   14.0
6      11     3           16  Train                    NaN

Step 5. Generate model evaluation metrics (see point number 15 from here)

  • we will generate a Pandas DataFrame showing both the (a) model evaluation metrics and (b) model properties - the Linear Regression coefficients and the intercept
  • we will first generate a Python dictionary that contains all these values and then convert the dictionary to a Pandas DataFrame

Create a blank dictionary to hold the model properties (coefficient, intercept) and evaluation metrics

dict_summary = {}

Append coefficient and intercept to dictionary

for m, feature in enumerate(features):
    dict_summary['Coefficient ({})'.format(feature)] = model.coef_[m]
dict_summary['Intercept'] = model.intercept_

Append evaluation metrics to dictionary

y_test = df.loc[X_test.index, 'Pizza_Price'].values
y_pred = df.loc[X_test.index, 'Predicted_Pizza_Price'].values
dict_summary['Mean Absolute Error (MAE)'] = metrics.mean_absolute_error(
                                                                y_test, y_pred)
dict_summary['Mean Squared Error (MSE)'] = metrics.mean_squared_error(
                                                                y_test, y_pred)
dict_summary['Root Mean Squared Error (RMSE)'] = np.sqrt(
                                    metrics.mean_squared_error(y_test, y_pred)
                                                        )

Convert dictionary into summary DataFrame showing regression model properties and evaluation metrics

df_metrics = pd.DataFrame.from_dict(dict_summary, orient='index', columns=['value'])
df_metrics.index.name = 'metric'
df_metrics.reset_index(drop=False, inplace=True)

Output of model evaluation DataFrame

print(df_metrics)
                           metric     value
0            Coefficient (inches)  0.466667
1              Coefficient (city)  0.333333
2                       Intercept  8.666667
3       Mean Absolute Error (MAE)  1.400000
4        Mean Squared Error (MSE)  3.346667
5  Root Mean Squared Error (RMSE)  1.829390

Using this approach, since the results are in two Pandas DataFrames, Pandas tools can be used to visualize the results of the regression analysis, for example:
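
A minimal sketch (assuming the df built in the steps above) that plots actual vs. predicted Pizza_Price for the test rows using the built-in Pandas plotting API:

import matplotlib.pyplot as plt

# Compare actual and predicted prices for the rows that were in the test split
test_rows = df[df['Type'] == 'Test']
ax = test_rows.plot(x='inches', y='Pizza_Price', kind='scatter',
                    color='blue', label='Actual')
test_rows.plot(x='inches', y='Predicted_Pizza_Price', kind='scatter',
               color='red', label='Predicted', ax=ax)
plt.show()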
