I am trying to do linear regression with Python.
For example:
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
x = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
y = [[7], [9], [13], [17.5], [18]]
model = LinearRegression()
model.fit(x, y)
x_test = [[8, 2]]
Now my example_data looks like this:
inches city Pizza_Price
5 A 10
6 B 12
inches is clearly a number, but city is not.
How can I convert a categorical parameter like city into numbers for the calculation?
Code:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
cities = ['A','B','C','B'] # for example
le.fit(cities)
cities = le.transform(cities)
print(cities)
Output:
[0 1 2 1]
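If it helps to see which integer each city received, the fitted encoder can be inspected. This is a minimal sketch reusing the `cities` list from the snippet above; the names `codes`, `mapping`, and `restored` are just illustrative.

```python
from sklearn.preprocessing import LabelEncoder

cities = ['A', 'B', 'C', 'B']

le = LabelEncoder()
codes = le.fit_transform(cities)

# classes_ holds the sorted unique labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(le.classes_.tolist())}
print(mapping)  # {'A': 0, 'B': 1, 'C': 2}

# inverse_transform turns integer codes back into the original labels
restored = le.inverse_transform(codes)
print(restored.tolist())  # ['A', 'B', 'C', 'B']
```

Keeping the fitted `le` object around matters if new rows need to be encoded later with the same mapping.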
You'll want to use pd.get_dummies() to convert each city into its own binary (0/1) column. LabelEncoder assigns an arbitrary integer to each category, which imposes an ordering on the cities that makes the regression coefficients hard to interpret and can skew the fit. Remember to drop one of the dummy variables to avoid multicollinearity.
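A rough sketch of what that could look like, assuming some made-up rows in the same format as example_data (the extra rows and the `new_row` values are hypothetical). `drop_first=True` is one way to drop a dummy column; the dropped city becomes the baseline.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up rows in the same format as example_data
df = pd.DataFrame({
    'inches': [5, 6, 7, 8, 9, 10, 11],
    'city': ['A', 'B', 'C', 'D', 'B', 'C', 'D'],
    'Pizza_Price': [10, 12, 15, 11, 12, 17, 16],
})

# One 0/1 column per city; drop_first=True drops the first category ('A'),
# so the remaining city coefficients are measured relative to city A
X = pd.get_dummies(df[['inches', 'city']], columns=['city'],
                   drop_first=True, dtype=int)
y = df['Pizza_Price']
print(X.columns.tolist())  # ['inches', 'city_B', 'city_C', 'city_D']

model = LinearRegression().fit(X, y)

# A new row must use the same columns: here an 8-inch pizza in city B
new_row = pd.DataFrame({'inches': [8], 'city_B': [1], 'city_C': [0], 'city_D': [0]})
print(model.predict(new_row))
```

With this encoding each city coefficient reads as "price difference versus city A at the same size", which is usually easier to interpret than a single coefficient on an integer code.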
From the data shown in example_data, it looks like you are working with a Pandas DataFrame, so I would suggest another possible approach to answering your question.
Here is some data I generated in the same format as yours, but with extra rows:
import pandas as pd
d = [
['inches','city','Pizza_Price'],
[5,'A',10],
[6,'B',12],
[7,'C',15],
[8,'D',11],
[9,'B',12],
[10,'C',17],
[11,'D',16]
]
df = pd.DataFrame(d[1:], columns=d[0])
print(df)
   inches city  Pizza_Price
0       5    A           10
1       6    B           12
2       7    C           15
3       8    D           11
4       9    B           12
5      10    C           17
6      11    D           16
The conversion of the city column into integers can be done using LabelEncoder (as shown in this SO post), per @Wen-Ben's suggestion:
from sklearn.preprocessing import LabelEncoder
# fit_transform returns a 1-D array of integer codes that can be assigned directly
df['city'] = LabelEncoder().fit_transform(df['city'])
print(df)
   inches  city  Pizza_Price
0       5     0           10
1       6     1           12
2       7     2           15
3       8     3           11
4       9     1           12
5      10     2           17
6      11     3           16
Step 1. Perform the train-test split to get the training and testing data (X_train, y_train, etc.)
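from sklearn.model_selection import train_test_split
features = ['inches', 'city']
target = 'Pizza_Price'
X = df[features]
y = df[target]
# Hold out roughly a third of the rows for testing; fix the seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)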
# (OPTIONAL) Check number of rows in X and y of each split
print(len(X_train), len(y_train))
print(len(X_test), len(y_test))
4 4
3 3
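In case the 4/3 split looks surprising for `test_size=0.33`, here is a small sketch of where the numbers come from. It assumes only the row count (7) from the data above; `rows` is a hypothetical stand-in for the DataFrame.

```python
import math
from sklearn.model_selection import train_test_split

rows = list(range(7))  # stand-in for the 7 rows of the DataFrame
train, test = train_test_split(rows, test_size=0.33, random_state=42)

# A fractional test_size is rounded up: ceil(0.33 * 7) = ceil(2.31) = 3,
# and the remaining 4 rows go to the training set
print(math.ceil(0.33 * 7))    # 3
print(len(train), len(test))  # 4 3
```

With only 7 rows the split is for illustration only; metrics computed on 3 test rows will be very noisy.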
Step 2. (Optional) Append a column to your source DataFrame (example_data) that shows which rows are used in training and testing
df['Type'] = 'Test'
df.loc[X_train.index, 'Type'] = 'Train'
Step 3. Instantiate the LinearRegression model and train it using the training dataset (see this link from the sklearn docs)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
Step 4. Now, make out-of-sample predictions on the testing data and (optionally) append the predicted values as a separate column to example_data. Rows that were used for training will show NaN in this column.
import numpy as np
# Start with NaN everywhere, then fill in predictions for the test rows only
df['Predicted_Pizza_Price'] = np.nan
df.loc[X_test.index, 'Predicted_Pizza_Price'] = model.predict(X_test)
print(df)
   inches  city  Pizza_Price   Type  Predicted_Pizza_Price
0       5     0           10   Test                   11.0
1       6     1           12   Test                   11.8
2       7     2           15  Train                    NaN
3       8     3           11  Train                    NaN
4       9     1           12  Train                    NaN
5      10     2           17   Test                   14.0
6      11     3           16  Train                    NaN
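One thing the steps so far leave open is how to price a pizza that is not in the table: its city has to be encoded with the same mapping that was used for training, so the fitted LabelEncoder needs to be kept. A minimal sketch, assuming the same made-up rows as above; the names `encoder` and `new_pizza` and the 12-inch pizza in city 'B' are hypothetical, and unlike Steps 1 to 4 the model here is fit on all rows for brevity.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'inches': [5, 6, 7, 8, 9, 10, 11],
    'city': ['A', 'B', 'C', 'D', 'B', 'C', 'D'],
    'Pizza_Price': [10, 12, 15, 11, 12, 17, 16],
})

# Keep a reference to the fitted encoder so new data can reuse its mapping
encoder = LabelEncoder()
df['city'] = encoder.fit_transform(df['city'])

model = LinearRegression().fit(df[['inches', 'city']], df['Pizza_Price'])

# Hypothetical new pizza: 12 inches, sold in city 'B'
new_pizza = pd.DataFrame({'inches': [12], 'city': encoder.transform(['B'])})
predicted = model.predict(new_pizza)[0]
print(round(float(predicted), 2))
```

Note that `encoder.transform` raises an error for a city it has not seen during fitting, so unseen cities need separate handling.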
Step 5. Generate model evaluation metrics (see point number 15 from here) and collect them in a DataFrame showing both (a) the model evaluation metrics and (b) the model properties: the linear regression coefficients and the intercept
Create a blank dictionary to hold the model properties (coefficient, intercept) and evaluation metrics
dict_summary = {}
Append coefficient and intercept to dictionary
for m, feature in enumerate(features):
    dict_summary['Coefficient ({})'.format(feature)] = model.coef_[m]
dict_summary['Intercept'] = model.intercept_
Append evaluation metrics to dictionary
from sklearn import metrics
# Actual and predicted prices for the test rows only
y_test = df.loc[X_test.index, 'Pizza_Price'].values
y_pred = df.loc[X_test.index, 'Predicted_Pizza_Price'].values
dict_summary['Mean Absolute Error (MAE)'] = metrics.mean_absolute_error(
    y_test, y_pred)
dict_summary['Mean Squared Error (MSE)'] = metrics.mean_squared_error(
    y_test, y_pred)
dict_summary['Root Mean Squared Error (RMSE)'] = np.sqrt(
    metrics.mean_squared_error(y_test, y_pred))
Convert dictionary into summary DataFrame showing regression model properties and evaluation metrics
df_metrics = pd.DataFrame.from_dict(dict_summary, orient='index', columns=['value'])
df_metrics.index.name = 'metric'
df_metrics.reset_index(drop=False, inplace=True)
Output of model evaluation DataFrame
print(df_metrics)
                           metric     value
0            Coefficient (inches)  0.466667
1              Coefficient (city)  0.333333
2                       Intercept  8.666667
3       Mean Absolute Error (MAE)  1.400000
4        Mean Squared Error (MSE)  3.346667
5  Root Mean Squared Error (RMSE)  1.829390
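As a sanity check, the three error metrics can be reproduced by hand from the test rows printed in Step 4 (index 0, 1, and 5). This sketch uses only NumPy and the actual and predicted prices shown there.

```python
import numpy as np

# Actual and predicted prices for the three test rows (index 0, 1, 5)
y_test = np.array([10, 12, 17])
y_pred = np.array([11.0, 11.8, 14.0])

errors = y_test - y_pred         # approximately [-1.0, 0.2, 3.0]
mae = np.mean(np.abs(errors))    # (1 + 0.2 + 3) / 3
mse = np.mean(errors ** 2)       # (1 + 0.04 + 9) / 3
rmse = np.sqrt(mse)

print(f"{mae:.6f} {mse:.6f} {rmse:.6f}")  # 1.400000 3.346667 1.829390
```

The large error on the 10-inch pizza (predicted 14.0, actual 17) dominates the MSE and RMSE, which is expected since squaring weights big misses more heavily.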
Using this approach, since you have the results in two Pandas DataFrames, Pandas tools can be used to visualize the results of the regression analysis.
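For instance, a quick actual-versus-predicted plot could look something like the sketch below. It assumes matplotlib is installed and rebuilds just the three test rows from Step 4; the name `df_test` and the output file name are hypothetical, and the `Agg` backend line is only there so the snippet runs as a plain script (drop it in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import pandas as pd

# Test rows (index 0, 1, 5) with actual and predicted prices from Step 4
df_test = pd.DataFrame({
    'inches': [5, 6, 10],
    'Pizza_Price': [10, 12, 17],
    'Predicted_Pizza_Price': [11.0, 11.8, 14.0],
})

# One line per column, with pizza size on the x-axis
ax = df_test.plot(x='inches', y=['Pizza_Price', 'Predicted_Pizza_Price'],
                  marker='o')
ax.set_ylabel('Price')
ax.figure.savefig('pizza_predictions.png')
```

The same pattern works on df_metrics, for example `df_metrics.plot.bar(x='metric', y='value')`, to compare coefficients and error metrics side by side.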