I am learning linear regression and am trying to write a simple linear regression program in Python in a Jupyter notebook. I am using the Graduate Admissions data from Kaggle (https://www.kaggle.com/mohansacharya/graduate-admissions)
to predict the relationship between GRE score and the chance of admission, but I keep getting a negative slope even though the correlation is positive.
This is the code I am executing:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 9.0)
# Preprocessing Input data
data = pd.read_csv('gre.csv')
X = data.iloc[:, 1]
Y = data.iloc[:, 8]
plt.scatter(X, Y)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
plt.show()
m = 0
c = 0
L = 0.0001 # The learning Rate
epochs = 10000 # The number of iterations to perform gradient descent
n = float(len(X)) # Number of elements in X
# Performing Gradient Descent
for i in range(epochs):
    Y_pred = m*X + c  # The current predicted value of Y
    D_m = (-2/n) * sum(X * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2/n) * sum(Y - Y_pred)  # Derivative wrt c
    m = m - L * D_m  # Update m
    c = c - L * D_c  # Update c
print(m, c)
When I print m and c, I get 'nan' and 'nan' as output. What am I doing wrong?
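The problem here is the learning rate. GRE scores are in the hundreds, so the gradient with respect to m is very large, and with L = 0.0001 each update overshoots further than the last until m and c blow up to nan. If you decrease the learning rate, you can get an okayish fit: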
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 9.0)
# Preprocessing Input data
data = pd.read_csv('gre.csv')
X = data.iloc[:, 1]
Y = data.iloc[:, 8]
m = 0
c = 0
L = 0.0000001 # The learning Rate
epochs = 100 # The number of iterations to perform gradient descent
n = float(len(X)) # Number of elements in X
# Performing Gradient Descent
for i in range(epochs):
    Y_pred = m*X + c  # The current predicted value of Y
    D_m = (-2/n) * sum(X * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2/n) * sum(Y - Y_pred)  # Derivative wrt c
    m = m - L * D_m  # Update m
    c = c - L * D_c  # Update c
print("Slope, Intercept:", m, c)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
axes = plt.gca()
Y_preds = c + m * X
plt.scatter(X, Y)
plt.plot(X, Y_preds, '--')
plt.show()
Output:
Slope, Intercept: 0.0019885000304672488 6.212311206699001e-06
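The fit is only okayish because, at such a small learning rate, the intercept barely moves in 100 steps. One common workaround, which is an assumption here and not part of the code above, is to standardize X before running gradient descent so that a much larger learning rate stays stable. A minimal sketch on synthetic data standing in for gre.csv (the slope, intercept, and noise level used to generate it are made up for illustration):

```python
import numpy as np

# Synthetic stand-in for the Kaggle data (all values are made up for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(290, 340, size=500)              # GRE-like scores
Y = 0.01 * X - 2.4 + rng.normal(0, 0.05, 500)    # roughly linear chance of admission

# Standardize the feature: zero mean, unit variance
x_mean, x_std = X.mean(), X.std()
Xs = (X - x_mean) / x_std

m, c = 0.0, 0.0
L = 0.1        # stable on the standardized feature; far too large for raw scores
epochs = 1000
n = float(len(Xs))

for _ in range(epochs):
    Y_pred = m * Xs + c                          # current prediction
    D_m = (-2 / n) * np.sum(Xs * (Y - Y_pred))   # derivative wrt m
    D_c = (-2 / n) * np.sum(Y - Y_pred)          # derivative wrt c
    m -= L * D_m
    c -= L * D_c

# Convert the parameters back to the original units of X
slope = m / x_std
intercept = c - m * x_mean / x_std
print("Slope, Intercept:", slope, intercept)
```

On this synthetic data the recovered slope and intercept should agree closely with what np.polyfit(X, Y, 1) returns; on the real file the numbers will of course differ.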
If you use the scikit-learn implementation, you get a better fit, because it computes the least-squares estimates directly rather than running gradient descent, so the result does not depend on a learning rate.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
data = pd.read_csv('gre.csv')
X, y = data.iloc[:, 1], data.iloc[:, 8].values
X = X.values.reshape(-1, 1)
regr = linear_model.LinearRegression()
regr.fit(X, y)
y_pred = regr.predict(X)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y, y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y, y_pred))
# Plot outputs
plt.rcParams['figure.figsize'] = (12.0, 9.0)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
plt.scatter(X, y, color='black')
plt.plot(X, y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Output:
Coefficients:
[0.01012587]
Mean squared error: 0.01
Coefficient of determination: 0.66
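For a single feature, the least-squares estimates have a closed form, which is why no learning rate is needed. The sketch below illustrates that formula with plain NumPy on made-up points; it is not scikit-learn's actual implementation, and the data are not from gre.csv:

```python
import numpy as np

# Made-up points lying exactly on y = 2x + 1 (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Closed-form simple linear regression:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
#   intercept = mean(y) - slope * mean(x)
x_c = x - x.mean()
slope = np.sum(x_c * (y - y.mean())) / np.sum(x_c ** 2)
intercept = y.mean() - slope * x.mean()

print(slope, intercept)  # 2.0 1.0
```

Because the points lie exactly on a line, the formula recovers the slope and intercept used to generate them.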