
Predicting admission rate using GRE in linear regression

I am learning linear regression and trying to write a simple linear regression program in a Jupyter notebook in Python. I am using the Graduate Admissions data from Kaggle (https://www.kaggle.com/mohansacharya/graduate-admissions) to model the relationship between GRE score and the chance of admission, but I keep getting a negative slope even though the correlation is positive.

This is the code that I am executing:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 9.0)

# Preprocessing Input data
data = pd.read_csv('gre.csv')
X = data.iloc[:, 1]
Y = data.iloc[:, 8]
plt.scatter(X, Y)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
plt.show()

m = 0
c = 0

L = 0.0001  # The learning Rate
epochs = 10000  # The number of iterations to perform gradient descent

n = float(len(X)) # Number of elements in X

# Performing Gradient Descent 
for i in range(epochs): 
    Y_pred = m*X + c  # The current predicted value of Y
    D_m = (-2/n) * sum(X * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2/n) * sum(Y - Y_pred)  # Derivative wrt c
    m = m - L * D_m  # Update m
    c = c - L * D_c  # Update c
    
print (m, c)

When I print m and c, I get nan and nan as output. What am I doing wrong?

The problem here is the learning rate. GRE scores are on the order of 300, so with L = 0.0001 each gradient step overshoots: m flips sign with growing magnitude (which is where the negative slope comes from) until it overflows to nan. If you decrease the learning rate you can get an okay-ish fit.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 9.0)

# Preprocessing Input data
data = pd.read_csv('gre.csv')
X = data.iloc[:, 1]
Y = data.iloc[:, 8]

m = 0
c = 0
L = 0.0000001  # The learning Rate
epochs = 100  # The number of iterations to perform gradient descent
n = float(len(X))  # Number of elements in X

# Performing Gradient Descent
for i in range(epochs):    
    Y_pred = m*X + c  # The current predicted value of Y
    D_m = (-2/n) * sum(X * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2/n) * sum(Y - Y_pred)  # Derivative wrt c
    m = m - L * D_m  # Update m
    c = c - L * D_c  # Update c    

print("Slope, Intercept:", m, c)

plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
axes = plt.gca()
Y_preds = c + m * X
plt.scatter(X, Y)
plt.plot(X, Y_preds, '--')
plt.show()

Output:

Slope, Intercept: 0.0019885000304672488 6.212311206699001e-06

[Plot: scatter of GRE score vs. chance of admission with the fitted line]
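A more robust fix than shrinking the learning rate is to standardize X before running gradient descent. Below is a minimal sketch (assuming the same gre.csv column layout as above): it fits on the standardized feature with an ordinary learning rate, then maps the coefficients back to the original GRE scale.

import numpy as np
import pandas as pd

data = pd.read_csv('gre.csv')
X = data.iloc[:, 1].values
Y = data.iloc[:, 8].values

# Standardize the feature so the gradient steps are well scaled
mu, sigma = X.mean(), X.std()
Xs = (X - mu) / sigma

m, c = 0.0, 0.0
L = 0.01        # An ordinary learning rate now works
epochs = 10000
n = float(len(Xs))

for _ in range(epochs):
    Y_pred = m * Xs + c
    D_m = (-2 / n) * np.sum(Xs * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2 / n) * np.sum(Y - Y_pred)         # Derivative wrt c
    m = m - L * D_m
    c = c - L * D_c

# Undo the standardization: y = m*(x - mu)/sigma + c
# => slope = m/sigma, intercept = c - m*mu/sigma
print("Slope, Intercept:", m / sigma, c - m * mu / sigma)

Because the feature now has zero mean and unit variance, the loss surface is well conditioned and the result should agree with the closed-form least-squares fit below.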

If you use the scikit-learn implementation you get a better fit: LinearRegression solves the ordinary least-squares problem in closed form rather than by gradient descent, so there is no learning rate to tune.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv('gre.csv')

X, y = data.iloc[:, 1], data.iloc[:, 8].values
X = X.values.reshape(-1, 1)

regr = linear_model.LinearRegression()
regr.fit(X, y)

y_pred = regr.predict(X)

# The coefficients
print('Coefficients: \n', regr.coef_)

# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y, y_pred))

# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y, y_pred))

# Plot outputs
plt.rcParams['figure.figsize'] = (12.0, 9.0)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
plt.scatter(X, y,  color='black')
plt.plot(X, y_pred, color='blue', linewidth=3)


plt.show()

Output:

Coefficients: 
 [0.01012587]
Mean squared error: 0.01
Coefficient of determination: 0.66

[Plot: scatter of the data with the scikit-learn regression line]
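As a quick cross-check (again assuming the same gre.csv), np.polyfit solves the same degree-1 least-squares problem in closed form and should reproduce the slope above:

import numpy as np
import pandas as pd

data = pd.read_csv('gre.csv')
X = data.iloc[:, 1].values
y = data.iloc[:, 8].values

# polyfit returns coefficients highest degree first: [slope, intercept]
slope, intercept = np.polyfit(X, y, 1)
print(slope, intercept)  # slope should match regr.coef_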
