
Predicting admission rate using GRE in linear regression

I am learning linear regression and trying to write a simple linear regression program in Python in a Jupyter notebook. I am using data from Kaggle (https://www.kaggle.com/mohansacharya/graduate-admissions) to predict the relationship between GRE score and the chance of admission, but I keep getting a negative slope, even though the correlation is positive.

This is the code that I am executing:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 9.0)

# Preprocessing Input data
data = pd.read_csv('gre.csv')
X = data.iloc[:, 1]
Y = data.iloc[:, 8]
plt.scatter(X, Y)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
plt.show()

m = 0
c = 0

L = 0.0001  # The learning Rate
epochs = 10000  # The number of iterations to perform gradient descent

n = float(len(X)) # Number of elements in X

# Performing Gradient Descent 
for i in range(epochs): 
    Y_pred = m*X + c  # The current predicted value of Y
    D_m = (-2/n) * sum(X * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2/n) * sum(Y - Y_pred)  # Derivative wrt c
    m = m - L * D_m  # Update m
    c = c - L * D_c  # Update c
    
print (m, c)

When I print m and c I get 'nan' and 'nan' as output. What am I doing wrong?

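The problem here is the learning rate. GRE scores are around 300, so the gradient with respect to m is very large, and a step size of 0.0001 overshoots on every iteration until m and c overflow to nan. If you decrease the learning rate you can get an okay fit.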

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 9.0)

# Preprocessing Input data
data = pd.read_csv('gre.csv')
X = data.iloc[:, 1]
Y = data.iloc[:, 8]

m = 0
c = 0
L = 0.0000001  # The learning Rate
epochs = 100  # The number of iterations to perform gradient descent
n = float(len(X))  # Number of elements in X

# Performing Gradient Descent
for i in range(epochs):    
    Y_pred = m*X + c  # The current predicted value of Y
    D_m = (-2/n) * sum(X * (Y - Y_pred))  # Derivative wrt m
    D_c = (-2/n) * sum(Y - Y_pred)  # Derivative wrt c
    m = m - L * D_m  # Update m
    c = c - L * D_c  # Update c    

print("Slope, Intercept:", m, c)

plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
axes = plt.gca()
Y_preds = c + m * X
plt.scatter(X, Y)
plt.plot(X, Y_preds, '--')
plt.show()

Output:

Slope, Intercept: 0.0019885000304672488 6.212311206699001e-06
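To see why 0.0001 diverges while 0.0000001 does not, it can help to estimate the largest learning rate that is still stable. For a quadratic loss such as mean squared error, gradient descent only converges when the learning rate is below 2 divided by the largest eigenvalue of the Hessian. The sketch below uses synthetic GRE-like scores (an assumption, since gre.csv is not reproduced here) rather than the real column.

```python
import numpy as np

# Synthetic stand-in for the GRE column (assumption: scores roughly 290-340).
rng = np.random.default_rng(0)
X = rng.uniform(290, 340, size=500)
n = len(X)

# Hessian of the mean squared error (1/n) * sum((Y - (m*X + c))**2)
# with respect to the parameters (m, c). It does not depend on Y.
H = (2.0 / n) * np.array([[np.sum(X ** 2), np.sum(X)],
                          [np.sum(X),      n]])

# Gradient descent on a quadratic is stable only if L < 2 / lambda_max.
lam_max = np.linalg.eigvalsh(H).max()
max_stable_lr = 2.0 / lam_max

print("largest eigenvalue:", lam_max)
print("largest stable learning rate:", max_stable_lr)
print("0.0001 is stable:", 1e-4 < max_stable_lr)
print("0.0000001 is stable:", 1e-7 < max_stable_lr)
```

With inputs of this size the bound comes out on the order of 1e-5, which is consistent with 0.0001 blowing up and 0.0000001 converging (slowly).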

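An alternative to shrinking the learning rate is to standardize X before running gradient descent, which lets a much larger learning rate converge and also lets the intercept move away from zero. The following is a minimal sketch under stated assumptions: the data is synthetic, and the "true" slope and intercept (0.01 and -2.44) are hypothetical values chosen only for illustration.

```python
import numpy as np

# Synthetic stand-in for gre.csv (assumption, not the real data).
rng = np.random.default_rng(0)
X = rng.uniform(290, 340, size=500)
Y = 0.01 * X - 2.44 + rng.normal(0, 0.05, size=500)  # hypothetical line plus noise

# Standardize X to zero mean and unit variance.
mu, sigma = X.mean(), X.std()
Xs = (X - mu) / sigma

m_s, c_s = 0.0, 0.0   # slope and intercept on the standardized scale
L = 0.1               # a much larger learning rate is now stable
epochs = 1000
n = float(len(Xs))

for _ in range(epochs):
    Y_pred = m_s * Xs + c_s
    D_m = (-2 / n) * np.sum(Xs * (Y - Y_pred))  # derivative wrt m_s
    D_c = (-2 / n) * np.sum(Y - Y_pred)         # derivative wrt c_s
    m_s -= L * D_m
    c_s -= L * D_c

# Convert the parameters back to the original GRE scale.
m = m_s / sigma
c = c_s - m_s * mu / sigma
print("Slope, Intercept:", m, c)
```

The same loop as in the answer is used; only the input scaling and the final conversion back to the original units are new.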

If you use the scikit-learn implementation you get a better fit, because it solves the least-squares problem directly rather than running gradient descent, so there is no learning rate to tune. Note that LinearRegression does not rescale the features by default.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv('gre.csv')

X, y = data.iloc[:, 1], data.iloc[:, 8].values
X = X.values.reshape(-1, 1)

regr = linear_model.LinearRegression()
regr.fit(X, y)

y_pred = regr.predict(X)

# The coefficients
print('Coefficients: \n', regr.coef_)

# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y, y_pred))

# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y, y_pred))

# Plot outputs
plt.rcParams['figure.figsize'] = (12.0, 9.0)
plt.xlabel('GRE score')
plt.ylabel('Chance of getting into university %')
plt.scatter(X, y,  color='black')
plt.plot(X, y_pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

Output:

Coefficients: 
 [0.01012587]
Mean squared error: 0.01
Coefficient of determination: 0.66
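For comparison, a direct least-squares solve can also be written with NumPy alone. This is only a sketch of the general technique, not a description of scikit-learn's internals, and it again uses synthetic data with a hypothetical slope and intercept in place of gre.csv.

```python
import numpy as np

# Synthetic stand-in for gre.csv (assumption, not the real data).
rng = np.random.default_rng(0)
X = rng.uniform(290, 340, size=500)
Y = 0.01 * X - 2.44 + rng.normal(0, 0.05, size=500)  # hypothetical line plus noise

# Design matrix: one column for the slope, one column of ones for the intercept.
A = np.column_stack([X, np.ones_like(X)])

# Solve min ||A @ [m, c] - Y||^2 in one step, with no learning rate or epochs.
(m, c), *_ = np.linalg.lstsq(A, Y, rcond=None)
print("Slope, Intercept:", m, c)
```

Because the solution is computed in one step, the scale of the GRE scores does not cause the divergence seen with gradient descent.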

