
How to calculate theta in a univariate linear regression model?

Original post: 2018-01-22 01:27:21 · Tags: machine-learning, linear-regression

I have the hypothesis function h(x) = theta0 + theta1*x.

How can I select the theta0 and theta1 values for the linear regression model?

2 Answers

It is unclear from the question whether you would like to do this by hand (with the underlying math), use a program like Excel, or solve it in a language like MATLAB or Python.

To start, here is a website offering a summary of the math involved for a univariate calculation: http://www.statisticshowto.com/probability-and-statistics/regression-analysis/find-a-linear-regression-equation/

Here is some discussion of the matrix formulation of the multivariate problem (I know you asked for univariate, but some people find that the matrix formulation helps them conceptualize the problem): https://onlinecourses.science.psu.edu/stat501/node/382

Based on the level of the question, we should start with a bit of intuition. The goal of a linear regression is to find the parameters (in your case, the thetas) that minimize the distance between the fitted line and the observed data points (usually the sum of the squared vertical distances). You have two "free" parameters in the equation you defined. First, theta0: this is the intercept. The intercept is the value of the response variable (h(x)) when the input variable (x) is 0. Visually, this is the point where the line crosses the y axis. The second parameter you have defined is the slope (theta1), which expresses how much the response variable changes when the input changes. If theta1 = 0, h(x) does not change when x changes. If theta1 = 1, h(x) increases and decreases at the same rate as x. If theta1 = -1, h(x) responds in the opposite direction: if x increases, h(x) decreases by the same amount; if x decreases, h(x) increases by the same amount.
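To make the roles of the two parameters concrete, here is a minimal Python sketch of the hypothesis function. The numeric values (theta0 = 2.0, x = 3 and x = 4) are made up purely for illustration and are not from the question.

```python
def h(x, theta0, theta1):
    """Hypothesis: predicted response for input x."""
    return theta0 + theta1 * x

# With x = 0 the prediction equals the intercept theta0.
intercept_only = h(0, theta0=2.0, theta1=0.5)

# Increasing x by 1 changes the prediction by exactly theta1.
step_flat = h(4, 2.0, 0.0) - h(3, 2.0, 0.0)    # theta1 = 0: no change
step_up = h(4, 2.0, 1.0) - h(3, 2.0, 1.0)      # theta1 = 1: same direction
step_down = h(4, 2.0, -1.0) - h(3, 2.0, -1.0)  # theta1 = -1: opposite direction
```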

For more information, Mathworks provides a fairly comprehensive explanation: https://www.mathworks.com/help/symbolic/mupad_ug/univariate-linear-regression.html

So after getting a handle on what we are doing conceptually, let's take a stab at the math. We'll need to calculate the standard deviation of our two variables, x and h(x) (here h(x) stands for the observed response values, often written y). To calculate the standard deviation, we first calculate the mean of each variable (sum up all the x's and then divide by the number of x's; do the same for h(x)). The standard deviation captures how much a variable typically differs from its mean. For each x, subtract the mean of x and square the result. Sum these squared differences and then divide by the number of x's minus 1. Finally, take the square root. This is your standard deviation.
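The mean and standard deviation steps can be sketched in Python as follows. The sample data and the names xs, ys, mean and sample_std are assumptions made for illustration; ys plays the role of the observed h(x) values. Note that each difference from the mean is squared before summing, since the raw differences always sum to zero.

```python
import math

def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation, dividing by N - 1."""
    m = mean(values)
    squared_diffs = [(v - m) ** 2 for v in values]
    return math.sqrt(sum(squared_diffs) / (len(values) - 1))

# Hypothetical data, invented for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

x_std = sample_std(xs)
y_std = sample_std(ys)
```

The standard library function statistics.stdev computes the same N - 1 quantity, so it could be used instead of the hand-written version.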

Using this, we can normalize both variables. For each x, subtract the mean of x and divide by the standard deviation of x. Do the same for h(x). You will now have two lists of normalized numbers.
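A minimal sketch of the normalization step, using Python's statistics module. The data and the helper name standardize are assumptions for illustration; ys stands for the observed h(x) values.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (N - 1)
    return [(v - m) / s for v in values]

# Hypothetical data, invented for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

zx = standardize(xs)
zy = standardize(ys)
```

After this step each list has mean 0 and standard deviation 1, which is a quick sanity check on the arithmetic.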

Multiply each normalized x value by its paired normalized h(x) value (the first normalized x with the first normalized h(x), and so on for all values). Add these products together and divide by N - 1, matching the N - 1 used for the standard deviations. This gives you the correlation. To get the least squares estimate of theta1, multiply this correlation by the standard deviation of h(x) and divide by the standard deviation of x.
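The correlation and slope steps might look like this in Python. The data and function names are assumptions for illustration. This sketch divides the sum of products by N - 1 so that it is consistent with the sample standard deviation (which also uses N - 1); dividing by N would only be consistent with a population standard deviation.

```python
import statistics

def correlation(xs, ys):
    """Pearson correlation from products of standardized values."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

def slope(xs, ys):
    """Least squares slope: theta1 = r * std(y) / std(x)."""
    return correlation(xs, ys) * statistics.stdev(ys) / statistics.stdev(xs)

# Hypothetical data, invented for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(xs, ys)
theta1 = slope(xs, ys)
```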

Given all this information, calculating the intercept (theta0) is easy: take the mean of h(x) and subtract the product of the calculated theta1 and the mean of x. That is, theta0 = mean(h(x)) - theta1 * mean(x).
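Putting the pieces together, here is one possible end-to-end sketch. It uses the fact that correlation times std(y) / std(x) simplifies algebraically to the sum of cross-deviations divided by the sum of squared x-deviations, so the standard deviations never have to be computed explicitly. The data and the name fit_line are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least squares fit of y = theta0 + theta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx                 # slope
    theta0 = mean_y - theta1 * mean_x  # intercept
    return theta0, theta1

# Hypothetical data, invented for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

theta0, theta1 = fit_line(xs, ys)
```

A useful check is that the fitted line always passes through the point (mean of x, mean of y), which follows directly from the intercept formula.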

Phew! All taken care of! We have our least squares solution for those two variables. Let me know if you have any questions! One last excellent resource: https://people.duke.edu/~rnau/mathreg.htm

If you are asking about the hypothesis function in linear regression, those theta values are typically found by an algorithm called gradient descent. It repeatedly adjusts the theta values in the direction that reduces the cost function until the cost stops decreasing.
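As a rough illustration, a batch gradient descent for this two-parameter model could be sketched as below. The answer does not specify a cost function, so this assumes the common choice of mean squared error divided by 2; the data, learning rate, and iteration count are also assumptions picked so that this small example converges.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Minimize J = (1 / (2n)) * sum((theta0 + theta1*x - y)^2)."""
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Prediction errors under the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to theta0 and theta1.
        grad0 = sum(errors) / n
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n
        # Update both parameters simultaneously.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1

# Hypothetical data, invented for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

theta0, theta1 = gradient_descent(xs, ys)
```

For a single input variable the closed-form least squares solution gives the same answer directly; gradient descent is mostly useful when there are many features or a lot of data. A learning rate that is too large for the scale of x can make the updates diverge instead of converge.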