Can't figure out how to use LinearRegression in PySpark 1.6 with Python 2.7
My Spark version is 1.6.
My Python version is 2.7.
My data is below:
x = [300,400,500,500,800,1000,1000,1300]
y = [9500,10300,11000,12000,12400,13400,14500,15300]
+----+-----+
|   x|    y|
+----+-----+
| 300| 9500|
| 400|10300|
| 500|11000|
| 500|12000|
| 800|12400|
|1000|13400|
|1000|14500|
|1300|15300|
+----+-----+
My non-working code:
from pyspark.mllib.linalg import Vectors
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
sqlContext = SQLContext(sc)
#my data
x = [300,400,500,500,800,1000,1000,1300]
y = [9500,10300,11000,12000,12400,13400,14500,15300]
df = pd.DataFrame({'x':x, 'y':y})
df_spark=sqlCtx.createDataFrame(df)
lr = LinearRegression(maxIter=50, regParam=0.0, solver="normal", weightCol="weight")
model = lr.fit(df)
I want to run it like this example:
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
... (1.0, 2.0, Vectors.dense(1.0)),
... (0.0, 2.0, Vectors.sparse(1, [], []))], ["label", "weight", "features"])
>>> lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal", weightCol="weight")
>>> model = lr.fit(df)
I can't figure out how to convert my data into the format used in the example:
+-----+------+---------+
|label|weight| features|
+-----+------+---------+
|  1.0|   2.0|    [1.0]|
|  0.0|   2.0|(1,[],[])|
+-----+------+---------+
Any comments will be much appreciated.
Thank you for your help.
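One way to get from the two lists to the (label, features) layout of the example can be sketched as below. This is a minimal sketch, not a definitive answer: the helper names `to_label_feature_pairs` and `fit_linear_regression` and the parameter name `sql_context` are made up for illustration, and it assumes Spark 1.6, where `pyspark.ml` works with vectors from `pyspark.mllib.linalg`. The `weightCol` argument is left out because the data has no weight column.

```python
# Sketch for Spark 1.6 / Python 2.7; helper names are hypothetical.
x = [300, 400, 500, 500, 800, 1000, 1000, 1300]
y = [9500, 10300, 11000, 12000, 12400, 13400, 14500, 15300]


def to_label_feature_pairs(xs, ys):
    """Pair each y (the label) with a one-element feature list [x], as floats."""
    return [(float(yi), [float(xi)]) for xi, yi in zip(xs, ys)]


def fit_linear_regression(sql_context, xs, ys):
    """Build a (label, features) DataFrame and fit LinearRegression on it.

    Assumes `sql_context` is an existing SQLContext.
    """
    # Imported here so the pure-Python helper above works without Spark.
    from pyspark.mllib.linalg import Vectors
    from pyspark.ml.regression import LinearRegression

    rows = [(label, Vectors.dense(features))
            for label, features in to_label_feature_pairs(xs, ys)]
    df = sql_context.createDataFrame(rows, ["label", "features"])
    # No weightCol: the data above has no "weight" column.
    lr = LinearRegression(maxIter=50, regParam=0.0, solver="normal")
    return lr.fit(df)


pairs = to_label_feature_pairs(x, y)
print(pairs[0])
```

With a live SQLContext, `model = fit_linear_regression(sqlContext, x, y)` would return the fitted model, after which `model.coefficients` and `model.intercept` can be printed.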
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.linalg import Vectors
#spark conf
conf = ( SparkConf()
.setMaster("local[*]")
.setAppName('pyspark')
)
# create the SparkContext first; SQLContext needs it
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([
(1.0, Vectors.dense(1.0)),
(3.0, Vectors.dense(2.0)),
(4.0, Vectors.dense(3.0)),
(5.0, Vectors.dense(4.0)),
(2.0, Vectors.dense(5.0)),
(3.0, Vectors.dense(6.0)),
(4.0, Vectors.dense(7.0)),
(0.0, Vectors.sparse(1, [], []))], ["label", "features"])
df.show()  # show() prints the table itself and returns None
lr = LinearRegression(maxIter=50, regParam=1.12)
model = lr.fit(df)
print(model.coefficients)
print(model.intercept)
Output:
[0.24955041614]
1.87657354351
It works!
But the coefficients and intercept differ from those produced by statsmodels.api.OLS.
import numpy as np
import statsmodels.api as sm
Y = [1,3,4,5,2,3,4]
X = range(1,8)
X = sm.add_constant(X)
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())
Output:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.161
Model:                            OLS   Adj. R-squared:                 -0.007
Method:                 Least Squares   F-statistic:                    0.9608
Date:                Fri, 07 Apr 2017   Prob (F-statistic):              0.372
Time:                        02:09:45   Log-Likelihood:                -10.854
No. Observations:                   7   AIC:                             25.71
Df Residuals:                       5   BIC:                             25.60
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          2.1429      1.141      1.879      0.119        -0.789     5.075
x1             0.2500      0.255      0.980      0.372        -0.406     0.906
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.743
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.482
Skew:                           0.206   Prob(JB):                        0.786
Kurtosis:                       1.782   Cond. No.                         10.4
==============================================================================
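A likely explanation for the mismatch, offered as a sketch and not a definitive diagnosis: the Spark run fits eight rows, because the last row (label 0.0 with an empty sparse vector of size 1) is the point x = 0, y = 0, and it uses regParam=1.12, which adds a regularization penalty. The statsmodels run fits seven rows with no regularization. The NumPy check below uses `np.polyfit` as a stand-in for an unregularized least-squares line fit on both data sets; the variable names are only for illustration.

```python
import numpy as np

# The seven points used in the statsmodels run.
x7 = np.arange(1, 8, dtype=float)
y7 = np.array([1, 3, 4, 5, 2, 3, 4], dtype=float)

# The Spark DataFrame has an eighth row: label 0.0 with an empty sparse
# vector of size 1, which is the point x = 0, y = 0.
x8 = np.append(x7, 0.0)
y8 = np.append(y7, 0.0)

# Degree-1 polyfit is an unregularized least-squares line fit.
# It returns the coefficients as [slope, intercept].
slope7, intercept7 = np.polyfit(x7, y7, 1)
slope8, intercept8 = np.polyfit(x8, y8, 1)

print("7 points: slope=%.4f intercept=%.4f" % (slope7, intercept7))
print("8 points: slope=%.4f intercept=%.4f" % (slope8, intercept8))
```

The seven-point fit reproduces the statsmodels values (0.2500 and 2.1429), while the eight-point fit gives roughly 0.4286 and 1.2500. Neither matches Spark's 0.2496 and 1.8766, which is consistent with regParam=1.12 shrinking the slope. Rerunning Spark with regParam=0.0 and without the (0.0, Vectors.sparse(1, [], [])) row should bring its coefficients close to the statsmodels result.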