Can't figure out how to use LinearRegression in PySpark 1.6 with Python 2.7

Question

My spark version is 1.6
My python version is 2.7

My data is shown below:

x = [300,400,500,500,800,1000,1000,1300]  
y = [9500,10300,11000,12000,12400,13400,14500,15300]


+----+-----+
|   x|    y|
+----+-----+
| 300| 9500|
| 400|10300|
| 500|11000|
| 500|12000|
| 800|12400|
|1000|13400|
|1000|14500|
|1300|15300|
+----+-----+

My code, which does not work:

from pyspark.mllib.linalg import Vectors
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression 
sqlContext = SQLContext(sc)
#my data
x = [300,400,500,500,800,1000,1000,1300]
y = [9500,10300,11000,12000,12400,13400,14500,15300]

df = pd.DataFrame({'x':x, 'y':y})
df_spark=sqlCtx.createDataFrame(df)

lr = LinearRegression(maxIter=50, regParam=0.0, solver="normal", weightCol="weight")
model = lr.fit(df)

I want to run something like this example:

>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
...     (1.0, 2.0, Vectors.dense(1.0)),
...     (0.0, 2.0, Vectors.sparse(1, [], []))], ["label", "weight", "features"])
>>> lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal", weightCol="weight")
>>> model = lr.fit(df)

I can't figure out how to convert my data into the data type used in the example.

+-----+------+---------+
|label|weight| features|
+-----+------+---------+
|  1.0|   2.0|    [1.0]|
|  0.0|   2.0|(1,[],[])|
+-----+------+---------+

Any comments will be much appreciated.
Thank you for your help.

Answer 1

from pyspark import  SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.linalg import Vectors
# Spark configuration
conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName('pyspark'))
# Create the SparkContext first; SQLContext needs it as an argument
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)


df = sqlContext.createDataFrame([
(1.0, Vectors.dense(1.0)),
(3.0, Vectors.dense(2.0)),
(4.0, Vectors.dense(3.0)),
(5.0, Vectors.dense(4.0)),
(2.0, Vectors.dense(5.0)),
(3.0, Vectors.dense(6.0)),
(4.0, Vectors.dense(7.0)),
(0.0, Vectors.sparse(1, [], []))], ["label", "features"])

df.show()  # show() prints the table itself and returns None

lr = LinearRegression(maxIter=50, regParam=1.12)
model = lr.fit(df)
print(model.coefficients)
print(model.intercept)

Outputs:

[0.24955041614]  
1.87657354351
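
The same (label, features) pattern could be applied to the x and y lists from the question. The following is only a sketch: it assumes y is the label and x is the single feature, that a sqlContext already exists, and the helper name fit_from_pairs is hypothetical. weightCol is left out because the data has no weight column.

```python
x = [300, 400, 500, 500, 800, 1000, 1000, 1300]
y = [9500, 10300, 11000, 12000, 12400, 13400, 14500, 15300]

# Plain-Python (label, feature values) pairs; labels and features must be floats
pairs = [(float(yi), [float(xi)]) for xi, yi in zip(x, y)]


def fit_from_pairs(sqlContext, pairs):
    """Hypothetical helper: build a (label, features) DataFrame and fit it.

    Assumes the Spark 1.6 API, where pyspark.ml uses pyspark.mllib.linalg vectors.
    """
    from pyspark.mllib.linalg import Vectors
    from pyspark.ml.regression import LinearRegression

    rows = [(label, Vectors.dense(values)) for label, values in pairs]
    df = sqlContext.createDataFrame(rows, ["label", "features"])
    # regParam=0.0 means no regularization, i.e. plain least squares
    lr = LinearRegression(maxIter=50, regParam=0.0, solver="normal")
    return lr.fit(df)
```

With an existing sqlContext, model = fit_from_pairs(sqlContext, pairs) would return the fitted model, and model.coefficients and model.intercept can be printed as above.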

The Spark fit ran successfully.
However, the coefficient and intercept differ from the ones produced by statsmodels.api.OLS.

import numpy as np
import statsmodels.api as sm

Y = [1,3,4,5,2,3,4]
X = range(1,8)
X = sm.add_constant(X)

model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())

Outputs:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.161
Model:                            OLS   Adj. R-squared:                 -0.007
Method:                 Least Squares   F-statistic:                    0.9608
Date:                Fri, 07 Apr 2017   Prob (F-statistic):              0.372
Time:                        02:09:45   Log-Likelihood:                -10.854
No. Observations:                   7   AIC:                             25.71
Df Residuals:                       5   BIC:                             25.60
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          2.1429      1.141      1.879      0.119        -0.789     5.075
x1             0.2500      0.255      0.980      0.372        -0.406     0.906
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.743
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.482
Skew:                           0.206   Prob(JB):                        0.786
Kurtosis:                       1.782   Cond. No.                         10.4
==============================================================================
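
Part of the gap may come from the two runs not being like-for-like: the Spark DataFrame has an eighth row (label 0.0 with an empty sparse vector, i.e. x = 0), and the Spark fit uses regParam=1.12 rather than 0. A rough NumPy check, using np.polyfit as a stand-in for unregularized least squares, is sketched below; the variable names are illustrative only.

```python
import numpy as np

# Seven points used in the statsmodels run
x7 = [1, 2, 3, 4, 5, 6, 7]
y7 = [1, 3, 4, 5, 2, 3, 4]

# Eight points used in the Spark run: the same seven plus (x=0, y=0)
x8 = x7 + [0]
y8 = y7 + [0]

# polyfit with degree 1 returns [slope, intercept] of the least-squares line
slope7, intercept7 = np.polyfit(x7, y7, 1)
slope8, intercept8 = np.polyfit(x8, y8, 1)

print(slope7, intercept7)  # approximately 0.25 and 2.1429, matching statsmodels
print(slope8, intercept8)  # approximately 0.4286 and 1.25
```

Neither unregularized fit reproduces Spark's 0.2496 and 1.8766, which is consistent with the non-zero regParam shrinking the slope on the eight-row data. Refitting in Spark with regParam=0.0 on the same seven rows would be the fairer comparison.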
