
Can't figure out how to use LinearRegression in PySpark 1.6 with Python 2.7

My Spark version is 1.6.
My Python version is 2.7.

My data is shown below:

x = [300,400,500,500,800,1000,1000,1300]  
y = [9500,10300,11000,12000,12400,13400,14500,15300]


+----+-----+
|   x|    y|
+----+-----+
| 300| 9500|
| 400|10300|
| 500|11000|
| 500|12000|
| 800|12400|
|1000|13400|
|1000|14500|
|1300|15300|
+----+-----+

My code, which does not work:

from pyspark.mllib.linalg import Vectors
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression 
sqlContext = SQLContext(sc)
#my data
x = [300,400,500,500,800,1000,1000,1300]
y = [9500,10300,11000,12000,12400,13400,14500,15300]

df = pd.DataFrame({'x':x, 'y':y})
df_spark=sqlCtx.createDataFrame(df)

lr = LinearRegression(maxIter=50, regParam=0.0, solver="normal", weightCol="weight")
model = lr.fit(df)

I want to run like this example:

>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
...     (1.0, 2.0, Vectors.dense(1.0)),
...     (0.0, 2.0, Vectors.sparse(1, [], []))], ["label", "weight", "features"])
>>> lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal", weightCol="weight")
>>> model = lr.fit(df)

I can't figure out how to convert my data into the data type used in the example.

+-----+------+---------+
|label|weight| features|
+-----+------+---------+
|  1.0|   2.0|    [1.0]|
|  0.0|   2.0|(1,[],[])|
+-----+------+---------+

Any comments will be much appreciated.
Thank you for your help.
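One possible way to reshape the x and y lists into the (label, weight, features) layout of the example is sketched below. The helper names to_rows and fit_with_spark are hypothetical, and the constant weight of 1.0 is an assumption, since the question supplies no weights. The Spark imports sit inside the function so the row-building step can be checked without a cluster; the fit itself assumes a pyspark 1.6 shell where sqlContext already exists.

```python
def to_rows(x, y, weight=1.0):
    """Build (label, weight, features) tuples from two plain lists.

    y becomes the label, x becomes a one-element feature list.
    The constant weight is an assumption; the question has no weights.
    """
    return [(float(yi), float(weight), [float(xi)]) for xi, yi in zip(x, y)]


def fit_with_spark(sqlContext, rows):
    """Hypothetical helper: fit LinearRegression on rows built by to_rows."""
    # Imported lazily so to_rows can be used without Spark installed.
    from pyspark.mllib.linalg import Vectors
    from pyspark.ml.regression import LinearRegression

    # Wrap each feature list in a dense vector, as the example does.
    data = [(label, w, Vectors.dense(feats)) for label, w, feats in rows]
    df = sqlContext.createDataFrame(data, ["label", "weight", "features"])
    lr = LinearRegression(maxIter=50, regParam=0.0, solver="normal",
                          weightCol="weight")
    return lr.fit(df)


x = [300, 400, 500, 500, 800, 1000, 1000, 1300]
y = [9500, 10300, 11000, 12000, 12400, 13400, 14500, 15300]

rows = to_rows(x, y)
print(rows[0])  # (9500.0, 1.0, [300.0])

# In a pyspark shell where sqlContext is defined:
# model = fit_with_spark(sqlContext, rows)
# print(model.coefficients)
# print(model.intercept)
```

Note that the original attempt passes the pandas DataFrame (df) to lr.fit and calls sqlCtx rather than sqlContext; the sketch avoids pandas entirely and hands Spark a list of tuples.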

from pyspark import  SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.linalg import Vectors
#spark conf
conf = ( SparkConf()
         .setMaster("local[*]")
         .setAppName('pyspark')
        )
# Create the SparkContext first; SQLContext needs it.
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)


df = sqlContext.createDataFrame([
(1.0, Vectors.dense(1.0)),
(3.0, Vectors.dense(2.0)),
(4.0, Vectors.dense(3.0)),
(5.0, Vectors.dense(4.0)),
(2.0, Vectors.dense(5.0)),
(3.0, Vectors.dense(6.0)),
(4.0, Vectors.dense(7.0)),
(0.0, Vectors.sparse(1, [], []))], ["label", "features"])

df.show()  # show() prints the table itself and returns None

lr = LinearRegression(maxIter=50, regParam=1.12)
model = lr.fit(df)
print(model.coefficients)
print(model.intercept)

Outputs:

[0.24955041614]  
1.87657354351

It works!
However, the coefficients and intercept differ from those given by statsmodels.api.OLS.

import numpy as np
import statsmodels.api as sm

Y = [1,3,4,5,2,3,4]
X = range(1,8)
X = sm.add_constant(X)

model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())

Outputs:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.161
Model:                            OLS   Adj. R-squared:                 -0.007
Method:                 Least Squares   F-statistic:                    0.9608
Date:                Fri, 07 Apr 2017   Prob (F-statistic):              0.372
Time:                        02:09:45   Log-Likelihood:                -10.854
No. Observations:                   7   AIC:                             25.71
Df Residuals:                       5   BIC:                             25.60
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          2.1429      1.141      1.879      0.119        -0.789     5.075
x1             0.2500      0.255      0.980      0.372        -0.406     0.906
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.743
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.482
Skew:                           0.206   Prob(JB):                        0.786
Kurtosis:                       1.782   Cond. No.                         10.4
==============================================================================
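A likely explanation for the mismatch, offered as a sketch rather than a definitive diagnosis: the two fits do not see the same problem. The Spark DataFrame contains an eighth row, (0.0, Vectors.sparse(1, [], [])), which is a point at x = 0 with label 0, while the statsmodels call uses only seven points. The Spark model also sets regParam=1.12, whereas OLS applies no regularization. The pure-Python check below (simple_ols is a hypothetical helper using the closed-form formulas for one feature) reproduces the statsmodels numbers from seven points and shows that the reported Spark intercept is consistent with the means of the eight-point data.

```python
def simple_ols(xs, ys):
    """Closed-form least squares for one feature: returns (slope, intercept)."""
    n = float(len(xs))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# The seven points passed to statsmodels.
xs7 = [1, 2, 3, 4, 5, 6, 7]
ys7 = [1, 3, 4, 5, 2, 3, 4]
slope7, intercept7 = simple_ols(xs7, ys7)

# The eight rows in the Spark DataFrame: the sparse vector adds (x=0, y=0).
xs8 = [0] + xs7
ys8 = [0] + ys7
slope8, intercept8 = simple_ols(xs8, ys8)

# Intercept implied by the reported Spark slope and the eight-point means.
spark_slope = 0.24955041614
implied_intercept = sum(ys8) / 8.0 - spark_slope * (sum(xs8) / 8.0)

print("7 points, no penalty: slope=%.4f intercept=%.4f" % (slope7, intercept7))
print("8 points, no penalty: slope=%.4f intercept=%.4f" % (slope8, intercept8))
print("intercept implied by Spark slope: %.11f" % implied_intercept)
```

The seven-point fit gives slope 0.2500 and intercept 2.1429, matching the statsmodels table. The implied intercept matches the Spark output 1.87657354351, which suggests Spark fitted all eight rows with the slope shrunk by the penalty. Dropping the sparse row and setting regParam=0.0 should bring the Spark result close to the OLS one.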
