My Spark version is 1.6.
My Python version is 2.7.
My data is below:
x = [300,400,500,500,800,1000,1000,1300]
y = [9500,10300,11000,12000,12400,13400,14500,15300]
+----+-----+
|   x|    y|
+----+-----+
| 300| 9500|
| 400|10300|
| 500|11000|
| 500|12000|
| 800|12400|
|1000|13400|
|1000|14500|
|1300|15300|
+----+-----+
My code, which does not work:
from pyspark.mllib.linalg import Vectors
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
sqlContext = SQLContext(sc)
#my data
x = [300,400,500,500,800,1000,1000,1300]
y = [9500,10300,11000,12000,12400,13400,14500,15300]
df = pd.DataFrame({'x':x, 'y':y})
df_spark=sqlCtx.createDataFrame(df)
lr = LinearRegression(maxIter=50, regParam=0.0, solver="normal", weightCol="weight")
model = lr.fit(df)
I want to run something like this example:
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
... (1.0, 2.0, Vectors.dense(1.0)),
... (0.0, 2.0, Vectors.sparse(1, [], []))], ["label", "weight", "features"])
>>> lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal", weightCol="weight")
>>> model = lr.fit(df)
I can't figure out how to convert my data into the data type used in the example:
+-----+------+---------+
|label|weight| features|
+-----+------+---------+
|  1.0|   2.0|    [1.0]|
|  0.0|   2.0|(1,[],[])|
+-----+------+---------+
Any comments will be much appreciated.
Thank you for your help.
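One possible way to get from the x and y lists to the (label, weight, features) layout of the example, as a minimal sketch. The helper names to_rows and to_spark_df are made up for illustration, and the constant weight of 1.0 is an assumption (the original data has no weight column). The Spark import sits inside the second function so the row-building part can be checked without Spark.

```python
def to_rows(x, y, weight=1.0):
    """Pair each y (label) with its x (the single feature), cast to float.

    `weight` is an assumed constant; the original data has no weights.
    """
    return [(float(yi), float(weight), [float(xi)]) for xi, yi in zip(x, y)]


def to_spark_df(sqlContext, x, y):
    """Build a DataFrame with label / weight / features columns (hypothetical helper)."""
    # Spark 1.6 location; Spark 2.x and later use pyspark.ml.linalg instead.
    from pyspark.mllib.linalg import Vectors
    rows = [(label, weight, Vectors.dense(features))
            for label, weight, features in to_rows(x, y)]
    return sqlContext.createDataFrame(rows, ["label", "weight", "features"])


x = [300, 400, 500, 500, 800, 1000, 1000, 1300]
y = [9500, 10300, 11000, 12000, 12400, 13400, 14500, 15300]

# Inspect the plain-Python rows before handing them to Spark.
for row in to_rows(x, y):
    print(row)
```

With a live sqlContext, to_spark_df(sqlContext, x, y) should give a DataFrame that can be passed to lr.fit(...) in place of the pandas DataFrame. If no weights are wanted, dropping the weight column and the weightCol argument is probably simpler.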
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.linalg import Vectors
#spark conf
conf = ( SparkConf()
.setMaster("local[*]")
.setAppName('pyspark')
)
# Create the SparkContext first; SQLContext needs it.
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([
(1.0, Vectors.dense(1.0)),
(3.0, Vectors.dense(2.0)),
(4.0, Vectors.dense(3.0)),
(5.0, Vectors.dense(4.0)),
(2.0, Vectors.dense(5.0)),
(3.0, Vectors.dense(6.0)),
(4.0, Vectors.dense(7.0)),
(0.0, Vectors.sparse(1, [], []))], ["label", "features"])
df.show()  # show() prints the table itself and returns None
lr = LinearRegression(maxIter=50, regParam=1.12)
model = lr.fit(df)
print(model.coefficients)
print(model.intercept)
Outputs:
[0.24955041614]
1.87657354351
It works!
But the coefficient and intercept differ from what statsmodels.api.OLS gives:
import numpy as np
import statsmodels.api as sm
Y = [1,3,4,5,2,3,4]
X = range(1,8)
X = sm.add_constant(X)
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())
Outputs:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.161
Model:                            OLS   Adj. R-squared:                 -0.007
Method:                 Least Squares   F-statistic:                    0.9608
Date:                Fri, 07 Apr 2017   Prob (F-statistic):              0.372
Time:                        02:09:45   Log-Likelihood:                -10.854
No. Observations:                   7   AIC:                             25.71
Df Residuals:                       5   BIC:                             25.60
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          2.1429      1.141      1.879      0.119        -0.789     5.075
x1             0.2500      0.255      0.980      0.372        -0.406     0.906
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.743
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.482
Skew:                           0.206   Prob(JB):                        0.786
Kurtosis:                       1.782   Cond. No.                         10.4
==============================================================================