
Spark MLlib linear regression giving really bad results

I've been getting really poor results when trying to do a linear regression using Spark MLlib's LinearRegressionWithSGD in Python.

I looked into similar questions on this topic.

I am well aware that the key is to tweak the parameters just right.

I also understand that stochastic gradient descent won't necessarily find an optimal solution (the way Alternating Least Squares does), because it can get stuck in a local minimum. But at least I would expect to find an OK model.

Here is my setup: I chose to use this example from the Journal of Statistics Education and the corresponding dataset. I know from that paper (and from replicating the results in JMP) that if I use only the numerical fields, I should get something similar to the following equation (with an R^2 of ~44% and an RMSE of ~7400):

Price = 7323 - 0.171 Mileage + 3200 Cylinder - 1463 Doors + 6206 Cruise - 2024 Sound + 3327 Leather
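As a sanity check outside Spark, those coefficients can be reproduced with a plain least-squares fit. Here is a minimal sketch using numpy (assuming kuiper.csv has no header row and the column layout used in the Spark script below):

import numpy as np

# columns: 0=Price, 1=Mileage, 6=Cylinder, 8=Doors, 9=Cruise, 10=Sound, 11=Leather
data = np.loadtxt('kuiper.csv', delimiter=',', usecols=(0, 1, 6, 8, 9, 10, 11))
y = data[:, 0]
X = np.column_stack([np.ones(len(y)), data[:, 1:]])   # prepend an intercept column

beta = np.linalg.lstsq(X, y, rcond=None)[0]           # closed-form OLS solution
pred = X.dot(beta)
rmse = np.sqrt(np.mean((y - pred) ** 2))
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(beta)          # should be close to the coefficients quoted above
print(rmse, r2)      # expected: RMSE ~7400, R^2 ~0.44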

Since I didn't know how to set up the parameters just right, I ran the following brute-force approach:

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.mllib.evaluation import RegressionMetrics

def f(n):
    # shorthand for converting CSV string fields to floats
    return float(n)

if __name__ == "__main__":
    sc = SparkContext(appName="LinearRegressionExample")

    # CSV file format:
    # 0      1        2     3      4     5     6         7      8      9       10     11
    # Price, Mileage, Make, Model, Trim, Type, Cylinder, Liter, Doors, Cruise, Sound, Leather
    raw_data = sc.textFile('file:///home/ccastroh/training/pyspark/kuiper.csv')

    # Grabbing numerical values only (for now)
    data = raw_data \
        .map(lambda x : x.split(','))  \
        .map(lambda x : [f(x[0]), f(x[1]), f(x[6]), f(x[8]), f(x[9]), f(x[10]), f(x[11])])
    points = data.map(lambda x : LabeledPoint(x[0], x[1:])).cache()

    print "Num, Iterations, Step, MiniBatch, RegParam, RegType, Intercept?, Validation?, " + \
        "RMSE, R2, EXPLAINED VARIANCE, INTERCEPT, WEIGHTS..."
    i = 0
    for ite in [10, 100, 1000]:
      for stp in [1, 1e-01, 1e-02, 1e-03, 1e-04, 1e-05, 1e-06, 1e-07, 1e-08, 1e-09, 1e-10]:
        for mini in [0.2, 0.4, 0.6, 0.8, 1.0]:
          for regP in [0.0, 0.1, 0.01, 0.001]:
            for regT in [None, 'l1', 'l2']:
              for intr in [True]:
                for vald in [False, True]:
                  i += 1

                  message = str(i) + \
                      "," + str(ite) + \
                      "," + str(stp) + \
                      "," + str(mini) + \
                      "," + str(regP) + \
                      "," + str(regT) + \
                      "," + str(intr) + \
                      "," + str(vald)

                  model = LinearRegressionWithSGD.train(points, iterations=ite, step=stp, \
                      miniBatchFraction=mini, regParam=regP, regType=regT, intercept=intr, \
                      validateData=vald)

                  # score the training data itself: (prediction, label) pairs
                  predictions_observations = points \
                      .map(lambda p : (float(model.predict(p.features)), p.label)).cache()
                  metrics = RegressionMetrics(predictions_observations)
                  message += "," + str(metrics.rootMeanSquaredError) \
                     + "," + str(metrics.r2) \
                     + "," + str(metrics.explainedVariance)

                  message += "," + str(model.intercept)
                  for weight in model.weights:
                      message += "," + str(weight)

                  print(message)
    sc.stop()

As you can see, I basically ran 3960 different parameter combinations (3 × 11 × 5 × 4 × 3 × 1 × 2). In none of them did I get anything that remotely resembles the formula from the paper or from JMP. Here are some highlights:

  • In a lot of the runs I got NaN for both the intercept and the weights.
  • The highest R^2 I got was -0.89; I didn't even know R^2 could be negative. It turns out that, since R^2 = 1 - SS_res/SS_tot, it goes negative whenever the residual sum of squares exceeds the total sum of squares, i.e. whenever the model fits worse than a horizontal line at the mean of the labels.
  • The lowest RMSE I got was 13600, which is way worse than the expected ~7400.

I also tried normalizing the values so that they are in the [0, 1] range, but that didn't help either.
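For reference, the normalization looked roughly like this (a minimal sketch, not the exact script I ran; it reuses `data` and `sc` from the script above and collects the small dataset to the driver):

import numpy as np

arr = np.array(data.collect())               # all rows as a float matrix
mins, maxs = arr.min(axis=0), arr.max(axis=0)
norm = (arr - mins) / (maxs - mins)          # rescale every column into [0, 1]
norm_points = sc.parallelize(
    [LabeledPoint(row[0], row[1:]) for row in norm]).cache()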

Does anyone have any idea how to get a linear regression model that is half decent? Am I missing something?

I have a similar problem. I used DecisionTree and RandomForest regression, which work fine, although they are not great at producing continuous labels if you want a very accurate solution.

I then tested linear regression like you did, with multiple values for each parameter and on multiple datasets, and didn't get any solution that comes remotely close to the real values. I also tried using StandardScaler for feature scaling before training the model, but the results were not satisfying at all either. :-(
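For reference, this is roughly the tree-based setup that worked better for me (a sketch with illustrative, untuned parameter values; `points` is the same RDD of LabeledPoints as in the question):

from pyspark.mllib.tree import DecisionTree, RandomForest

dt_model = DecisionTree.trainRegressor(
    points, categoricalFeaturesInfo={}, impurity='variance', maxDepth=5)
rf_model = RandomForest.trainRegressor(
    points, categoricalFeaturesInfo={}, numTrees=50, seed=42)

# Tree models are JVM-backed, so predict on an RDD of feature vectors
# rather than calling model.predict inside a map:
predictions = dt_model.predict(points.map(lambda p: p.features))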

EDIT: Setting intercept=True might solve the problem.
