Spark MLlib linear regression model intercept is always 0.0?

Question

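I have just started using ML and Apache Spark, so I have been trying out linear regression based on the Spark examples. I can't seem to generate a proper model for any data other than the sample in the example, and the intercept is always 0.0 regardless of the input data.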

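I have prepared a simple training data set based on the function:

y = (2 * x1) + (3 * x2) + 4

i.e. I would expect the intercept to be 4 and the weights to be (2, 3).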
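A data set of this shape could be generated with a few lines of plain Scala. This is only a sketch: the object name `DummyData`, the grid of x values, and the `y,x1,x2` line layout (inferred from how the parsing code in the question splits each line) are assumptions, not the original file.

```scala
object DummyData {
  // Label function from the question: y = (2 * x1) + (3 * x2) + 4
  def label(x1: Double, x2: Double): Double = 2 * x1 + 3 * x2 + 4

  // One CSV line per grid point, in "y,x1,x2" order (assumed file layout)
  def lines(maxX: Int): Seq[String] =
    for {
      x1 <- 1 to maxX
      x2 <- 1 to maxX
    } yield s"${label(x1, x2)},${x1.toDouble},${x2.toDouble}"
}
```

For example, `DummyData.lines(3)` yields nine lines, the first being `9.0,1.0,1.0`.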

If I run LinearRegressionWithSGD.train(...) on the raw data, the model is:

Model intercept: 0.0, weights: [NaN,NaN]

and the predictions are all NaN:

Features: [1.0,1.0], Predicted: NaN, Actual: 9.0
Features: [1.0,2.0], Predicted: NaN, Actual: 12.0

and so on.
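NaN weights like these are often a sign that gradient descent diverged, which can happen when the step size is too large for unscaled features. That cause is an assumption here, not something confirmed in the thread. The sketch below is plain-Scala batch gradient descent on a one-weight toy problem, not Spark's optimizer, and the name `DivergenceSketch` is hypothetical.

```scala
object DivergenceSketch {
  // Unscaled toy inputs 1..10 with labels y = 2 * x (one weight, no intercept)
  val xs: Seq[Double] = (1 to 10).map(_.toDouble)
  val ys: Seq[Double] = xs.map(2 * _)

  // Plain batch gradient descent on mean squared error; returns the final weight
  def fit(step: Double, iterations: Int = 1000): Double = {
    var w = 0.0
    for (_ <- 1 to iterations) {
      val grad = xs.zip(ys).map { case (x, y) => (w * x - y) * x }.sum / xs.length
      w -= step * grad
    }
    w
  }
}
```

With these inputs, `DivergenceSketch.fit(0.2)` overshoots on every update until the weight overflows and becomes NaN, while `DivergenceSketch.fit(0.01)` settles near 2.0.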

If I scale the data first, I get:

Model intercept: 0.0, weights: [17.407863391511754,2.463212481736855]

Features: [1.0,1.0], Predicted: 19.871075873248607, Actual: 9.0
Features: [1.0,2.0], Predicted: 22.334288354985464, Actual: 12.0
Features: [1.0,3.0], Predicted: 24.797500836722318, Actual: 15.0

and so on.

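Either I'm doing something wrong, or I don't understand what the output of this model should be, so can anyone suggest where I might be going wrong?

My code is as follows: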

  import org.apache.spark.mllib.feature.StandardScaler
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

  // Load and parse the dummy data (y, x1, x2) for y = (2*x1) + (3*x2) + 4
  // i.e. intercept should be 4, weights (2, 3)?
  val data = sc.textFile("data/dummydata.txt")

  // LabeledPoint is (label, [features])
  val parsedData = data.map { line =>
    val parts = line.split(',')
    val label = parts(0).toDouble
    val features = Array(parts(1), parts(2)).map(_.toDouble)
    LabeledPoint(label, Vectors.dense(features))
  }

  // Scale the features
  val scaler = new StandardScaler(withMean = true, withStd = true)
                   .fit(parsedData.map(x => x.features))
  val scaledData = parsedData
                  .map(x => 
                  LabeledPoint(x.label, 
                     scaler.transform(Vectors.dense(x.features.toArray))))

  // Building the model: SGD = stochastic gradient descent
  val numIterations = 1000
  val step = 0.2
  val model = LinearRegressionWithSGD.train(scaledData, numIterations, step)

  println(s">>>> Model intercept: ${model.intercept}, weights: ${model.weights}")

  // Evaluate model on training examples
  val valuesAndPreds = scaledData.map { point =>
    val prediction = model.predict(point.features)
    (point.label, point.features, prediction)
  }
  // Print out features, actual and predicted values...
  valuesAndPreds.take(10).foreach({case (v, f, p) => 
      println(s"Features: ${f}, Predicted: ${p}, Actual: ${v}")})

Answer 1

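@Noah: Thanks - your suggestion prompted me to take another look, and I found some example code here that lets you generate the intercept and also set other parameters, such as the number of iterations, through the optimizer.

Here is my modified model generation code, which seems to work fine on my dummy data: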

  // Building the model: SGD = stochastic gradient descent:
  // Need to setIntercept = true, and seems only to work with scaled data 
  val numIterations = 600
  val stepSize = 0.1
  val algorithm = new LinearRegressionWithSGD()
  algorithm.setIntercept(true)
  algorithm.optimizer
    .setNumIterations(numIterations)
    .setStepSize(stepSize)

  val model = algorithm.run(scaledData)

It still seems to need scaled data rather than raw data as input, but that is fine for my purposes.
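For reference, scaling with both mean and standard deviation amounts to standardizing each feature column. The sketch below does this in plain Scala for a single column. The name `ScalingSketch` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption about what the scaler computes, so treat it as an illustration rather than Spark's implementation.

```scala
object ScalingSketch {
  // Standardize one feature column: subtract the mean, then divide by the
  // sample standard deviation (n - 1 denominator, an assumption here).
  def standardize(column: Seq[Double]): Seq[Double] = {
    val n = column.length
    val mean = column.sum / n
    val variance = column.map(v => (v - mean) * (v - mean)).sum / (n - 1)
    val std = math.sqrt(variance)
    column.map(v => (v - mean) / std)
  }
}
```

For example, `ScalingSketch.standardize(Seq(1.0, 2.0, 3.0))` gives `-1.0, 0.0, 1.0`: the column is centered on zero, so any constant offset in the labels has to be picked up by the intercept.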

Answer 2

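The train method you are using is a shortcut that sets the intercept to zero and does not try to find one. If you use the underlying class, you can get a non-zero intercept: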

val model = new LinearRegressionWithSGD(step, numIterations, 1.0).
    setIntercept(true).
    run(scaledData)

This should now give you an intercept.
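The effect of fitting or not fitting an intercept can be illustrated without Spark. The sketch below is plain-Scala batch gradient descent on centered toy inputs, not Spark's SGD. The name `InterceptSketch`, the 3 x 3 grid of inputs, and the default step size and iteration count are assumptions chosen for illustration.

```scala
object InterceptSketch {
  // Centered toy inputs; labels follow y = (2 * x1) + (3 * x2) + 4
  val xs: Seq[Array[Double]] =
    for (a <- Seq(-1.0, 0.0, 1.0); b <- Seq(-1.0, 0.0, 1.0)) yield Array(a, b)
  val ys: Seq[Double] = xs.map(x => 2 * x(0) + 3 * x(1) + 4)

  // Batch gradient descent on mean squared error; returns (intercept, weights).
  // When fitIntercept is false the intercept is never updated and stays 0.0.
  def fit(fitIntercept: Boolean,
          iterations: Int = 200,
          step: Double = 0.5): (Double, Array[Double]) = {
    val w = Array(0.0, 0.0)
    var b = 0.0
    val n = xs.length
    for (_ <- 1 to iterations) {
      // Residuals of the current model on every training point
      val errs = xs.zip(ys).map { case (x, y) => w(0) * x(0) + w(1) * x(1) + b - y }
      val g0 = xs.zip(errs).map { case (x, e) => e * x(0) }.sum / n
      val g1 = xs.zip(errs).map { case (x, e) => e * x(1) }.sum / n
      w(0) -= step * g0
      w(1) -= step * g1
      if (fitIntercept) b -= step * errs.sum / n
    }
    (b, w)
  }
}
```

On this toy data, `InterceptSketch.fit(fitIntercept = true)` recovers an intercept close to 4 and weights close to (2, 3). With `fitIntercept = false` the weights still approach (2, 3), but the intercept stays at 0.0, so every prediction is off by about 4.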


Answer 1: score 11, 2014-10-09 09:52:00

Answer 2: score 9, accepted, 2014-10-08 15:03:59
