
SparkML MultilayerPerceptron error: java.lang.ArrayIndexOutOfBoundsException

I have the following model that I would like to estimate using SparkML's MultilayerPerceptronClassifier():

val formula = new RFormula()
  .setFormula("vtplus15predict~ vhisttplus15 + vhistt + vt + vtminus15 + Time + Length + Day")
  .setFeaturesCol("features")
  .setLabelCol("label")

val transformed = formula.fit(data).transform(data)

Note: features is a vector and label is a Double:

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = false)
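
The train DataFrame used below is assumed to come from splitting this transformed data; a minimal sketch (the split itself is not shown in the original):

// Hold out 20% of the rows for evaluation; the MLP below is fit on `train`.
val Array(train, test) = transformed.randomSplit(Array(0.8, 0.2), seed = 1234L)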

I define my MLP estimator as follows:

val layers = Array[Int](6, 5, 8, 1) // I suspect this is where it went wrong

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

// train the model
val model = mlp.fit(train)

Unfortunately, I got the following error:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 11
  at org.apache.spark.ml.classification.LabelConverter$.encodeLabeledPoint(MultilayerPerceptronClassifier.scala:121)
  at org.apache.spark.ml.classification.MultilayerPerceptronClassifier$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
  at org.apache.spark.ml.classification.MultilayerPerceptronClassifier$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
  at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:935)
  at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:950)
  ...

org.apache.spark.ml.classification.LabelConverter$.encodeLabeledPoint(MultilayerPerceptronClassifier.scala:121)

This tells us that an array index is out of bounds in MultilayerPerceptronClassifier.scala. Let's look at the code there:

def encodeLabeledPoint(labeledPoint: LabeledPoint, labelCount: Int): (Vector, Vector) = {
  val output = Array.fill(labelCount)(0.0)
  output(labeledPoint.label.toInt) = 1.0
  (labeledPoint.features, Vectors.dense(output))
}

It performs a one-hot encoding of the labels in the dataset. The ArrayIndexOutOfBoundsException occurs since the output array is too short.
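
To connect this to the stack trace above: with the layers ending in 1, labelCount is 1, so output has length 1 and any label of 1.0 or higher indexes past the end. The 11 in the exception suggests the data contains a label value of 11. A minimal sketch of the same logic:

val labelCount = 1                        // the output layer size from Array(6, 5, 8, 1)
val output = Array.fill(labelCount)(0.0)  // Array(0.0): only index 0 is valid
output(11) = 1.0                          // throws java.lang.ArrayIndexOutOfBoundsException: 11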

Going back through the code, one finds that labelCount is the same as the number of output nodes in the layers array. In other words, the number of output nodes should equal the number of classes. The documentation for the MLP contains the following line:

The number of nodes N in the output layer corresponds to the number of classes.

The solution is therefore to either:

  1. Change the number of nodes in the final layer of the network (the output nodes); a sketch follows the note below.

  2. Reconstruct the data to have the same number of classes as your network output nodes.

Note: The final output layer should always have 2 or more nodes, never 1, since there should be one node per class, and a classification problem with a single class does not make sense.
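
A minimal sketch of option 1, deriving the output layer size from the data instead of hard-coding it. This assumes the transformed DataFrame from the RFormula step is named transformed, and that the labels are the 0-based class indices the MLP expects:

import org.apache.spark.sql.functions.max

// Labels must be 0, 1, ..., N-1, so the class count is max(label) + 1.
val numClasses = transformed.agg(max("label")).head().getDouble(0).toInt + 1

// 6 input nodes (the feature vector length, as in the question); numClasses output nodes.
val layers = Array[Int](6, 5, 8, numClasses)

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)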

Rearrange your dataset: as the error shows, either you have fewer output slots than values in your feature set, or your dataset contains a null set, which prompted the error. I came across this type of error while working on my MLP project. Hope my answer helps you, and thanks for reaching out.

The solution is to first find a local optimum that escapes the ArrayIndexOutOfBoundsException and then use a brute-force search to find the global optimum. Shaido suggested finding n:

For example, val layers = Array[Int](6, 5, 8, n). This assumes the length of the feature vectors is 6. – Shaido

So make n a large integer (n = 100), then manually narrow it down by brute force to a good value (try n = 50, then n = 32 gives the error, and n = 35 is perfect).

Credit to Shaido.

