
RandomForestClassifier for multiclass classification Spark 2.x

I'm trying to use random forest for a multiclass classification using spark 2.1.1

After defining my pipeline as usual, it fails during the indexing stage.

I have a dataframe with many string type columns. I have created a StringIndexer for each of them.

I am creating a Pipeline by chaining the StringIndexers with a VectorAssembler and finally a RandomForestClassifier, followed by a label converter.

I've checked all my columns with distinct().count() to make sure I do not have too many categories and so on...

After some debugging, I found that the indexing of some of the columns triggers the following errors when calling:

val indexer = udf { label: String =>
  if (labelToIndex.contains(label)) {
    labelToIndex(label)
  } else {
    throw new SparkException(s"Unseen label: $label.")
  }
}

Error evaluating method: 'labelToIndex'
Error evaluating method: 'labels'

Then inside the transformation, there is this error when defining the metadata:

Error evaluating method: org$apache$spark$ml$feature$StringIndexerModel$$labelToIndex Method threw 'java.lang.NullPointerException' exception. Cannot evaluate org.apache.spark.sql.types.Metadata.toString()

This is happening because I have null values in some of the columns that I'm indexing.

I could reproduce the error with the following example.

val df = spark.createDataFrame(
  Seq(("asd2s","1e1e",1.1,0), ("asd2s","1e1e",0.1,0), 
      (null,"1e3e",1.2,0), ("bd34t","1e1e",5.1,1), 
      ("asd2s","1e3e",0.2,0), ("bd34t","1e2e",4.3,1))
).toDF("x0","x1","x2","x3")

val indexer = new StringIndexer().setInputCol("x0").setOutputCol("x0idx")

indexer.fit(df).transform(df).show

// java.lang.NullPointerException

https://issues.apache.org/jira/browse/SPARK-11569

https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala

You can use the solution provided here, and on Spark 2.2.0 this issue has been fixed upstream.
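On Spark 2.2.0+, a minimal sketch of the upstream fix in action: `handleInvalid` now also covers nulls (SPARK-11569), and the "keep" option was added alongside "skip" and the default "error". This reuses the question's reproduction data; version-specific behaviour is my reading of the fix, not something stated in the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

val df = spark.createDataFrame(
  Seq(("asd2s","1e1e",1.1,0), ("asd2s","1e1e",0.1,0),
      (null,"1e3e",1.2,0), ("bd34t","1e1e",5.1,1),
      ("asd2s","1e3e",0.2,0), ("bd34t","1e2e",4.3,1))
).toDF("x0","x1","x2","x3")

// Spark 2.2.0+: handleInvalid also applies to nulls.
// "skip" drops rows whose label is null or unseen, "keep" maps them
// to an extra index, and "error" (the default) still throws.
val indexer = new StringIndexer()
  .setInputCol("x0")
  .setOutputCol("x0idx")
  .setHandleInvalid("skip")

indexer.fit(df).transform(df).show() // the null row is filtered out
```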

You can use DataFrame.na.fill(Map("colName1" -> val1, "colName2" -> val2, ...))

Where:

DataFrame is the DataFrame object; "colName" is the name of a column and val is the value used to replace any nulls found in that column.

Apply the feature transformations after filling all nulls.
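Applied to the question's reproduction, a minimal sketch of this workaround (the placeholder category "missing" is an arbitrary choice of mine):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

val df = spark.createDataFrame(
  Seq(("asd2s","1e1e",1.1,0), ("asd2s","1e1e",0.1,0),
      (null,"1e3e",1.2,0), ("bd34t","1e1e",5.1,1),
      ("asd2s","1e3e",0.2,0), ("bd34t","1e2e",4.3,1))
).toDF("x0","x1","x2","x3")

// Replace nulls in x0 with a placeholder category before indexing.
val filled = df.na.fill(Map("x0" -> "missing"))

val indexer = new StringIndexer().setInputCol("x0").setOutputCol("x0idx")

// No NullPointerException: every row now has a valid label.
indexer.fit(filled).transform(filled).show()
```

Note that "missing" becomes a category of its own, which is often preferable to silently dropping the rows.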

You can check the number of nulls in all columns as follows:

for (column <- DataFrame.columns) {
  // DataFrame(column) === null never matches in Spark SQL (comparisons
  // with null evaluate to null), so use isNull instead; add isNaN only
  // for numeric columns.
  println(column + ": " + DataFrame.filter(DataFrame(column).isNull).count())
}

OR

DataFrame.count() gives the total number of rows in the DataFrame. DataFrame.describe() then reports a per-column count of non-null values, so the difference between the two is the number of nulls.
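A sketch of the describe()-based check on the question's data: the "count" row of describe() counts only non-null values, so subtracting it from the total row count gives the number of nulls in a column.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

val df = spark.createDataFrame(
  Seq(("asd2s","1e1e",1.1,0), ("asd2s","1e1e",0.1,0),
      (null,"1e3e",1.2,0), ("bd34t","1e1e",5.1,1),
      ("asd2s","1e3e",0.2,0), ("bd34t","1e2e",4.3,1))
).toDF("x0","x1","x2","x3")

val total = df.count() // total number of rows, including nulls

// describe()'s "count" row reports non-null values per column.
val desc = df.describe("x0")
val nonNull = desc.filter(desc("summary") === "count")
  .select("x0").first().getString(0).toLong

val nulls = total - nonNull // number of nulls in x0
```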
