
RandomForestClassifier for multiclass classification in Spark 2.x

I'm trying to use random forest for multiclass classification using Spark 2.1.1.

After defining my pipeline as usual, it fails during the indexing stage.

I have a DataFrame with many string-type columns. I have created a StringIndexer for each of them.

I am creating a Pipeline by chaining the StringIndexers with a VectorAssembler, and finally a RandomForestClassifier followed by a label converter.
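The question doesn't include the pipeline code itself; the following is a minimal sketch of such a pipeline, assuming a training DataFrame df with hypothetical string feature columns x0 and x1, a numeric column x2, and a string label column label:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}

// One StringIndexer per string column (column names are hypothetical).
val stringCols = Array("x0", "x1")
val indexers = stringCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
}

// Index the label, keeping the fitted model so its labels can be recovered later.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(df)

// Assemble the indexed string columns plus the numeric column into one vector.
val assembler = new VectorAssembler()
  .setInputCols(stringCols.map(c => s"${c}_idx") :+ "x2")
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")

// Convert predicted indices back to the original label strings.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(indexers ++ Array(labelIndexer, assembler, rf, labelConverter))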

I've checked all my columns with distinct().count() to make sure I do not have too many categories, and so on...
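A sketch of that check, assuming df is the input DataFrame:

// Print the number of distinct values per column.
df.columns.foreach { c =>
  println(s"$c: ${df.select(c).distinct().count()} distinct values")
}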

After some debugging, I understood that whenever I start indexing some of the columns, I get the following errors when calling:

// From StringIndexerModel.transform in Spark's StringIndexer.scala (linked below):
val indexer = udf { label: String =>
  if (labelToIndex.contains(label)) {
    labelToIndex(label)
  } else {
    throw new SparkException(s"Unseen label: $label.")
  }
}

Error evaluating method: 'labelToIndex'
Error evaluating method: 'labels'

Then inside the transformation, there is this error when defining the metadata:

Error evaluating method: org$apache$spark$ml$feature$StringIndexerModel$$labelToIndex Method threw 'java.lang.NullPointerException' exception. Cannot evaluate org.apache.spark.sql.types.Metadata.toString()

This is happening because I have nulls in some of the columns that I'm indexing.

I could reproduce the error with the following example.

val df = spark.createDataFrame(
  Seq(("asd2s","1e1e",1.1,0), ("asd2s","1e1e",0.1,0), 
      (null,"1e3e",1.2,0), ("bd34t","1e1e",5.1,1), 
      ("asd2s","1e3e",0.2,0), ("bd34t","1e2e",4.3,1))
).toDF("x0","x1","x2","x3")

val indexer = new StringIndexer().setInputCol("x0").setOutputCol("x0idx")

indexer.fit(df).transform(df).show

// java.lang.NullPointerException

https://issues.apache.org/jira/browse/SPARK-11569

https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala

You can use the solution provided here, and on Spark 2.2.0 the issue has been fixed upstream.
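A minimal sketch of the fixed behavior, assuming Spark 2.2.0+ and reusing the reproduction DataFrame above:

// On 2.2.0+, nulls in the input column are treated as invalid values, so
// handleInvalid = "skip" drops those rows instead of throwing an NPE.
// (2.2.0 also added handleInvalid = "keep", which maps invalid/null values
// to an extra index.)
val indexer = new StringIndexer()
  .setInputCol("x0")
  .setOutputCol("x0idx")
  .setHandleInvalid("skip")

indexer.fit(df).transform(df).show()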

You can use df.na.fill(Map("colName1" -> val1, "colName2" -> val2, ...))

Where:

df is the DataFrame object, "colName" is the name of the column, and val is the value for replacing nulls, if any are found in column "colName".

Use the feature transformations after filling all the nulls.
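For example, applied to the reproduction DataFrame above (the "unknown" placeholder value is an arbitrary choice):

// Replace nulls in the string column before indexing.
val filled = df.na.fill(Map("x0" -> "unknown"))

new StringIndexer()
  .setInputCol("x0")
  .setOutputCol("x0idx")
  .fit(filled)
  .transform(filled)
  .show()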

You can check the number of nulls in all columns as follows:

for (column <- df.columns) {
  // === null never matches in Spark SQL; use isNull (and isNaN for numeric columns).
  val nullCount = df.filter(df(column).isNull || df(column).isNaN).count()
  println(s"$column: $nullCount nulls")
}

OR

df.count() will give you the total number of rows in the DataFrame. The number of nulls can then be judged from the output of df.describe().
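A sketch of that approach:

import org.apache.spark.sql.functions.col

val total = df.count()  // total number of rows
// describe() emits a "count" row holding the number of non-null values per
// column, so nulls in a column = total - that column's count.
df.describe().filter(col("summary") === "count").show()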
