
How to create correct data frame for classification in Spark ML

I am trying to run a Random Forest classification using the Spark ML API, but I am having issues creating the right data frame input for the pipeline.

Here is the sample data:

age,hours_per_week,education,sex,salaryRange
38,40,"hs-grad","male","A"
28,40,"bachelors","female","A"
52,45,"hs-grad","male","B"
31,50,"masters","female","B"
42,40,"bachelors","male","B"

age and hours_per_week are integers, while the other features, including the label salaryRange, are categorical (String).

Loading this csv file (let's call it sample.csv) can be done with the Spark CSV library like this:

val data = sqlContext.csvFile("/home/dusan/sample.csv")

By default all columns are imported as String, so we need to change "age" and "hours_per_week" to Int:

val toInt    = udf[Int, String]( _.toInt)
val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week",toInt(data("hours_per_week")))
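
As a side note, a plain column cast would also work here instead of a udf; a minimal sketch using Spark's Column.cast:

import org.apache.spark.sql.types.IntegerType

// Equivalent type fix without a udf: cast the string columns to Int directly.
val dataFixed = data
  .withColumn("age", data("age").cast(IntegerType))
  .withColumn("hours_per_week", data("hours_per_week").cast(IntegerType))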

Just to check what the schema looks like now:

scala> dataFixed.printSchema
root
 |-- age: integer (nullable = true)
 |-- hours_per_week: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- salaryRange: string (nullable = true)

然后讓我們設置交叉驗證器和管道:

val rf = new RandomForestClassifier()
val pipeline = new Pipeline().setStages(Array(rf)) 
val cv = new CrossValidator().setNumFolds(10).setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)

An error shows up when running this line:

val cmModel = cv.fit(dataFixed)

java.lang.IllegalArgumentException: Field "features" does not exist.

It is possible to set the label column and feature column in RandomForestClassifier, but I have 4 columns as predictors (features), not just one.

How should I organize my data frame so that it has the label and features columns organized correctly?

For your convenience, here is the full code:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.DataFrame

import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.{Vector, Vectors}


object SampleClassification {

  def main(args: Array[String]): Unit = {

    //set spark context
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local");
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    import sqlContext.implicits._
    import com.databricks.spark.csv._

    //load data by using databricks "Spark CSV Library" 
    val data = sqlContext.csvFile("/home/dusan/sample.csv")

    //by default all columns are imported as string so we need to change "age" and  "hours_per_week" to Int
    val toInt    = udf[Int, String]( _.toInt)
    val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week",toInt(data("hours_per_week")))


    val rf = new RandomForestClassifier()

    val pipeline = new Pipeline().setStages(Array(rf))

    val cv = new CrossValidator().setNumFolds(10).setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)

    // this fails with error
    //java.lang.IllegalArgumentException: Field "features" does not exist.
    val cmModel = cv.fit(dataFixed) 
  }

}

Thanks for any help!

Since Spark 1.4 you can use the Transformer org.apache.spark.ml.feature.VectorAssembler. Just provide the column names you want to be features:

val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2", "col3"))
  .setOutputCol("features")

and add it to your pipeline.
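
To make the whole flow concrete for this dataset, here is a minimal sketch of a complete pipeline; the StringIndexer stages and the educationIndex/sexIndex column names are my own additions for illustration, not part of the original answer:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Index the categorical string columns (VectorAssembler needs numeric inputs),
// and index the label; StringIndexer also attaches the nominal metadata
// that tree-based classifiers use to infer the number of classes.
val eduIndexer = new StringIndexer()
  .setInputCol("education").setOutputCol("educationIndex")
val sexIndexer = new StringIndexer()
  .setInputCol("sex").setOutputCol("sexIndex")
val labelIndexer = new StringIndexer()
  .setInputCol("salaryRange").setOutputCol("label")

// Assemble all predictors into a single "features" vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "hours_per_week", "educationIndex", "sexIndex"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val pipeline = new Pipeline()
  .setStages(Array(eduIndexer, sexIndexer, labelIndexer, assembler, rf))

With the indexers and assembler inside the pipeline, the feature preparation happens automatically on each fit, so dataFixed can be passed in directly.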

You just need to make sure that the "features" column in your data frame is of type VectorUDT. If it is not (below, "features" is just a renamed Int column), fitting fails like this:

scala> val df2 = dataFixed.withColumnRenamed("age", "features")
df2: org.apache.spark.sql.DataFrame = [features: int, hours_per_week: int, education: string, sex: string, salaryRange: string]

scala> val cmModel = cv.fit(df2) 
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually IntegerType.
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
    at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:50)
    at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
    at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:118)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
    at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
    at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:164)
    at org.apache.spark.ml.tuning.CrossValidator.transformSchema(CrossValidator.scala:142)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
    at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:107)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:76)

Edit 1

Essentially, your data frame needs two fields: "features" for the feature vector and "label" for the instance label. The label must be of type Double.

To create the "features" field of Vector type, first create a udf as shown below:

val toVec4    = udf[Vector, Int, Int, String, String] { (a,b,c,d) => 
  // map the categorical education strings to numeric codes
  val e3 = c match {
    case "hs-grad" => 0
    case "bachelors" => 1
    case "masters" => 2
  }
  // map sex to a numeric code
  val e4 = d match {case "male" => 0 case "female" => 1}
  Vectors.dense(a, b, e3, e4) 
}

Now, to encode the "label" field as well, create another udf as shown below:

val encodeLabel    = udf[Double, String]( _ match { case "A" => 0.0 case "B" => 1.0} )

Now we transform the original data frame using these two udfs:

val df = dataFixed.withColumn(
  "features",
  toVec4(
    dataFixed("age"),
    dataFixed("hours_per_week"),
    dataFixed("education"),
    dataFixed("sex")
  )
).withColumn("label", encodeLabel(dataFixed("salaryRange"))).select("features", "label")

Note that there can be extra columns/fields present in the data frame; in this case I have selected only features and label:

scala> df.show()
+-------------------+-----+
|           features|label|
+-------------------+-----+
|[38.0,40.0,0.0,0.0]|  0.0|
|[28.0,40.0,1.0,1.0]|  0.0|
|[52.0,45.0,0.0,0.0]|  1.0|
|[31.0,50.0,2.0,1.0]|  1.0|
|[42.0,40.0,1.0,0.0]|  1.0|
+-------------------+-----+

Now it is up to you to set the correct parameters for your learning algorithm to make it work.
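
If that last step is where you are stuck: CrossValidator also expects a parameter grid before fit will run, so a hedged sketch of setting it up (the grid values are illustrative only, not tuned for this data):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Illustrative grid over two RandomForestClassifier params.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.maxDepth, Array(5, 10))
  .addGrid(rf.numTrees, Array(20, 50))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(10)

val cvModel = cv.fit(df)

Note that, depending on the Spark version, tree-based classifiers may also expect nominal metadata on the label column in order to infer the number of classes; a StringIndexer stage on the label would provide that.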

According to the Spark documentation on MLlib random trees, it seems to me that you should define the categorical features map you are using, and the points should be LabeledPoints.

That will tell the algorithm which column should be used for prediction and which ones are the features.

https://spark.apache.org/docs/latest/mllib-decision-tree.html
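
For reference, a minimal sketch of that RDD-based MLlib approach, assuming the same numeric encodings as the sample data above (the encodings and parameter values are illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// Turn each row into a LabeledPoint: label first, feature vector second.
val points = dataFixed.map { row =>
  val edu = row.getAs[String]("education") match {
    case "hs-grad" => 0.0
    case "bachelors" => 1.0
    case "masters" => 2.0
  }
  val sex = if (row.getAs[String]("sex") == "male") 0.0 else 1.0
  val label = if (row.getAs[String]("salaryRange") == "A") 0.0 else 1.0
  LabeledPoint(label, Vectors.dense(
    row.getAs[Int]("age").toDouble,
    row.getAs[Int]("hours_per_week").toDouble,
    edu, sex))
}

// The categorical features map: vector positions 2 and 3 are categorical,
// with 3 and 2 distinct values respectively.
val categoricalFeaturesInfo = Map(2 -> 3, 3 -> 2)

val model = RandomForest.trainClassifier(
  points,
  numClasses = 2,
  categoricalFeaturesInfo = categoricalFeaturesInfo,
  numTrees = 20,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32,
  seed = 42)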

