Spark DataFrame: handling empty strings in OneHotEncoder

Question

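I am importing a CSV file (using spark-csv) into a DataFrame that contains empty String values. When I apply OneHotEncoder, the application crashes with the error requirement failed: Cannot have an empty string for name. Is there a way to work around this?

I can reproduce the error with the example provided on the Spark ml page: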

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""),         //<- original example has "a" here
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)

encoded.show()

This is annoying, because missing/empty values are a very common case.

Thanks in advance, Nikhil

Answer 1

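OneHotEncoder / OneHotEncoderEstimator does not accept an empty string as a name. If the column contains one, you get the following error:

java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
  at scala.Predef$.require(Predef.scala:233)
  at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:33)
  at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:32)
  [...]

This is how I would do it (there are other ways, cf. @Anthony's answer):

I'll create a UDF to handle the empty category: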

import org.apache.spark.sql.functions._

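// Replace empty strings with the placeholder "NA"; every other value passes through unchanged
val processMissingCategory = udf[String, String] { s => if (s == "") "NA" else s }

Then I'll apply the UDF to the column: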

val df = sqlContext.createDataFrame(Seq(
   (0, "a"),
   (1, "b"),
   (2, "c"),
   (3, ""),         //<- original example has "a" here
   (4, "a"),
   (5, "c")
)).toDF("id", "category")
  .withColumn("category", processMissingCategory(col("category")))

df.show
// +---+--------+
// | id|category|
// +---+--------+
// |  0|       a|
// |  1|       b|
// |  2|       c|
// |  3|      NA|
// |  4|       a|
// |  5|       c|
// +---+--------+

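Now you can go back to your transformations:

// For the Spark < 2.3 variant below, import OneHotEncoder instead of OneHotEncoderEstimator
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)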
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// |  0|       a|          0.0|
// |  1|       b|          2.0|
// |  2|       c|          1.0|
// |  3|      NA|          3.0|
// |  4|       a|          0.0|
// |  5|       c|          1.0|
// +---+--------+-------------+

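// Spark < 2.3: OneHotEncoder is a Transformer, so call transform directly
// val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
// val encoded = encoder.transform(indexed)
// Spark 2.3+: OneHotEncoderEstimator is an Estimator, so fit it before transforming
val encoder = new OneHotEncoderEstimator().setInputCols(Array("categoryIndex")).setOutputCols(Array("categoryVec"))
val encoded = encoder.fit(indexed).transform(indexed)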

encoded.show
// +---+--------+-------------+-------------+
// | id|category|categoryIndex|  categoryVec|
// +---+--------+-------------+-------------+
// |  0|       a|          0.0|(3,[0],[1.0])|
// |  1|       b|          2.0|(3,[2],[1.0])|
// |  2|       c|          1.0|(3,[1],[1.0])|
// |  3|      NA|          3.0|    (3,[],[])|
// |  4|       a|          0.0|(3,[0],[1.0])|
// |  5|       c|          1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+
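
The NA row comes out as (3,[],[]) because the encoder drops the last category by default (dropLast = true), so the highest index is represented by an all-zero vector. The following is a minimal plain-Python sketch of that rule, not Spark's implementation; `one_hot` is a hypothetical helper, and the indices are copied from the indexed.show output above.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list; with drop_last the final category maps to all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# Indices as shown by indexed.show: a -> 0, c -> 1, b -> 2, NA -> 3
indices = {"a": 0, "c": 1, "b": 2, "NA": 3}
for label, idx in indices.items():
    print(label, one_hot(idx, num_categories=4))

# Keeping the last category gives NA its own slot
print(one_hot(3, num_categories=4, drop_last=False))  # [0.0, 0.0, 0.0, 1.0]
```

If you want the placeholder to get its own position in the vector, setDropLast(false) on the encoder has this effect.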

EDIT:

@Anthony's solution in Scala:

df.na.replace("category", Map( "" -> "NA")).show
// +---+--------+
// | id|category|
// +---+--------+
// |  0|       a|
// |  1|       b|
// |  2|       c|
// |  3|      NA|
// |  4|       a|
// |  5|       c|
// +---+--------+

I hope this helps!

Answer 2

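Yes, it's a bit tricky, but maybe you can replace the empty string with something that is sure to be different from the other values. Note that I am using the pyspark DataFrameNaFunctions API, but Scala's should be similar.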

df = sqlContext.createDataFrame([(0,"a"), (1,'b'), (2, 'c'), (3,''), (4,'a'), (5, 'c')], ['id', 'category'])
df = df.na.replace('', 'EMPTY', 'category')
df.show()

+---+--------+
| id|category|
+---+--------+
|  0|       a|
|  1|       b|
|  2|       c|
|  3|   EMPTY|
|  4|       a|
|  5|       c|
+---+--------+
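
The replacement only works if the placeholder is not already a real category. Here is a minimal plain-Python sketch of that check, written without Spark so it can be run anywhere; `replace_empty` is a hypothetical helper and "EMPTY" is just the placeholder used above.

```python
def replace_empty(values, placeholder="EMPTY"):
    """Swap empty strings for `placeholder`, refusing one that already occurs in the data."""
    if placeholder in values:
        raise ValueError(f"placeholder {placeholder!r} already occurs in the data")
    return [placeholder if v == "" else v for v in values]

categories = ["a", "b", "c", "", "a", "c"]
print(replace_empty(categories))  # ['a', 'b', 'c', 'EMPTY', 'a', 'c']
```

On a real DataFrame you could run the same kind of check on the distinct values of the column before calling df.na.replace.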

Answer 3

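OneHotEncoder will fail with a NullPointerException if the column contains null, so I extended the udf to translate null values as well:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SQLContext

object OneHotEncoderExample {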
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OneHotEncoderExample Application").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // $example on$
    val df1 = sqlContext.createDataFrame(Seq(
      (0.0, "a"),
      (1.0, "b"),
      (2.0, "c"),
      (3.0, ""),
      (4.0, null),
      (5.0, "c")
    )).toDF("id", "category")


    import org.apache.spark.sql.functions.udf
    // Map empty strings to "NA" and nulls to the string "null" so every row has a non-empty label
    def emptyValueSubstitution = udf[String, String] {
      case "" => "NA"
      case null => "null"
      case value => value
    }
    val df = df1.withColumn("category", emptyValueSubstitution(df1("category")))


    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
    val indexed = indexer.transform(df)
    indexed.show()

    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
      .setDropLast(false)
    val encoded = encoder.transform(indexed)
    encoded.show()
    // $example off$
    sc.stop()
  }
}

3 answers

Answer 1
8, accepted, 2016-01-13 10:53:03

Answer 2
5, 2015-10-19 18:32:17

Answer 3
0, 2016-04-21 13:24:17
