
Spark: Aggregation with dynamic filter on a dataframe in scala

I have a dataframe:

scala> testDf.show()
+------+--------+---------+------------+----------------------------------------+
|    id|    item|    value|  value_name|                               condition|
+------+--------+---------+------------+----------------------------------------+
|    11|    3210|        0|         OFF|                                value==0|
|    12|    3210|        1|         OFF|                                value==0|
|    13|    3210|        0|         OFF|                                value==0|
|    14|    3210|        0|         OFF|                                value==0|
|    15|    3210|        1|         OFF|                                value==0|
|    16|    5440|        5|          ON|                     value>0 && value<10|
|    17|    5440|        0|          ON|                     value>0 && value<10|
|    18|    5440|        6|          ON|                     value>0 && value<10|
|    19|    5440|        7|          ON|                     value>0 && value<10|
|    20|    5440|        0|          ON|                     value>0 && value<10|
|    21|    7780|        A|        TYPE|   Set("A","B").contains(value.toString)|
|    22|    7780|        A|        TYPE|   Set("A","B").contains(value.toString)|
|    23|    7780|        A|        TYPE|   Set("A","B").contains(value.toString)|
|    24|    7780|        C|        TYPE|   Set("A","B").contains(value.toString)|
|    25|    7780|        C|        TYPE|   Set("A","B").contains(value.toString)|
+------+--------+---------+------------+----------------------------------------+

scala> testDf.printSchema
root
 |-- id: string (nullable = true)
 |-- item: string (nullable = true)
 |-- value: string (nullable = true)
 |-- value_name: string (nullable = true)
 |-- condition: string (nullable = true)

I want to drop some rows based on the "condition" column, but I am having trouble.

I tried the test code below, but it does not seem to work.

import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.Row
import scala.collection.mutable

val encoder = RowEncoder(testDf.schema);

testDf.flatMap(row => {
  val result = new mutable.MutableList[Row];
  val setting_value = row.getAs[String]("setting_value").toInt
  val condition = row.getAs[String]("condition").toBoolean
  if (condition){
      result+=row;
  };
  result;
})(encoder).show();

And this is the error:

19/05/30 02:04:31 ERROR TaskSetManager: Task 0 in stage 267.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 267.0 failed 4 times, most recent failure: Lost task 0.3 in stage 267.0 (TID 3763, .compute.internal, executor 1): java.lang.IllegalArgumentException: For input string: "setting_value==0"
        at scala.collection.immutable.StringLike$class.parseBoolean(StringLike.scala:291)
        at scala.collection.immutable.StringLike$class.toBoolean(StringLike.scala:261)
        at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:29)
        at $anonfun$1.apply(<console>:40)
        at $anonfun$1.apply(<console>:37)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I want to keep the rows whose value matches the expression in the condition column. This is the desired result:

+------+--------+---------+------------+----------------------------------------+
|    id|    item|    value|  value_name|                               condition|
+------+--------+---------+------------+----------------------------------------+
|    11|    3210|        0|         OFF|                                value==0|
|    13|    3210|        0|         OFF|                                value==0|
|    14|    3210|        0|         OFF|                                value==0|
|    16|    5440|        5|          ON|                     value>0 && value<10|
|    18|    5440|        6|          ON|                     value>0 && value<10|
|    19|    5440|        7|          ON|                     value>0 && value<10|
|    21|    7780|        A|        TYPE|   Set("A","B").contains(value.toString)|
|    22|    7780|        A|        TYPE|   Set("A","B").contains(value.toString)|
|    23|    7780|        A|        TYPE|   Set("A","B").contains(value.toString)|
+------+--------+---------+------------+----------------------------------------+

If you have a good idea, please help me. Thanks.

In the case above, Spark is trying to cast the String value to a Boolean; it does not evaluate the expression itself.
Expression evaluation has to be done by the user, with an external library or custom code.
The closest thing I could find (although not your exact case) is
How to evaluate a math expression given in string form?
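To see why the cast fails: StringOps.toBoolean accepts only the literal strings "true" and "false" (case-insensitive), so calling it on an expression string throws exactly the IllegalArgumentException from the stack trace above:

"true".toBoolean       // Boolean = true
"value==0".toBoolean   // java.lang.IllegalArgumentException: For input string: "value==0"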

Here is one way to do it, using the Scala reflection API together with a UDF. The UDF handles both cases, int and string values:

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
import org.apache.spark.sql.functions.udf  // needed outside the spark-shell
import spark.implicits._                   // for toDF and the $ column syntax

val tb = currentMirror.mkToolBox()

val df = Seq(("0", "value==0"),
  ("1", "value==0"),
  ("6", """value>0 && value<10"""),
  ("7", """value>0 && value<10"""),
  ("0", """value>0 && value<10"""),
  ("A", """Set("A","B").contains(value.toString)"""),
  ("C", """Set("A","B").contains(value.toString)""")).toDF("value", "condition")

def isAllDigits(x: String) = x.forall(Character.isDigit)

val evalExpressionUDF = udf((value: String, expr: String) => {
  val result = isAllDigits(value) match {
    // int value: splice the raw int literal into the expression before evaluating it
    case true  => tb.eval(tb.parse(expr.replace("value", s"""${value.toInt}""")))
    // string value: splice the value in as a double-quoted string literal instead
    case false => tb.eval(tb.parse(expr.replace("value", s""""${value}"""")))
  }

  result.asInstanceOf[Boolean]
})

df.withColumn("eval", evalExpressionUDF($"value", $"condition"))
  .where($"eval" === true)
  .show(false)

evalExpressionUDF handles two cases, as the standalone check after this list demonstrates:

  • int: substitute the actual int value into the expression, then execute the resulting code string with mkToolBox
  • string: enclose the string value in "", substitute the double-quoted string into the expression, then execute the code string
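For example, for value 6 and condition value>0 && value<10 the UDF ends up parsing and evaluating the code string 6>0 && 6<10, and for value A and the Set condition it evaluates Set("A","B").contains("A".toString). Here is a minimal check of that mechanism outside Spark (the only assumption is that scala-compiler is on the classpath, which it is in the spark-shell):

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

val tb = currentMirror.mkToolBox()

// int case: "value" has been replaced by the raw literal 6
tb.eval(tb.parse("6>0 && 6<10"))                             // Any = true

// string case: "value" has been replaced by the quoted literal "A"
tb.eval(tb.parse("""Set("A","B").contains("A".toString)""")) // Any = true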

Output:

+-----+-------------------------------------+----+
|value|condition                            |eval|
+-----+-------------------------------------+----+
|0    |value==0                             |true|
|6    |value>0 && value<10                  |true|
|7    |value>0 && value<10                  |true|
|A    |Set("A","B").contains(value.toString)|true|
+-----+-------------------------------------+----+

PS: I am aware that the performance of the above solution may be poor, since it invokes reflection, but I am not aware of an alternative.
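If the reflection cost ever becomes a problem, one possible mitigation (a sketch only, untested on a cluster; it assumes the conditions reference the single variable name value, as in the question) is to compile each distinct condition string once into a function and reuse it across rows, instead of re-parsing per row:

import scala.collection.concurrent.TrieMap
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
import org.apache.spark.sql.functions.udf

// Hypothetical helper: each distinct condition is compiled at most twice
// (once with `value` bound as Int, once as String) and cached afterwards.
// Note: ToolBox is not documented as thread-safe, so a real version would
// likely need synchronization around tb.parse/tb.eval.
object ConditionCache {
  private lazy val tb = currentMirror.mkToolBox()
  private val intCache = TrieMap.empty[String, Int => Boolean]
  private val strCache = TrieMap.empty[String, String => Boolean]

  def check(value: String, condition: String): Boolean =
    if (value.forall(Character.isDigit))
      intCache.getOrElseUpdate(condition,
        tb.eval(tb.parse(s"(value: Int) => $condition"))
          .asInstanceOf[Int => Boolean])(value.toInt)
    else
      strCache.getOrElseUpdate(condition,
        tb.eval(tb.parse(s"(value: String) => $condition"))
          .asInstanceOf[String => Boolean])(value)
}

val cachedEvalUDF = udf((value: String, condition: String) =>
  ConditionCache.check(value, condition))

Since the object is initialized lazily on each executor JVM, the cache is shared by all tasks running there, and the UDF closure captures nothing non-serializable.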
