
Spark: Aggregation with dynamic filter on a dataframe in Scala

I have a dataframe like:

scala> testDf.show()
+------+--------+---------+------------+----------------------------------------+
|    id|    item|    value|  value_name|                               condition|
+------+--------+---------+------------+----------------------------------------+
|    11|    3210|        0|         OFF|                                value==0|
|    12|    3210|        1|         OFF|                                value==0|
|    13|    3210|        0|         OFF|                                value==0|
|    14|    3210|        0|         OFF|                                value==0|
|    15|    3210|        1|         OFF|                                value==0|
|    16|    5440|        5|          ON|                     value>0 && value<10|
|    17|    5440|        0|          ON|                     value>0 && value<10|
|    18|    5440|        6|          ON|                     value>0 && value<10|
|    19|    5440|        7|          ON|                     value>0 && value<10|
|    20|    5440|        0|          ON|                     value>0 && value<10|
|    21|    7780|        A|        TYPE|   Set("A","B").contains(value.toString)|
|    22|    7780|        A|        TYPE|   Set("A","B").contains(value.toString)|
|    23|    7780|        A|        TYPE|   Set("A","B").contains(value.toString)|
|    24|    7780|        C|        TYPE|   Set("A","B").contains(value.toString)|
|    25|    7780|        C|        TYPE|   Set("A","B").contains(value.toString)|
+------+--------+---------+------------+----------------------------------------+

scala> testDf.printSchema
root
 |-- id: string (nullable = true)
 |-- item: string (nullable = true)
 |-- value: string (nullable = true)
 |-- value_name: string (nullable = true)
 |-- condition: string (nullable = true)

I want to remove some rows based on the 'condition' column, but I am having trouble.

I tried the test code below, but it does not seem to work properly.

import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.Row
import scala.collection.mutable

val encoder = RowEncoder(testDf.schema)

testDf.flatMap(row => {
  val result = new mutable.MutableList[Row]
  val value = row.getAs[String]("value").toInt
  // This fails: the condition column holds an expression string,
  // not a boolean literal, so toBoolean throws at runtime.
  val condition = row.getAs[String]("condition").toBoolean
  if (condition) {
    result += row
  }
  result
})(encoder).show()

And this is the error:

19/05/30 02:04:31 ERROR TaskSetManager: Task 0 in stage 267.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 267.0 failed 4 times, most recent failure: Lost task 0.3 in stage 267.0 (TID 3763, .compute.internal, executor 1): java.lang.IllegalArgumentException: For input string: "value==0"
        at scala.collection.immutable.StringLike$class.parseBoolean(StringLike.scala:291)
        at scala.collection.immutable.StringLike$class.toBoolean(StringLike.scala:261)
        at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:29)
        at $anonfun$1.apply(<console>:40)
        at $anonfun$1.apply(<console>:37)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I want to keep the rows where the value satisfies the expression in the condition column. This is the desired result:

+------+--------+---------+------------+----------------------------------------+
|    id|    item|    value|  value_name|                               condition|
+------+--------+---------+------------+----------------------------------------+
|    11|    3210|        0|         OFF|                                value==0|
|    13|    3210|        0|         OFF|                                value==0|
|    14|    3210|        0|         OFF|                                value==0|
|    16|    5440|        5|          ON|                     value>0 && value<10|
|    18|    5440|        6|          ON|                     value>0 && value<10|
|    19|    5440|        7|          ON|                     value>0 && value<10|
|    21|    7780|        A|        TYPE|   Set("A","B").contains(value.toString)|
|    22|    7780|        A|        TYPE|   Set("A","B").contains(value.toString)|
|    23|    7780|        A|        TYPE|   Set("A","B").contains(value.toString)|
+------+--------+---------+------------+----------------------------------------+

Please help me if you have a good idea. Thanks.

In the above case, Spark is trying to convert the String value to a Boolean; it is not evaluating the expression itself. Expression evaluation has to be done by the user, using an external library or custom code. The closest existing question I could find (though not the exact scenario) is How to evaluate a math expression given in string form?
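As a minimal illustration of the "custom code" route, here is a hedged sketch of my own (not from the linked question) that hard-codes an evaluator for exactly the three condition patterns appearing in this data; it is not a general expression interpreter:

import org.apache.spark.sql.functions.udf
import spark.implicits._

// Hypothetical hand-rolled evaluator covering only the three condition
// strings in this question; any other condition evaluates to false.
val matchesCondition = udf((value: String, condition: String) => condition match {
  case "value==0"            => value == "0"
  case "value>0 && value<10" => value.nonEmpty && value.forall(_.isDigit) &&
                                value.toInt > 0 && value.toInt < 10
  case """Set("A","B").contains(value.toString)""" => Set("A", "B").contains(value)
  case _                     => false
})

testDf.filter(matchesCondition($"value", $"condition")).show()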

Here is one way, using the Scala reflection API within a UDF. The UDF handles both int and string values:

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Toolbox that can parse and evaluate Scala source code at runtime
val tb = currentMirror.mkToolBox()

val df = Seq(
  ("0", "value==0"),
  ("1", "value==0"),
  ("6", """value>0 && value<10"""),
  ("7", """value>0 && value<10"""),
  ("0", """value>0 && value<10"""),
  ("A", """Set("A","B").contains(value.toString)"""),
  ("C", """Set("A","B").contains(value.toString)""")).toDF("value", "condition")

def isAllDigits(x: String) = x.forall(Character.isDigit)

val evalExpressionUDF = udf((value: String, expr: String) => {
  // Substitute the literal value into the condition string, then
  // compile and evaluate the resulting Scala expression.
  val result = isAllDigits(value) match {
    case true  => tb.eval(tb.parse(expr.replace("value", s"${value.toInt}")))
    case false => tb.eval(tb.parse(expr.replace("value", s""""${value}"""")))
  }

  result.asInstanceOf[Boolean]
})

df.withColumn("eval", evalExpressionUDF($"value", $"condition"))
  .where($"eval" === true)
  .show(false)

Cases handled by evalExpressionUDF:

  • int: replace value in the expression with the actual int value, then execute the resulting code string with mkToolBox
  • string: enclose the string value in "", replace value in the expression with the double-quoted string, then execute the code string

Output:

+-----+-------------------------------------+----+
|value|condition                            |eval|
+-----+-------------------------------------+----+
|0    |value==0                             |true|
|6    |value>0 && value<10                  |true|
|7    |value>0 && value<10                  |true|
|A    |Set("A","B").contains(value.toString)|true|
+-----+-------------------------------------+----+
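
Applied to the original testDf from the question (a sketch, assuming the UDF above is in scope), the same approach should yield the desired nine rows:

testDf.withColumn("eval", evalExpressionUDF($"value", $"condition"))
  .where($"eval" === true)
  .drop("eval")
  .show(false)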

PS: I know that the performance of the above solution may be poor since it invokes reflection, although I am not aware of an alternative.
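
One possible mitigation (my own suggestion, not part of the original answer): since the number of distinct (value, condition) pairs here is small, the reflective evaluation could be memoized per executor so each pair is compiled at most a handful of times. A hedged sketch, reusing tb and isAllDigits from above:

import scala.collection.concurrent.TrieMap

// Hypothetical per-executor cache of evaluation results,
// keyed by (value, condition)
val evalCache = TrieMap.empty[(String, String), Boolean]

val cachedEvalUDF = udf((value: String, expr: String) =>
  evalCache.getOrElseUpdate((value, expr), {
    val code =
      if (isAllDigits(value)) expr.replace("value", value.toInt.toString)
      else expr.replace("value", "\"" + value + "\"")
    // ToolBox is not thread-safe, so guard the reflective evaluation
    tb.synchronized { tb.eval(tb.parse(code)).asInstanceOf[Boolean] }
  }))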
