Filter count a Spark DataFrame
I have two dataframes as below. I read the logic DF from a MySQL table.
Logic DF:
slNo | filterCondtion |
-----------------------
1    | age > 100      |
2    | age > 50       |
3    | age > 10       |
4    | age > 20       |
InputDF - read from a file:
age | name    |
---------------
11  | suraj   |
22  | surjeth |
33  | sam     |
43  | ram     |
I want to apply the filter statements from the logic dataframe and add a column with the count for each of those filters.
Expected output:
slNo | filterCondtion | count |
-------------------------------
1    | age > 100      | 10    |
2    | age > 50       | 2     |
3    | age > 10       | 5     |
4    | age > 20       | 6     |
-------------------------------
Code I have tried:
val LogicDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testDB")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "logic_table")
  .option("user", "root")
  .option("password", "password")
  .load()

def filterCount(str: String): Long = {
  val counte = inputDF.where(str).count()
  counte
}

val filterCountUDF = udf[Long, String](filterCount)

LogicDF.withColumn("count", filterCountUDF(col("filterCondtion")))
Error trace:
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string) => bigint)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.Dataset.where(Dataset.scala:1525)
at filterCount(<console>:28)
at $anonfun$1.apply(<console>:25)
at $anonfun$1.apply(<console>:25)
... 21 more
Any alternative approach would also work... Thanks in advance.
This will work as long as your logicDF is small enough to be collected onto the driver.
Collect your logic into an Array[(Int, String)] like this:
val rules = logicDF.collect().map { r: Row =>
  val slNo = r.getAs[Int](0)
  val condition = r.getAs[String](1)
  (slNo, condition)
}
Build a new Column from the condition values, chaining the rules into a when Column. For this, use a Scala fold, for example:
val unused = when(lit(false), lit(false))
val filters: Column = rules.foldLeft(unused) {
  case (acc: Column, (slNo: Int, cond: String)) =>
    acc.when(col("slNo") === slNo, expr(cond))
}
// You will get something like:
// when(col("slNo") === 1, expr("age > 100"))
//   .when(col("slNo") === 2, expr("age > 50"))
//   ...
Take the Cartesian product of the two DataFrames with a join, so that every rule is applied to every row of the data:
val joinDF = logicDF.join(inputDF, lit(true), "inner") // inner or whatever
Filter using the previous Column holding the chained conditions:
val withRulesDF = joinDF.filter(filters)
Group and count:
val resultDF = withRulesDF
  .groupBy("slNo", "filterCondtion")
  .agg(count("*") as "count")
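Putting the steps above together as one sketch (against the sample data from the question): note that a rule matching no rows at all, such as age > 100 here, produces no group and so is simply absent from resultDF. If you want those rules shown with count 0, a left join back to logicDF (a hypothetical extra step, not part of the original answer) restores them:

```scala
import org.apache.spark.sql.{Column, Row}
import org.apache.spark.sql.functions._

// Step 1: collect the rules onto the driver
val rules = logicDF.collect().map { r: Row =>
  (r.getAs[Int](0), r.getAs[String](1))
}

// Step 2: chain the rules into a single conditional Column
val filters: Column = rules.foldLeft(when(lit(false), lit(false))) {
  case (acc, (slNo, cond)) => acc.when(col("slNo") === slNo, expr(cond))
}

// Steps 3-5: Cartesian product, filter, group and count
val resultDF = logicDF
  .join(inputDF, lit(true), "inner")
  .filter(filters)
  .groupBy("slNo", "filterCondtion")
  .agg(count("*") as "count")

// Optional: bring back rules with zero matches as count = 0
val fullDF = logicDF
  .join(resultDF.select("slNo", "count"), Seq("slNo"), "left")
  .na.fill(0, Seq("count"))
```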
package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Case classes defined at the top level (outside the object), so that
// Spark can derive encoders for toDF().
case class LogicFilter(slNo: Int, filterCondition: String)
case class Data(age: Int, name: String)

object LogicFilterDataFrame extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val logicDF = Seq(
    LogicFilter(1, "age > 100"),
    LogicFilter(2, "age > 50"),
    LogicFilter(3, "age > 10"),
    LogicFilter(4, "age > 20")
  ).toDF()

  val dataDF = Seq(
    Data(11, "suraj"),
    Data(22, "surjeth"),
    Data(33, "sam"),
    Data(43, "ram")
  ).toDF()

  // A Dataset cannot be referenced inside a UDF: its SparkSession is null
  // once the closure is deserialized on an executor, which is exactly the
  // NullPointerException in the question. So evaluate the conditions on
  // the driver instead and join the counts back.
  val counts = logicDF.collect().map { row =>
    val slNo = row.getAs[Int]("slNo")
    val condition = row.getAs[String]("filterCondition")
    (slNo, dataDF.filter(condition).count())
  }.toSeq.toDF("slNo", "count")

  val resDF = logicDF.join(counts, Seq("slNo"))
  resDF.show(false)
}