
spark dataframe filter function not working

I am new to Spark. We have a project that reads data from HBase and saves it to an RDD. The DataFrame count is 5,280,000; here is the code:

val df = spark.createDataFrame(rddDump, schema)

def sampledOrNot = udf((count: Int) => {
  if (count < TEN_K_SELLER_ITEM_BENCH) {
    1                                         // small sellers are always kept
  } else {
    val randomId = random.nextLong(0, 1000000000000L)
    var targetValue = 10000 / count.toDouble  // intended keep probability
    var base = 1
    while (targetValue < 1) {                 // scale the target up to >= 1
      targetValue = targetValue * base
      base = base * 10
    }
    if (randomId % base <= (targetValue.intValue() + 1)) 1 else 0
  }
})
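(For reference, the thresholding arithmetic inside this UDF can be traced outside Spark. A minimal Python sketch of the same logic; `TEN_K_SELLER_ITEM_BENCH` is assumed to be 10000 here, since in the real project it is defined elsewhere:

```python
# Pure-Python model of the sampling UDF's arithmetic (no Spark involved).
# TEN_K_SELLER_ITEM_BENCH = 10000 is an assumption for illustration.
TEN_K_SELLER_ITEM_BENCH = 10000

def sampled_or_not(count: int, random_id: int) -> int:
    if count < TEN_K_SELLER_ITEM_BENCH:
        return 1                       # small sellers are always kept
    target_value = 10000 / count       # intended keep probability
    base = 1
    while target_value < 1:            # scale the target up to >= 1
        target_value *= base
        base *= 10
    # keep the row when the random id falls under the scaled threshold
    return 1 if random_id % base <= int(target_value) + 1 else 0

# For count = 1,000,000 the loop ends with base = 1000 and
# target_value = 10.0, so a row is kept when random_id % 1000 <= 11,
# i.e. about 1.2% of rows instead of the intended 1%.
keep_rate = sum(sampled_or_not(1_000_000, r) for r in range(1000)) / 1000
```
)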

val sampleBasedAll = df.withColumn("sampled", sampledOrNot(col("count")))
sampleBasedAll.repartition(10).write.option("header", value = true).option("compression", "gzip").csv("/sampleBasedAll")

val sampledDF = sampleBasedAll.repartition(100).filter("sampled = 1").select($"sellerId", $"siteId", $"count", $"desc")
scribe.info("sampledDF.count = " + sampledDF.count())

The weird thing is that the sampleBasedAll folder has a valid CSV DataFrame result saved, but sampledDF.count, as the prod log shows, is zero.

I downloaded the CSVs from the sampleBasedAll folder, then reran

sampleBasedAll.repartition(100).filter("sampled = 1").select($"sellerId", $"siteId", $"count", $"desc").count()

and it showed 13,500 records...

My question is why

sampleBasedAll.filter("sampled = 1")

has records when run locally, but the prod run didn't generate any records...

This post, Unexpected behavior of UDF for random integers with join operation, gave me the hint:

"Spark's assumption that a UDF is a deterministic function"
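To see why that assumption matters here: Spark plans are lazy, so each action (the CSV write and the later count) can re-evaluate the whole column expression, including the UDF. Here is a toy Python model of a lazily recomputed random column; this is an illustration of the re-execution behavior, not Spark's API:

```python
import random

calls = 0  # counts how many times the "UDF" actually runs

def sampled_udf(count: int) -> int:
    global calls
    calls += 1
    return 1 if random.random() < 0.5 else 0  # random decision per invocation

class LazyColumn:
    """Toy stand-in for a Spark column: recomputed on every action."""
    def __init__(self, rows, fn):
        self.rows, self.fn = rows, fn
    def evaluate(self):
        return [self.fn(r) for r in self.rows]

col = LazyColumn(rows=[20000] * 10, fn=sampled_udf)

written = col.evaluate()   # action 1: like the csv(...) write
counted = col.evaluate()   # action 2: like sampledDF.count()

# The UDF ran once per row per action (2 x 10 = 20 invocations), so the
# two results were drawn from independent random values and need not agree.
assert calls == 20
```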

A UDF can be executed more than once, so mark the sampling UDF as non-deterministic by adding .asNondeterministic(), as below (note it goes on the udf(...) result, not inside the lambda):

def sampledOrNot = udf((count: Int) => {
  if (count < TEN_K_SELLER_ITEM_BENCH) {
    1
  } else {
    val randomId = random.nextLong(0, 1000000000000L)
    var targetValue = 10000 / count.toDouble
    var base = 1
    while (targetValue < 1) {
      targetValue = targetValue * base
      base = base * 10
    }
    if (randomId % base <= 10000 / count.toDouble * base) 1 else 0
  }
}).asNondeterministic()

This solved the inconsistency issue.
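The corrected comparison can also be sanity-checked outside Spark. A small Python check, assuming `base = 1000` as the while-loop produces for count = 1,000,000, that `randomId % base <= 10000/count * base` keeps roughly the intended 10000/count fraction of rows:

```python
count = 1_000_000
base = 1000                       # what the while-loop yields for this count
threshold = 10000 / count * base  # = 10.0, the corrected comparison's bound

# Over one full cycle of random_id % base, rows with remainder 0..10 pass,
# so the keep rate is 11 / 1000 = 0.011, close to the intended 0.01.
kept = sum(1 for r in range(base) if r % base <= threshold)
keep_rate = kept / base
```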
