
spark dataframe filter function not working

I am new to Spark. We have a project that reads data from HBase and saves it to an RDD. The DataFrame count is 5,280,000; here is the code:

val df = spark.createDataFrame(rddDump, schema)

def sampledOrNot = udf((count: Int) => {
  if (count < TEN_K_SELLER_ITEM_BENCH) {
    1                                         // small sellers are always kept
  } else {
    val randomId = random.nextLong(0, 1000000000000L)
    var targetValue = 10000 / count.toDouble  // intended keep probability
    var base = 1
    while (targetValue < 1) {                 // scale the target up to >= 1
      targetValue = targetValue * base
      base = base * 10
    }
    if (randomId % base <= (targetValue.intValue() + 1)) 1 else 0
  }
})
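(For reference, the thresholding arithmetic inside this UDF can be traced outside Spark. A minimal Python sketch of the same logic; `TEN_K_SELLER_ITEM_BENCH` is assumed to be 10000 here, since in the real project it is defined elsewhere:

```python
# Pure-Python model of the sampling UDF's arithmetic (no Spark involved).
# TEN_K_SELLER_ITEM_BENCH = 10000 is an assumption for illustration.
TEN_K_SELLER_ITEM_BENCH = 10000

def sampled_or_not(count: int, random_id: int) -> int:
    if count < TEN_K_SELLER_ITEM_BENCH:
        return 1                       # small sellers are always kept
    target_value = 10000 / count       # intended keep probability
    base = 1
    while target_value < 1:            # scale the target up to >= 1
        target_value *= base
        base *= 10
    # keep the row when the random id falls under the scaled threshold
    return 1 if random_id % base <= int(target_value) + 1 else 0

# For count = 1,000,000 the loop ends with base = 1000 and
# target_value = 10.0, so a row is kept when random_id % 1000 <= 11,
# i.e. about 1.2% of rows instead of the intended 1%.
keep_rate = sum(sampled_or_not(1_000_000, r) for r in range(1000)) / 1000
```
)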

val sampleBasedAll = df.withColumn("sampled", sampledOrNot(col("count")))
sampleBasedAll.repartition(10).write.option("header", value = true).option("compression", "gzip").csv("/sampleBasedAll")

val sampledDF = sampleBasedAll.repartition(100).filter("sampled = 1").select($"sellerId", $"siteId", $"count", $"desc")
scribe.info("sampledDF.count = " + sampledDF.count())

The weird thing is that the sampleBasedAll folder has a valid CSV DataFrame result saved, but sampledDF.count, as the prod log shows, is zero.

I downloaded the CSVs from the sampleBasedAll folder, then reran

sampleBasedAll.repartition(100).filter("sampled = 1").select($"sellerId", $"siteId", $"count", $"desc").count()

and it showed 13,500 records...

My question is why

sampleBasedAll.filter("sampled = 1")

has records when run locally, but the prod run didn't generate any records...

This post, Unexpected behavior of UDF for random integers with join operation, gave me the hint:

"Spark's assumption that a UDF is a deterministic function"
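To see why that assumption matters here: Spark plans are lazy, so each action (the CSV write and the later count) can re-evaluate the whole column expression, including the UDF. Here is a toy Python model of a lazily recomputed random column; this is an illustration of the re-execution behavior, not Spark's API:

```python
import random

calls = 0  # counts how many times the "UDF" actually runs

def sampled_udf(count: int) -> int:
    global calls
    calls += 1
    return 1 if random.random() < 0.5 else 0  # random decision per invocation

class LazyColumn:
    """Toy stand-in for a Spark column: recomputed on every action."""
    def __init__(self, rows, fn):
        self.rows, self.fn = rows, fn
    def evaluate(self):
        return [self.fn(r) for r in self.rows]

col = LazyColumn(rows=[20000] * 10, fn=sampled_udf)

written = col.evaluate()   # action 1: like the csv(...) write
counted = col.evaluate()   # action 2: like sampledDF.count()

# The UDF ran once per row per action (2 x 10 = 20 invocations), so the
# two results were drawn from independent random values and need not agree.
assert calls == 20
```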

A UDF can be executed more than once, so mark the sampling UDF as non-deterministic by adding .asNondeterministic(), as below (note it goes on the udf(...) result, not inside the lambda):

def sampledOrNot = udf((count: Int) => {
  if (count < TEN_K_SELLER_ITEM_BENCH) {
    1
  } else {
    val randomId = random.nextLong(0, 1000000000000L)
    var targetValue = 10000 / count.toDouble
    var base = 1
    while (targetValue < 1) {
      targetValue = targetValue * base
      base = base * 10
    }
    if (randomId % base <= 10000 / count.toDouble * base) 1 else 0
  }
}).asNondeterministic()

This solved the inconsistency issue.
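The corrected comparison can also be sanity-checked outside Spark. A small Python check, assuming `base = 1000` as the while-loop produces for count = 1,000,000, that `randomId % base <= 10000/count * base` keeps roughly the intended 10000/count fraction of rows:

```python
count = 1_000_000
base = 1000                       # what the while-loop yields for this count
threshold = 10000 / count * base  # = 10.0, the corrected comparison's bound

# Over one full cycle of random_id % base, rows with remainder 0..10 pass,
# so the keep rate is 11 / 1000 = 0.011, close to the intended 0.01.
kept = sum(1 for r in range(base) if r % base <= threshold)
keep_rate = kept / base
```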
