Spark DataFrame filter function not working
I am new to Spark. We have a project that reads data from HBase and saves it to an RDD. The DataFrame count is 5,280,000. Here is the code:
val df = spark.createDataFrame(rddDump, schema)

// Returns 1 if the row should be kept in the sample, 0 otherwise.
def sampledOrNot = udf((count: Int) => {
  if (count < TEN_K_SELLER_ITEM_BENCH) {
    1
  } else {
    val randomId = random.nextLong(0, 1000000000000L)
    var targetValue = 10000 / count.toDouble
    var base = 1
    while (targetValue < 1) {
      targetValue = targetValue * base
      base = base * 10
    }
    if (randomId % base <= (targetValue.intValue() + 1)) 1 else 0
  }
})

val sampleBasedAll = df.withColumn("sampled", sampledOrNot(col("count")))
sampleBasedAll.repartition(10)
  .write
  .option("header", value = true)
  .option("compression", "gzip")
  .csv("/sampleBasedAll")

val sampledDF = sampleBasedAll.repartition(100)
  .filter("sampled = 1")
  .select($"sellerId", $"siteId", $"count", $"desc")
scribe.info("sampledDF.count = " + sampledDF.count())
The weird thing is that the sampleBasedAll folder contains valid CSV output, but the production log shows sampledDF.count as zero.
I downloaded the CSV files from the sampleBasedAll folder and reran the following locally:
sampleBasedAll.repartition(100).filter("sampled = 1").select($"sellerId", $"siteId", $"count", $"desc").count()
It returned 13,500 records.
My question is: why does sampleBasedAll.filter("sampled = 1") return records when run locally, while the production run produced none?
The post "Unexpected behavior of UDF for random integers with join operation" gave me the hint: "Spark's assumption that a UDF is a deterministic function". Because of that assumption, a UDF can be executed more than once for the same row, and a random UDF can return a different value each time. I updated the sampling UDF by adding .asNondeterministic(), as shown below:
def sampledOrNot = udf((count: Int) => {
  if (count < TEN_K_SELLER_ITEM_BENCH) {
    1
  } else {
    val randomId = random.nextLong(0, 1000000000000L)
    var targetValue = 10000 / count.toDouble
    var base = 1
    while (targetValue < 1) {
      targetValue = targetValue * base
      base = base * 10
    }
    if (randomId % base <= 10000 / count.toDouble * base) 1 else 0
  }
}).asNondeterministic() // must be called on the UDF itself, not inside the lambda body
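As a side note on the arithmetic inside the updated UDF, here is a minimal plain-Scala sketch (no Spark needed) that reproduces the loop and reports what fraction of the possible remainders would be accepted. The object and method names are hypothetical, and the value of TEN_K_SELLER_ITEM_BENCH is assumed to be 10000 because the post does not show it.

```scala
object SamplingThreshold {
  // Assumed value; the original post does not show TEN_K_SELLER_ITEM_BENCH.
  val TenKSellerItemBench = 10000

  /** Returns (base, threshold) exactly as the loop in the updated UDF computes them. */
  def baseAndThreshold(count: Int): (Int, Double) = {
    var targetValue = 10000 / count.toDouble
    var base = 1
    while (targetValue < 1) {
      targetValue = targetValue * base
      base = base * 10
    }
    (base, 10000 / count.toDouble * base)
  }

  /** Fraction of remainders r in [0, base) that satisfy r <= threshold. */
  def acceptanceRate(count: Int): Double = {
    if (count < TenKSellerItemBench) {
      1.0
    } else {
      val (base, threshold) = baseAndThreshold(count)
      // Remainders 0..floor(threshold) pass the check, capped at base values.
      val accepted = math.min(base.toLong, math.floor(threshold).toLong + 1)
      accepted.toDouble / base
    }
  }
}
```

For example, with count = 50000 the loop ends with base = 100 and a threshold of 20, so 21 of the 100 possible remainders pass. That is close to, but slightly above, the intended 10000 / 50000 = 20%, because the comparison uses <= rather than <.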
Marking the UDF with .asNondeterministic() solved the inconsistency issue.
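For anyone who wants to try this end to end, below is a minimal self-contained sketch, not the production job. It assumes a local SparkSession and a tiny in-memory DataFrame in place of the HBase data, assumes a bench value of 10000, and swaps the modulus-based check for a simpler comparison against ThreadLocalRandom.nextDouble(). The object name and the countSampled helper are hypothetical. It also persists the DataFrame, since each action otherwise recomputes the random column; persisting makes reuse of the same values likely but is not a hard guarantee, because lost or evicted partitions can be recomputed.

```scala
import java.util.concurrent.ThreadLocalRandom

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object NondeterministicUdfSketch {
  // Assumed value; the original post does not show TEN_K_SELLER_ITEM_BENCH.
  val TenKSellerItemBench = 10000

  // A val, so one UDF instance is reused; asNondeterministic() tells the
  // optimizer not to assume that repeated calls return the same result.
  val sampledOrNot = udf((count: Int) => {
    if (count < TenKSellerItemBench) {
      1
    } else {
      // Simplified check (an assumption, not the original modulus logic).
      val keepProbability = 10000 / count.toDouble
      if (ThreadLocalRandom.current().nextDouble() < keepProbability) 1 else 0
    }
  }).asNondeterministic()

  /** Builds a toy DataFrame, samples it, and returns the number of sampled rows. */
  def countSampled(spark: SparkSession): Long = {
    import spark.implicits._

    // Toy rows standing in for the HBase-backed DataFrame.
    val df = Seq(("s1", 1, 500), ("s2", 1, 20000), ("s3", 2, 1000000))
      .toDF("sellerId", "siteId", "count")

    // Persist so that later actions can reuse the computed "sampled" column.
    val sampleBasedAll = df
      .withColumn("sampled", sampledOrNot(col("count")))
      .persist()

    val sampled = sampleBasedAll.filter("sampled = 1").count()
    sampleBasedAll.unpersist()
    sampled
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nondeterministic-udf-sketch")
      .getOrCreate()
    println(s"sampledDF.count = ${countSampled(spark)}")
    spark.stop()
  }
}
```

The row with count = 500 is below the assumed bench and is always kept, so the count is at least 1; the other two rows are kept at random.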