spark dataframe filter function not working
I am new to Spark. We have a project that reads data from HBase into an RDD and builds a DataFrame from it. The DataFrame count is 5,280,000, and the code is as follows:
    import org.apache.spark.sql.functions.{col, udf}
    import spark.implicits._ // for the $"..." column syntax used below

    val df = spark.createDataFrame(rddDump, schema)

    def sampledOrNot = udf((count: Int) => {
      if (count < TEN_K_SELLER_ITEM_BENCH) {
        1
      } else {
        // `random` is defined elsewhere in the project, e.g. a ThreadLocalRandom
        val randomId = random.nextLong(0, 1000000000000L)
        // aim to keep roughly 10,000 of the `count` rows
        var targetValue = 10000 / count.toDouble
        var base = 1
        // scale targetValue up by powers of ten until it reaches 1,
        // accumulating the scale factor in `base`
        while (targetValue < 1) {
          targetValue = targetValue * 10
          base = base * 10
        }
        if (randomId % base <= targetValue.intValue() + 1) 1 else 0
      }
    })
    val sampleBasedAll = df.withColumn("sampled", sampledOrNot(col("count")))

    sampleBasedAll.repartition(10)
      .write.option("header", value = true)
      .option("compression", "gzip")
      .csv("/sampleBasedAll")

    val sampledDF = sampleBasedAll.repartition(100)
      .filter("sampled = 1")
      .select($"sellerId", $"siteId", $"count", $"desc")

    scribe.info("sampledDF.count = " + sampledDF.count())
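For context, the snippet assumes rddDump and schema come from the HBase scan. A minimal sketch of what they might look like (the field types are my assumption, inferred from the columns selected later; they are not from the original post):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructField, StructType}

    // assumed schema covering the four columns referenced later in the post
    val schema = StructType(Seq(
      StructField("sellerId", LongType),
      StructField("siteId", IntegerType),
      StructField("count", IntegerType),
      StructField("desc", StringType)
    ))

    // rddDump would then be an RDD[Row] mapped from the HBase results, e.g.
    // val rddDump = hbaseRows.map(r => Row(r.sellerId, r.siteId, r.count, r.desc))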
The strange thing is that the /sampleBasedAll folder contains a valid CSV dump of the DataFrame, yet the production log shows sampledDF.count as zero.

I downloaded the CSVs from the sampleBasedAll folder, loaded them locally, and re-ran

    sampleBasedAll.repartition(100).filter("sampled = 1").select($"sellerId", $"siteId", $"count", $"desc").count()

which showed 13,500 records...
My question is: why does

    sampleBasedAll.filter("sampled = 1")

return records when run locally, while the production run produced none?
The post Unexpected behavior of UDF for random integers with join operation gave me the hint: "Spark assumes the UDF is a deterministic function", so a plain UDF can be executed more than once, and different actions may see different results. Marking the UDF with .asNondeterministic() fixes this; a toy sketch of the effect follows.
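Because Spark treats a plain UDF as deterministic, nothing forces the values written by one action to match the values recomputed by the next action over the same lineage. A minimal sketch of that inconsistency (the names and the /tmp path are illustrative, not from the original project):

    import org.apache.spark.sql.functions.udf
    import scala.util.Random

    // random, but Spark assumes it is deterministic
    val flip = udf(() => Random.nextInt(2))

    val withFlip = spark.range(1000000).withColumn("sampled", flip())

    // action 1: writes one evaluation of the UDF
    withFlip.write.mode("overwrite").csv("/tmp/flip-demo")

    // action 2: recomputes the lineage, triggering a fresh evaluation,
    // so the counted rows need not match the rows that were just written
    println(withFlip.filter("sampled = 1").count())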
The updated sample UDF is below (note that .asNondeterministic() goes on the result of udf(...), not inside the lambda):
    def sampledOrNot = udf((count: Int) => {
      if (count < TEN_K_SELLER_ITEM_BENCH) {
        1
      } else {
        val randomId = random.nextLong(0, 1000000000000L)
        var targetValue = 10000 / count.toDouble
        var base = 1
        while (targetValue < 1) {
          targetValue = targetValue * 10
          base = base * 10
        }
        // accept with probability ~ 10000/count, comparing against the rescaled ratio
        if (randomId % base <= 10000 / count.toDouble * base) 1 else 0
      }
    }).asNondeterministic() // tell Spark not to re-evaluate or reorder this UDF freely
This resolved the inconsistency.
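For completeness, two related options (my notes, not from the original post): persisting the DataFrame right after the random column is added pins the sampled values so that both the write and the count reuse the same materialized rows; alternatively, the acceptance test can be expressed with Spark's built-in rand() instead of a UDF, with the same persistence caveat since rand() is also nondeterministic. A sketch:

    import org.apache.spark.sql.functions.{col, lit, rand}

    // Option 1: materialize the random column once; later actions reuse it
    val stable = df.withColumn("sampled", sampledOrNot(col("count"))).persist()

    // Option 2 (hypothetical rewrite): keep a row with probability ~ 10000/count,
    // with no UDF involved
    val sampled = df.filter(
      col("count") < TEN_K_SELLER_ITEM_BENCH || rand() < lit(10000) / col("count")
    )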