Currently I am doing:
val DF = sqlSession.sql("select itemIdDig as itemId, "
+ "title, "
+ "timestamp as time "
+ "from itemTable ")
val tempDF = sqlSession.sql("select itemIdDig as itemId "
+ "from itemTable "
+ "group by itemIdDig HAVING count(*) >= 10 ").rdd.map(r => r(0)).collect()
// keep rows whose itemId is not in tempDF
DF.filter(!col("itemId").isin(tempDF : _*)).toDF
But this is very slow. Can someone suggest a better way to achieve this? Basically, I am looking for the rows whose itemId is not in tempDF.
(I tried GROUP BY ... HAVING, which gives me unique itemIds,
but I want to preserve the duplicates.)
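Just use a left anti join (tempDF has to stay a DataFrame for this, so drop the .rdd.map(...).collect() call):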
DF.join(tempDF, Seq("itemId"), "leftanti")
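For completeness, here is a minimal end-to-end sketch of that approach, assuming a local SparkSession and a small made-up data set. The names AntiJoinSketch, dropFrequentItems and frequentIds are hypothetical, and the threshold is lowered to 3 so the sample stays short. The ids to exclude are kept as a DataFrame instead of being collected to the driver, and the left anti join returns only the rows of the left side that have no match on the right, with duplicates preserved.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit}

object AntiJoinSketch {
  // Returns the rows of `items` whose itemId occurs fewer than `minCount` times.
  // Duplicate rows of the remaining itemIds are preserved.
  def dropFrequentItems(items: DataFrame, minCount: Long): DataFrame = {
    // itemIds that occur at least `minCount` times; this stays a DataFrame (no collect()).
    val frequentIds = items
      .groupBy("itemId")
      .agg(count(lit(1)).as("n"))
      .filter(col("n") >= minCount)
      .select("itemId")

    // "leftanti" keeps only the left rows that have no matching itemId on the right.
    items.join(frequentIds, Seq("itemId"), "leftanti")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("anti-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: itemId 1 appears three times, 2 twice, 3 once.
    val items = Seq(
      (1, "a", 100L), (1, "b", 101L), (1, "c", 102L),
      (2, "d", 103L), (2, "e", 104L),
      (3, "f", 105L)
    ).toDF("itemId", "title", "time")

    // With minCount = 3, the rows for itemId 1 are removed; both rows for 2 and the row for 3 remain.
    dropFrequentItems(items, minCount = 3).show()

    spark.stop()
  }
}
```

The join lets Spark handle the exclusion as a distributed operation, instead of shipping a potentially large array of ids to the driver and back inside an isin expression.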