
Filter a DataFrame based on another DataFrame in Scala

Currently I am doing:

val DF = sqlSession.sql("select itemIdDig as itemId, "
      + "title, "
      + "timestamp as time "
      + "from itemTable ")

val tempDF = sqlSession.sql("select itemIdDig as itemId "
      + "from itemTable "
      + "group by itemIdDig HAVING count(*) >= 10 ").rdd.map(r => r(0)).collect()


// keep the rows of DF whose itemId is not in tempDF
DF.filter(!col("itemId").isin(tempDF  : _*)).toDF

But this is very slow. Can someone suggest a better way to achieve this? Basically, I am looking for the rows whose itemId is not in tempDF. (I tried group by with having, which gives me the unique itemIds, but I want to preserve the duplicates.)

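Just use an anti join, keeping tempDF as a DataFrame (that is, without the .rdd.map(r => r(0)).collect() at the end):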

DF.join(tempDF,  Seq("itemId"), "leftanti")
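For reference, here is a minimal end-to-end sketch of that approach. It assumes a local SparkSession and uses made-up sample rows in place of itemTable. The names AntiJoinSketch, dropFrequentItems and minCount are hypothetical and only there for illustration. The threshold is 3 instead of 10 so that the sample data stays small.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AntiJoinSketch {
  // Drops every row whose itemId occurs at least `minCount` times in `items`.
  // (Hypothetical helper, not part of the original question or answer.)
  def dropFrequentItems(items: DataFrame, minCount: Long): DataFrame = {
    // Stays a DataFrame: no .rdd and no collect(), so the ids are never
    // pulled to the driver.
    val frequentIds = items
      .groupBy("itemId")
      .count()                          // adds a column named "count"
      .filter(col("count") >= minCount)
      .select("itemId")

    // A left anti join keeps only the rows of `items` that have no match
    // in `frequentIds`; duplicate rows on the left side are preserved.
    items.join(frequentIds, Seq("itemId"), "leftanti")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("anti-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for itemTable.
    val items = Seq(
      (1L, "a", 100L), (1L, "a", 101L), (1L, "a", 102L),
      (2L, "b", 200L), (2L, "b", 201L),
      (3L, "c", 300L)
    ).toDF("itemId", "title", "time")

    // itemId 1 appears three times and is dropped; both rows for itemId 2
    // and the single row for itemId 3 remain.
    dropFrequentItems(items, minCount = 3).show()

    spark.stop()
  }
}
```

The likely reason this is faster than the isin version is that the list of ids is not collected to the driver and turned into one large IN predicate. Instead, Spark can plan the whole thing as a single distributed join.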

