Filter dataframe based on another data frame scala

Question

Currently I am doing:

val DF = sqlSession.sql("select itemIdDig as itemId, "
      + "title, "
      + "timestamp as time "
      + "from itemTable ")

val tempDF = sqlSession.sql("select itemIdDig as itemId "
      + "from itemTable "
      + "group by itemIdDig HAVING count(*) >= 10 ").rdd.map(r => r(0)).collect()


//keep rows whose itemId is not in tempDF
DF.filter(!col("itemId").isin(tempDF  : _*)).toDF

But this is very slow. Can someone suggest a better way to achieve this? Basically, I am looking for the rows whose itemId is not in tempDF. (I tried group by with having, which gives me unique itemIds, but I want to preserve the duplicates.)

Answer 1

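Just use a left anti join. Keep tempDF as a DataFrame (drop the .rdd.map(r => r(0)).collect() call), since join expects a DataFrame, not an Array: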

DF.join(tempDF,  Seq("itemId"), "leftanti")
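For illustration, here is a minimal self-contained sketch of the left anti join approach. The object name LeftAntiSketch, the helper dropFrequentItems, the sample rows, and the local SparkSession are all assumptions made for this example and are not part of the original question; the threshold is lowered to 2 so the small sample shows the effect.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LeftAntiSketch {

  // Hypothetical helper: drop every row whose itemId occurs at least
  // minCount times, keeping duplicates among the remaining rows.
  def dropFrequentItems(df: DataFrame, minCount: Long): DataFrame = {
    // Stays a DataFrame: nothing is collected to the driver.
    val frequent = df
      .groupBy("itemId")
      .count()
      .filter(col("count") >= minCount)
      .select("itemId")

    // "leftanti" keeps only the left-side rows with no match on the right.
    df.join(frequent, Seq("itemId"), "leftanti")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("left-anti-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for itemTable.
    val items = Seq(
      (1, "a", 100L), (1, "a", 101L),
      (2, "b", 102L),
      (3, "c", 103L), (3, "c", 104L)
    ).toDF("itemId", "title", "time")

    // itemIds 1 and 3 each occur twice, so only the itemId 2 row remains.
    dropFrequentItems(items, minCount = 2L).show()

    spark.stop()
  }
}
```

Because the list of itemIds is never collected to the driver, the filtering runs as a distributed join instead of a large isin list, which is the likely source of the slowdown in the original code.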

1 answer

solution1
2 ACCEPTED 2018-01-16 22:46:00