
Filter one data frame using another data frame in Spark Scala

I am going to demonstrate my question using the following two data frames.

val datF1 = Seq((1,"everlasting",1.39), (1,"game",2.7), (1,"life",0.69), (1,"learning",0.69),
               (2,"living",1.38), (2,"worth",1.38), (2,"life",0.69),
               (3,"learning",0.69), (3,"never",1.38)).toDF("ID","token","value")
datF1.show()

+---+-----------+-----+
| ID|      token|value|
+---+-----------+-----+
|  1|everlasting| 1.39|
|  1|       game|  2.7|
|  1|       life| 0.69|
|  1|   learning| 0.69|
|  2|     living| 1.38|
|  2|      worth| 1.38|
|  2|       life| 0.69|
|  3|   learning| 0.69|
|  3|      never| 1.38|
+---+-----------+-----+




val dataF2= Seq(("life ",0.71),("learning",0.75)).toDF("token1","val2")
dataF2.show()
+--------+----+
|  token1|val2|
+--------+----+
|   life |0.71|
|learning|0.75|
+--------+----+

I want to filter the ID and value of datF1 based on the token1 of dataF2. For each word in token1 of dataF2, if that word appears as a token for an ID, the value should equal the value in datF1; otherwise it should be zero. In other words, my desired output should look like this:

+---+----+----+
| ID| val|val2|
+---+----+----+
|  1|0.69|0.69|
|  2| 0.0|0.69|
|  3|0.69| 0.0|
+---+----+----+

Since learning is not present for ID equal to 2, val is zero there. Similarly, since life is not present for ID equal to 3, val2 equals zero.

I did it manually as follows:

val newQ61 = datF1.filter($"token" === "learning")

val newQ7  = Seq(1, 2, 3).toDF("ID")
val newQ81 = newQ7.join(newQ61, Seq("ID"), "left")
val tf2 = newQ81.select($"ID", when(col("value").isNull, 0).otherwise(col("value")) as "val")

val newQ62 = datF1.filter($"token" === "life")

val newQ71 = Seq(1, 2, 3).toDF("ID")
val newQ82 = newQ71.join(newQ62, Seq("ID"), "left")
val tf3 = newQ82.select($"ID", when(col("value").isNull, 0).otherwise(col("value")) as "val2")

val tf4 = tf2.join(tf3, Seq("ID"), "left")
tf4.show()

+---+----+----+
| ID| val|val2|
+---+----+----+
|  1|0.69|0.69|
|  2| 0.0|0.69|
|  3|0.69| 0.0|
+---+----+----+

Instead of doing this manually, is there a way to do it more efficiently by referencing one data frame from within the other? In real situations there can be many more than two words, so accessing each word manually would be very tedious.
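One way to avoid repeating the per-word steps would be to fold the same join over a list of tokens collected from dataF2. This is only a sketch of that idea (the names `tokens`, `ids`, and `result` are mine, and it assumes `spark.implicits._` is in scope, as in the snippets above):

```scala
import org.apache.spark.sql.functions.trim

// Collect the (trimmed) tokens from dataF2; note "life " has a stray space.
val tokens = dataF2.select(trim($"token1")).as[String].collect()

// Start from the distinct IDs and left-join one value column per token,
// then replace the nulls (missing tokens) with zero.
val ids = datF1.select("ID").distinct()
val result = tokens.foldLeft(ids) { (acc, t) =>
  acc.join(datF1.filter($"token" === t).select($"ID", $"value".as(t)), Seq("ID"), "left")
}.na.fill(0)
```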

Thank you

UPDATE: When I use a leftsemi join, my output is like this:

datF1.join(dataF2, $"token"===$"token1", "leftsemi").show()
+---+--------+-----+
| ID|   token|value|
+---+--------+-----+
|  1|learning| 0.69|
|  3|learning| 0.69|
+---+--------+-----+

I believe a left outer join and then pivoting on token can work here:

val ans = datF1.join(dataF2, $"token" === $"token1", "left_outer")
  .filter($"token1".isNotNull)
  .select("ID", "token", "value")
  .groupBy("ID")
  .pivot("token")
  .agg(first("value"))
  .na.fill(0)

The result:

ans.show

+---+--------+----+
| ID|learning|life|
+---+--------+----+
|  1|    0.69|0.69|
|  3|    0.69| 0.0|
|  2|     0.0|0.69|
+---+--------+----+
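A small aside (not from the original answer): `pivot` also accepts an explicit list of values, which skips Spark's extra pass to collect the distinct tokens and fixes the column order. A sketch, assuming the token list is known up front:

```scala
import org.apache.spark.sql.functions.first

// Hypothetical: the pivot values are given explicitly instead of discovered.
val tokens = Seq("learning", "life")

val pivoted = datF1.join(dataF2, $"token" === $"token1", "inner")
  .groupBy("ID")
  .pivot("token", tokens)   // no extra job to find distinct tokens
  .agg(first("value"))
  .na.fill(0)
```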

UPDATE: as the answer by Lamanus suggests, an inner join is probably a better approach than an outer join + filter.

I think an inner join is enough. By the way, I found a typo in your test case ("life " has a trailing space), which makes the result wrong.

val dataF1= Seq((1,"everlasting",1.39),
                (1,"game", 2.7),
                (1,"life",0.69),
                (1,"learning",0.69),
                (2,"living",1.38),
                (2,"worth",1.38),
                (2,"life",0.69),
                (3,"learning",0.69),
                (3,"never",1.38)).toDF("ID","token","value")
dataF1.show
// +---+-----------+-----+
// | ID|      token|value|
// +---+-----------+-----+
// |  1|everlasting| 1.39|
// |  1|       game|  2.7|
// |  1|       life| 0.69|
// |  1|   learning| 0.69|
// |  2|     living| 1.38|
// |  2|      worth| 1.38|
// |  2|       life| 0.69|
// |  3|   learning| 0.69|
// |  3|      never| 1.38|
// +---+-----------+-----+

val dataF2= Seq(("life",0.71), // "life " -> "life"
                ("learning",0.75)).toDF("token1","val2")
dataF2.show
// +--------+----+
// |  token1|val2|
// +--------+----+
// |    life|0.71|
// |learning|0.75|
// +--------+----+

val resultDF = dataF1.join(dataF2, $"token" === $"token1", "inner")
resultDF.show
// +---+--------+-----+--------+----+
// | ID|   token|value|  token1|val2|
// +---+--------+-----+--------+----+
// |  1|    life| 0.69|    life|0.71|
// |  1|learning| 0.69|learning|0.75|
// |  2|    life| 0.69|    life|0.71|
// |  3|learning| 0.69|learning|0.75|
// +---+--------+-----+--------+----+

resultDF.groupBy("ID").pivot("token").agg(first("value"))
  .na.fill(0).orderBy("ID").show

This will give you the following result:

+---+--------+----+
| ID|learning|life|
+---+--------+----+
|  1|    0.69|0.69|
|  2|     0.0|0.69|
|  3|    0.69| 0.0|
+---+--------+----+
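If the exact header from the question (`val` and `val2`) is wanted, the pivoted columns can be renamed afterwards. A sketch on top of `resultDF`, using the mapping the question describes (`learning` → `val`, `life` → `val2`):

```scala
// Rename the pivoted token columns to match the asked-for output header.
val finalDF = resultDF.groupBy("ID").pivot("token").agg(first("value"))
  .na.fill(0)
  .withColumnRenamed("learning", "val")
  .withColumnRenamed("life", "val2")
  .select("ID", "val", "val2")
  .orderBy("ID")
```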

It seems like you need a "left semi join". It filters one dataframe based on another one. Try using it like this:

datF1.join(dataF2, $"token" === $"token1", "leftsemi")

You can find a bit more info here - https://medium.com/datamindedbe/little-known-spark-dataframe-join-types-cc524ea39fd5
