I am going to demonstrate my question using following two data frames.
val datF1= Seq((1,"everlasting",1.39),(1,"game", 2.7),(1,"life",0.69),(1,"learning",0.69),
(2,"living",1.38),(2,"worth",1.38),(2,"life",0.69),(3,"learning",0.69),(3,"never",1.38)).toDF("ID","token","value")
datF1.show()
+---+-----------+-----+
| ID| token|value|
+---+-----------+-----+
| 1|everlasting| 1.39|
| 1| game| 2.7|
| 1| life| 0.69|
| 1| learning| 0.69|
| 2| living| 1.38|
| 2| worth| 1.38|
| 2| life| 0.69|
| 3| learning| 0.69|
| 3| never| 1.38|
+---+-----------+-----+
val dataF2= Seq(("life ",0.71),("learning",0.75)).toDF("token1","val2")
dataF2.show()
+--------+----+
| token1|val2|
+--------+----+
| life |0.71|
|learning|0.75|
+--------+----+
I want to filter the ID
and value
of dataF1
based on the token1
of dataF2
. For the each word in token1
of dataF2
, if there is a word token then value
should be equal to the value of dataF1
else value should be zero. In other words my desired output should be like this
+---+----+----+
| ID| val|val2|
+---+----+----+
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
+---+----+----+
Since learning is not presented in ID equals 2 , the val has equal to zero. Similarly since life is not there for ID equal 3, val2 equlas zero.
I did it manually as follows ,
val newQ61=datF1.filter($"token"==="learning")
val newQ7 =Seq(1,2,3).toDF("ID")
val newQ81 =newQ7.join(newQ61, Seq("ID"), "left")
val tf2=newQ81.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val" )
val newQ62=datF1.filter($"token"==="life")
val newQ71 =Seq(1,2,3).toDF("ID")
val newQ82 =newQ71.join(newQ62, Seq("ID"), "left")
val tf3=newQ82.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val2" )
val tf4 =tf2.join(tf3 ,Seq("ID"), "left")
tf4.show()
+---+----+----+
| ID| val|val2|
+---+----+----+
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
+---+----+----+
Instead of doing this manually , is there a way to do this more efficiently by accessing indexes of one data frame within the other data frame ? because in real life situations, there can be more than 2 words so manually accessing each word may be very hard thing to do.
Thank you
UPDATE When i use leftsemi
join my output is like this :
datF1.join(dataF2, $"token"===$"token1", "leftsemi").show()
+---+--------+-----+
| ID| token|value|
+---+--------+-----+
| 1|learning| 0.69|
| 3|learning| 0.69|
+---+--------+-----+
I believe a left outer join and then pivoting on token
can work here:
val ans = df1.join(df2, $"token" === $"token1", "LEFT_OUTER")
.filter($"token1".isNotNull)
.select("ID","token","value")
.groupBy("ID")
.pivot("token")
.agg(first("value"))
.na.fill(0)
The result (without the null handling):
ans.show
+---+--------+----+
| ID|learning|life|
+---+--------+----+
| 1| 0.69|0.69|
| 3| 0.69|0.0 |
| 2| 0.0 |0.69|
+---+--------+----+
UPDATE : as the answer by Lamanus suggest, an inner join is possibly a better approach than an outer join + filter.
I think the inner
join is enough. Btw, I found the typo in your test case, which makes the result wrong.
val dataF1= Seq((1,"everlasting",1.39),
(1,"game", 2.7),
(1,"life",0.69),
(1,"learning",0.69),
(2,"living",1.38),
(2,"worth",1.38),
(2,"life",0.69),
(3,"learning",0.69),
(3,"never",1.38)).toDF("ID","token","value")
dataF1.show
// +---+-----------+-----+
// | ID| token|value|
// +---+-----------+-----+
// | 1|everlasting| 1.39|
// | 1| game| 2.7|
// | 1| life| 0.69|
// | 1| learning| 0.69|
// | 2| living| 1.38|
// | 2| worth| 1.38|
// | 2| life| 0.69|
// | 3| learning| 0.69|
// | 3| never| 1.38|
// +---+-----------+-----+
val dataF2= Seq(("life",0.71), // "life " -> "life"
("learning",0.75)).toDF("token1","val2")
dataF2.show
// +--------+----+
// | token1|val2|
// +--------+----+
// | life|0.71|
// |learning|0.75|
// +--------+----+
val resultDF = dataF1.join(dataF2, $"token" === $"token1", "inner")
resultDF.show
// +---+--------+-----+--------+----+
// | ID| token|value| token1|val2|
// +---+--------+-----+--------+----+
// | 1| life| 0.69| life|0.71|
// | 1|learning| 0.69|learning|0.75|
// | 2| life| 0.69| life|0.71|
// | 3|learning| 0.69|learning|0.75|
// +---+--------+-----+--------+----+
resultDF.groupBy("ID").pivot("token").agg(first("value"))
.na.fill(0).orderBy("ID").show
This will give you the result such as
+---+--------+----+
| ID|learning|life|
+---+--------+----+
| 1| 0.69|0.69|
| 2| 0.0|0.69|
| 3| 0.69| 0.0|
+---+--------+----+
Seems like you need "left semi-join". It will filter one dataframe, based on another one. Try using it like
datF1.join(datF2, $"token"===$"token2", "leftsemi")
You can find a bit more info here - https://medium.com/datamindedbe/little-known-spark-dataframe-join-types-cc524ea39fd5
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.