
Filter one data frame using another data frame in Spark Scala

I am going to demonstrate my question using the following two data frames.

val datF1 = Seq((1,"everlasting",1.39),(1,"game",2.7),(1,"life",0.69),(1,"learning",0.69),
                (2,"living",1.38),(2,"worth",1.38),(2,"life",0.69),(3,"learning",0.69),(3,"never",1.38)).toDF("ID","token","value")
datF1.show()

+---+-----------+-----+
| ID|      token|value|
+---+-----------+-----+
|  1|everlasting| 1.39|
|  1|       game|  2.7|
|  1|       life| 0.69|
|  1|   learning| 0.69|
|  2|     living| 1.38|
|  2|      worth| 1.38|
|  2|       life| 0.69|
|  3|   learning| 0.69|
|  3|      never| 1.38|
+---+-----------+-----+




val dataF2= Seq(("life ",0.71),("learning",0.75)).toDF("token1","val2")
dataF2.show()
+--------+----+
|  token1|val2|
+--------+----+
|   life |0.71|
|learning|0.75|
+--------+----+

I want to filter the ID and value of datF1 based on the token1 column of dataF2. For each word in token1 of dataF2, if an ID has a matching token, the output value should equal the value from datF1; otherwise the value should be zero. In other words, my desired output should look like this:

+---+----+----+
| ID| val|val2|
+---+----+----+
|  1|0.69|0.69|
|  2| 0.0|0.69|
|  3|0.69| 0.0|
+---+----+----+

Since learning is not present for ID equal to 2, val is zero. Similarly, since life is not there for ID equal to 3, val2 equals zero.

I did it manually as follows:

import org.apache.spark.sql.functions.{col, when}

// "learning": left join the matching rows onto all IDs, replacing missing values with 0
val newQ61 = datF1.filter($"token" === "learning")
val newQ7 = Seq(1,2,3).toDF("ID")
val newQ81 = newQ7.join(newQ61, Seq("ID"), "left")
val tf2 = newQ81.select($"ID", when(col("value").isNull, 0).otherwise(col("value")) as "val")

// "life": same steps
val newQ62 = datF1.filter($"token" === "life")
val newQ71 = Seq(1,2,3).toDF("ID")
val newQ82 = newQ71.join(newQ62, Seq("ID"), "left")
val tf3 = newQ82.select($"ID", when(col("value").isNull, 0).otherwise(col("value")) as "val2")

// combine both per-token results on ID
val tf4 = tf2.join(tf3, Seq("ID"), "left")
tf4.show()

+---+----+----+
| ID| val|val2|
+---+----+----+
|  1|0.69|0.69|
|  2| 0.0|0.69|
|  3|0.69| 0.0|
+---+----+----+

Instead of doing this manually, is there a way to do this more efficiently by accessing indexes of one data frame within the other data frame? In real-life situations there can be more than 2 words, so handling each word manually may be very hard to do.

Thank you

UPDATE: When I use a leftsemi join, my output is like this:

datF1.join(dataF2, $"token"===$"token1", "leftsemi").show()
+---+--------+-----+
| ID|   token|value|
+---+--------+-----+
|  1|learning| 0.69|
|  3|learning| 0.69|
+---+--------+-----+

I believe a left outer join and then pivoting on token can work here:

// df1 and df2 stand for datF1 and dataF2 from the question
val ans = df1.join(df2, $"token" === $"token1", "LEFT_OUTER")
  .filter($"token1".isNotNull)
  .select("ID","token","value")
  .groupBy("ID")
  .pivot("token")
  .agg(first("value"))
  .na.fill(0)

The result (without the null handling):

ans.show

+---+--------+----+
| ID|learning|life|
+---+--------+----+
|  1|    0.69|0.69|
|  3|    0.69| 0.0|
|  2|     0.0|0.69|
+---+--------+----+

UPDATE: as the answer by Lamanus suggests, an inner join is possibly a better approach than an outer join + filter.
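
As a variation on the pivot approach, the list of tokens can be passed to pivot explicitly. The following is only a sketch under some assumptions: it creates its own local SparkSession, it uses the token "life" without the trailing space, and the name `wanted` is introduced here for illustration. Passing the values explicitly means a column is produced for every token in dataF2 even when no row of datF1 matches it, and Spark does not have to compute the distinct pivot values itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

// Local session for illustration only (assumption, not part of the original post).
val spark = SparkSession.builder().master("local[*]").appName("pivot-sketch").getOrCreate()
import spark.implicits._

val datF1 = Seq(
  (1, "everlasting", 1.39), (1, "game", 2.7), (1, "life", 0.69), (1, "learning", 0.69),
  (2, "living", 1.38), (2, "worth", 1.38), (2, "life", 0.69),
  (3, "learning", 0.69), (3, "never", 1.38)
).toDF("ID", "token", "value")
val dataF2 = Seq(("life", 0.71), ("learning", 0.75)).toDF("token1", "val2")

// Collect the (assumed small) list of wanted tokens to the driver.
val wanted: Seq[String] = dataF2.select("token1").as[String].collect().toSeq

val ans = datF1
  .filter($"token".isin(wanted: _*)) // keep only rows whose token is wanted
  .groupBy("ID")
  .pivot("token", wanted)            // one output column per wanted token
  .agg(first("value"))
  .na.fill(0)                        // IDs without a given token get 0
  .orderBy("ID")

ans.show()
```

Collecting the tokens to the driver is reasonable only when dataF2 is small; for a large token list the join-based versions are probably the safer choice.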

I think the inner join is enough. Btw, I found a typo in your test case ("life " with a trailing space), which makes the result wrong.

val dataF1= Seq((1,"everlasting",1.39),
                (1,"game", 2.7),
                (1,"life",0.69),
                (1,"learning",0.69),
                (2,"living",1.38),
                (2,"worth",1.38),
                (2,"life",0.69),
                (3,"learning",0.69),
                (3,"never",1.38)).toDF("ID","token","value")
dataF1.show
// +---+-----------+-----+
// | ID|      token|value|
// +---+-----------+-----+
// |  1|everlasting| 1.39|
// |  1|       game|  2.7|
// |  1|       life| 0.69|
// |  1|   learning| 0.69|
// |  2|     living| 1.38|
// |  2|      worth| 1.38|
// |  2|       life| 0.69|
// |  3|   learning| 0.69|
// |  3|      never| 1.38|
// +---+-----------+-----+

val dataF2= Seq(("life",0.71), // "life " -> "life"
                ("learning",0.75)).toDF("token1","val2")
dataF2.show
// +--------+----+
// |  token1|val2|
// +--------+----+
// |    life|0.71|
// |learning|0.75|
// +--------+----+

val resultDF = dataF1.join(dataF2, $"token" === $"token1", "inner")
resultDF.show
// +---+--------+-----+--------+----+
// | ID|   token|value|  token1|val2|
// +---+--------+-----+--------+----+
// |  1|    life| 0.69|    life|0.71|
// |  1|learning| 0.69|learning|0.75|
// |  2|    life| 0.69|    life|0.71|
// |  3|learning| 0.69|learning|0.75|
// +---+--------+-----+--------+----+

resultDF.groupBy("ID").pivot("token").agg(first("value"))
  .na.fill(0).orderBy("ID").show

This will give you a result such as:

+---+--------+----+
| ID|learning|life|
+---+--------+----+
|  1|    0.69|0.69|
|  2|     0.0|0.69|
|  3|    0.69| 0.0|
+---+--------+----+
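
If the stray whitespace comes from real data rather than from a typo, it may be easier to normalize the join key than to fix each value by hand. A minimal sketch, assuming a local SparkSession and using the built-in `trim` function; the names `cleanedF2` and `wide` are made up for this example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{first, trim}

// Local session for illustration only (assumption, not part of the original post).
val spark = SparkSession.builder().master("local[*]").appName("trim-join-sketch").getOrCreate()
import spark.implicits._

val dataF1 = Seq(
  (1, "everlasting", 1.39), (1, "game", 2.7), (1, "life", 0.69), (1, "learning", 0.69),
  (2, "living", 1.38), (2, "worth", 1.38), (2, "life", 0.69),
  (3, "learning", 0.69), (3, "never", 1.38)
).toDF("ID", "token", "value")

// Note the trailing space in "life ", as in the question.
val dataF2 = Seq(("life ", 0.71), ("learning", 0.75)).toDF("token1", "val2")

// Strip leading and trailing spaces from the join key before joining.
val cleanedF2 = dataF2.withColumn("token1", trim($"token1"))

val wide = dataF1
  .join(cleanedF2, $"token" === $"token1", "inner")
  .groupBy("ID")
  .pivot("token")
  .agg(first("value"))
  .na.fill(0)
  .orderBy("ID")

wide.show()
```

With the key trimmed, "life" matches again, so the pivot should produce both a learning and a life column.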

Seems like you need a "left semi-join". It will filter one dataframe based on another one. Try using it like:

datF1.join(dataF2, $"token" === $"token1", "leftsemi")

You can find a bit more info here - https://medium.com/datamindedbe/little-known-spark-dataframe-join-types-cc524ea39fd5
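
Note that a left semi join only filters: it keeps the columns of datF1, drops those of dataF2, and leaves the data in long format, so a pivot is still needed to reach the layout asked for in the question. A rough sketch, assuming a local SparkSession and the token "life" without the trailing space (the names `semi` and `wide` are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

// Local session for illustration only (assumption, not part of the original post).
val spark = SparkSession.builder().master("local[*]").appName("leftsemi-sketch").getOrCreate()
import spark.implicits._

val datF1 = Seq(
  (1, "everlasting", 1.39), (1, "game", 2.7), (1, "life", 0.69), (1, "learning", 0.69),
  (2, "living", 1.38), (2, "worth", 1.38), (2, "life", 0.69),
  (3, "learning", 0.69), (3, "never", 1.38)
).toDF("ID", "token", "value")
val dataF2 = Seq(("life", 0.71), ("learning", 0.75)).toDF("token1", "val2")

// Keep only the datF1 rows whose token appears in dataF2; no dataF2 columns are added.
val semi = datF1.join(dataF2, $"token" === $"token1", "leftsemi")
semi.show()

// Reshape to one row per ID and one column per token, filling gaps with 0.
val wide = semi
  .groupBy("ID")
  .pivot("token")
  .agg(first("value"))
  .na.fill(0)
  .orderBy("ID")
wide.show()
```

Compared with the inner join, the semi join avoids carrying token1 and val2 through the pivot, which is harmless here but keeps the intermediate result narrower.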
