Spark SQL DataFrame join with filter is not working

I'm trying to filter df1 by joining it with df2 on a column, and then filter out some rows from df1 based on that join.

df1:

+---------------+----------+
|        channel|rag_status|
+---------------+----------+
|            STS|     green|
|Rapid Cash Plus|     green|
|        DOTOPAL|     green|
|     RAPID CASH|     green|
+---------------+----------+

df2:

+---------------+----------+
|        channel|rag_status|
+---------------+----------+
|            STS|      blue|
|Rapid Cash Plus|      blue|
|        DOTOPAL|      blue|
+---------------+----------+

Sample code is:

df1.join(df2, df1.col("channel") === df2.col("channel"), "leftouter")
      .filter(not(df1.col("rag_status") === "green"))
      .select(df1.col("channel"), df1.col("rag_status")).show

It's not returning any records.

I'm expecting the output below, which is what df1 should return after filtering the records based on the channel and green-status condition: if the same channel is present in df2 and the df1 rag_status is green, then remove that record from df1 and return only the remaining records from df1.

Expected output is:

+---------------+----------+
|        channel|rag_status|
+---------------+----------+
|     RAPID CASH|     green|
+---------------+----------+

You can do something like this:

val df1 = sc.parallelize(Seq(("STS", "green"), ("Rapid Cash Plus", "green"), ("RAPID CASH", "green"))).toDF("channel", "rag_status").where($"rag_status" === "green")
val df2 = sc.parallelize(Seq(("STS", "blue"), ("Rapid Cash Plus", "blue"), ("DOTOPAL", "blue"))).toDF("channel", "rag_status").withColumnRenamed("rag_status", "rag_status2")
// The left join keeps every df1 row; the inner join keeps only the rows whose
// channel also exists in df2, so except() leaves exactly the unmatched df1 rows.
val leftJoinResult = df1.join(df2, Array("channel"), "left")
val innerJoinResult = df1.join(df2, "channel")
val resultDF = leftJoinResult.except(innerJoinResult).drop("rag_status2")
resultDF.show

Spark-shell output:

scala> val df1=sc.parallelize(Seq(("STS","green"),("Rapid Cash Plus","green"),("RAPID CASH","green"))).toDF("channel","rag_status").where($"rag_status"==="green")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [channel: string, rag_status: string]

scala> val df2=sc.parallelize(Seq(("STS","blue"),("Rapid Cash Plus","blue"),("DOTOPAL","blue"))).toDF("channel","rag_status").withColumnRenamed("rag_status","rag_status2")
df2: org.apache.spark.sql.DataFrame = [channel: string, rag_status2: string]

scala> val leftJoinResult=df1.join(df2,Array("channel"),"left")
leftJoinResult: org.apache.spark.sql.DataFrame = [channel: string, rag_status: string ... 1 more field]

scala> val innerJoinResult=df1.join(df2,"channel")
innerJoinResult: org.apache.spark.sql.DataFrame = [channel: string, rag_status: string ... 1 more field]

scala> val resultDF=leftJoinResult.except(innerJoinResult).drop("rag_status2")
resultDF: org.apache.spark.sql.DataFrame = [channel: string, rag_status: string]

scala> resultDF.show
+----------+----------+                                                         
|   channel|rag_status|
+----------+----------+
|RAPID CASH|     green|
+----------+----------+
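On Spark 2.0 and later, the same left-join-minus-inner-join logic can be collapsed into a single step with the built-in left_anti join type, which returns only the df1 rows whose channel has no match in df2 (a minimal sketch, assuming the same df1 and df2 as above):

val antiJoinResult = df1.join(df2, Seq("channel"), "left_anti") // only df1's columns survive an anti join
antiJoinResult.show
// +----------+----------+
// |   channel|rag_status|
// +----------+----------+
// |RAPID CASH|     green|
// +----------+----------+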

You can get the expected output with the following code:

df1.join(df2, Seq("channel"), "leftouter")
  .filter(row => row(2) != "blue") // index 2 is df2's rag_status (the join de-duplicates "channel"); it is null for unmatched rows, and null != "blue" is true in Scala
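A sketch of the same filter written with Column expressions instead of positional indexing (assuming the df1 and df2 shown in the question): renaming df2's status column avoids ending up with two rag_status columns, and the explicit isNull check is needed because a SQL-level comparison against null yields null rather than true, so the unmatched rows would otherwise be dropped:

val joined = df1.join(df2.withColumnRenamed("rag_status", "rag_status2"), Seq("channel"), "leftouter")
joined
  .filter($"rag_status2".isNull || $"rag_status2" =!= "blue") // keep rows with no df2 match, or a non-blue match
  .select("channel", "rag_status")
  .show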
