Spark SQL DataFrame join with filter is not working
I'm trying to filter df1 by joining it with df2 on a column, and then remove some rows from df1 based on that filter.
df1:
+---------------+----------+
|        channel|rag_status|
+---------------+----------+
|            STS|     green|
|Rapid Cash Plus|     green|
|        DOTOPAL|     green|
|     RAPID CASH|     green|
+---------------+----------+
df2:
+---------------+----------+
|        channel|rag_status|
+---------------+----------+
|            STS|      blue|
|Rapid Cash Plus|      blue|
|        DOTOPAL|      blue|
+---------------+----------+
Sample code is:
df1.join(df2, df1.col("channel") === df2.col("channel"), "leftouter")
.filter(not(df1.col("rag_status") === "green"))
.select(df1.col("channel"), df1.col("rag_status")).show
It's not returning any records.
I'm expecting the output below, which is what df1 should return after filtering records based on the channel and green status condition. If the same channel is present in df2 and the df1 rag_status is green, then remove that record from df1 and return only the remaining records from df1.
Expected output is:
+---------------+----------+
|        channel|rag_status|
+---------------+----------+
|     RAPID CASH|     green|
+---------------+----------+
You can do something like this:
// Keep only the green rows of df1
val df1 = sc.parallelize(Seq(("STS", "green"), ("Rapid Cash Plus", "green"), ("RAPID CASH", "green")))
  .toDF("channel", "rag_status").where($"rag_status" === "green")
// Rename df2's status column so the two status columns stay distinguishable after the join
val df2 = sc.parallelize(Seq(("STS", "blue"), ("Rapid Cash Plus", "blue"), ("DOTOPAL", "blue")))
  .toDF("channel", "rag_status").withColumnRenamed("rag_status", "rag_status2")
// Left join keeps every df1 row; inner join keeps only the channels present in both
val leftJoinResult = df1.join(df2, Array("channel"), "left")
val innerJoinResult = df1.join(df2, "channel")
// Subtracting the inner join from the left join leaves the df1-only rows
val resultDF = leftJoinResult.except(innerJoinResult).drop("rag_status2")
resultDF.show
Spark-shell output:
scala> val df1=sc.parallelize(Seq(("STS","green"),("Rapid Cash Plus","green"),("RAPID CASH","green"))).toDF("channel","rag_status").where($"rag_status"==="green")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [channel: string, rag_status: string]
scala> val df2=sc.parallelize(Seq(("STS","blue"),("Rapid Cash Plus","blue"),("DOTOPAL","blue"))).toDF("channel","rag_status").withColumnRenamed("rag_status","rag_status2")
df2: org.apache.spark.sql.DataFrame = [channel: string, rag_status2: string]
scala> val leftJoinResult=df1.join(df2,Array("channel"),"left")
leftJoinResult: org.apache.spark.sql.DataFrame = [channel: string, rag_status: string ... 1 more field]
scala> val innerJoinResult=df1.join(df2,"channel")
innerJoinResult: org.apache.spark.sql.DataFrame = [channel: string, rag_status: string ... 1 more field]
scala> val resultDF=leftJoinResult.except(innerJoinResult).drop("rag_status2")
resultDF: org.apache.spark.sql.DataFrame = [channel: string, rag_status: string]
scala> resultDF.show
+----------+----------+
| channel|rag_status|
+----------+----------+
|RAPID CASH| green|
+----------+----------+
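As an alternative to the left-join-plus-except approach above, Spark also has a built-in `left_anti` join type that expresses "rows of df1 with no matching channel in df2" directly. A minimal sketch, assuming the same df1/df2 data as above and a spark-shell session (so `spark` is in scope):

```scala
// Sketch: the same filtering via a left_anti join
// (assumes a SparkSession `spark` is in scope, as in spark-shell)
import spark.implicits._

val df1 = Seq(("STS", "green"), ("Rapid Cash Plus", "green"), ("RAPID CASH", "green"))
  .toDF("channel", "rag_status")
val df2 = Seq(("STS", "blue"), ("Rapid Cash Plus", "blue"), ("DOTOPAL", "blue"))
  .toDF("channel", "rag_status")

// left_anti keeps only the df1 rows whose channel has no match in df2,
// and it never brings df2's columns along, so no rename/drop is needed
val resultDF = df1.join(df2, Seq("channel"), "left_anti")
resultDF.show()
```

Because an anti join never materializes df2's columns, there is nothing to rename or drop afterwards, and only `RAPID CASH` / `green` survives.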
You can get the expected output with the following code:
df1.join(df2, Seq("channel"), "leftouter").filter(row => row(2) != "blue")  // index 2 is df2's rag_status; it is null when there is no match
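Filtering by positional index is fragile: the index silently changes if either schema changes. A sketch of the same idea with an explicit null check, assuming `df1`/`df2` as defined in the question and renaming df2's status column first so it can be referenced unambiguously:

```scala
// Sketch: keep a df1 row only when the left outer join found no df2 match,
// i.e. when df2's (renamed) status column is null
import org.apache.spark.sql.functions.col

val df2Renamed = df2.withColumnRenamed("rag_status", "rag_status2")

df1.join(df2Renamed, Seq("channel"), "leftouter")
  .filter(col("rag_status2").isNull)          // null => channel absent from df2
  .select(col("channel"), col("rag_status"))
  .show()
```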