How to take lines from a dataframe that are not in another dataframe using Spark/Scala
I have a dataframe:
+----+----+
|Col1|col2|
+----+----+
|   A|  A2|
|   A|  A2|
|   B|  b2|
|   B|  b2|
|   C|  c2|
|   D|  d2|
|   E|  e2|
|   F|  f2|
+----+----+
And another dataframe:
+----+----+
|Col1|col2|
+----+----+
|   A|  A2|
|   B|  b2|
|   C|  c2|
+----+----+
I want the result to be:
+----+----+
|Col1|col2|
+----+----+
|   D|  d2|
|   E|  e2|
|   F|  f2|
+----+----+
I tried this:
df1.join(df2, Seq("col1","col2"), "left")
But it doesn't work for me. Any ideas? Thank you.
We can use .except or a left join (filtering out matched rows) for this case.
Example:
df.show()
//+----+----+
//|Col1|Col2|
//+----+----+
//| A| A2|
//| A| A2|
//| B| b2|
//| B| b2|
//| C| c2|
//| D| d2|
//| E| e2|
//| F| f2|
//+----+----+
df1.show()
//+----+----+
//|Col1|Col2|
//+----+----+
//| A| A2|
//| B| b2|
//| C| c2|
//+----+----+
df.except(df1).show()
//+----+----+
//|Col1|Col2|
//+----+----+
//| E| e2|
//| F| f2|
//| D| d2|
//+----+----+
import org.apache.spark.sql.functions.col

df.alias("d1").join(df1.alias("d2"),
    col("d1.Col1") === col("d2.Col1") && col("d1.Col2") === col("d2.Col2"),
    "left").
  filter(col("d2.Col2").isNull).
  select("d1.*").
  show()
//+----+----+
//|Col1|Col2|
//+----+----+
//| D| d2|
//| E| e2|
//| F| f2|
//+----+----+
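As a side note, Spark also ships a dedicated join type, "left_anti", which returns exactly the left rows that have no match on the right, so the alias/filter/select combination above collapses to a single df.join(df1, Seq("Col1","Col2"), "left_anti") call. The semantics of such an anti-join can be sketched with plain Scala collections standing in for the dataframes (tuples here are illustrative stand-ins, not Spark rows):

```scala
// Left and right "tables" as plain (Col1, Col2) tuples.
val left  = Seq(("A","A2"), ("A","A2"), ("B","b2"), ("B","b2"),
                ("C","c2"), ("D","d2"), ("E","e2"), ("F","f2"))
val right = Seq(("A","A2"), ("B","b2"), ("C","c2"))

// Anti-join: keep only left rows with no matching row on the right.
// A Set is a (String, String) => Boolean function, so it works as the predicate.
val rightSet = right.toSet
val anti = left.filterNot(rightSet)

println(anti) // List((D,d2), (E,e2), (F,f2))
```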
You can use except on the two dataframes.
scala> df1.except(df2).show
+----+----+
|Col1|col2|
+----+----+
| E| e2|
| F| f2|
| D| d2|
+----+----+
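One caveat: except behaves like SQL's EXCEPT DISTINCT, so it also de-duplicates the surviving rows, whereas a left anti join keeps duplicates from the left side. The difference can be sketched with plain Scala collections (the tuples below are illustrative stand-ins for dataframe rows):

```scala
// "D" appears twice on the left and has no match on the right.
val rows  = Seq(("D","d2"), ("D","d2"), ("F","f2"))
val other = Seq(("F","f2"))
val otherSet = other.toSet

// except-like: set difference over distinct rows.
val exceptLike = rows.distinct.filterNot(otherSet)
// anti-join-like: keeps both copies of the unmatched duplicate row.
val antiLike = rows.filterNot(otherSet)

println(exceptLike) // List((D,d2))
println(antiLike)   // List((D,d2), (D,d2))
```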