Spark left_outer join is returning items that exist in both lists
I'm running spark-shell to compare two CSV files. Each file has the same number of columns, and both have 600,000 rows. I expect the two files to contain the same rows. Here is my script:
val a =
  spark
    .read
    .option("header", "true")
    .option("delimiter", "|")
    .csv("/tmp/1.csv")
    .drop("unwanted_column")
    .cache()
val b =
  spark
    .read
    .option("header", "true")
    .option("delimiter", "|")
    .csv("/tmp/2.csv")
    .drop("unwanted_column")
    .cache()
val c = a.join(b, Seq("id", "year"), "left_outer").cache()
c.count() // this is returning 600,000
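A left outer join keeps every row of a whether or not it matches a row in b, so when the keys line up one-to-one the count simply equals a's row count. Here is a minimal plain-Scala sketch of that counting behavior (no Spark; the sample data and the helper name `leftOuterCount` are made up for illustration):

```scala
// Plain-Scala sketch of left outer join counting semantics (no Spark).
// Every left row survives the join; each extra right-side match adds a row.
def leftOuterCount(left: Seq[(String, String)], right: Seq[(String, String)]): Int =
  left.map { key =>
    val matches = right.count(_ == key)
    math.max(matches, 1) // an unmatched left row still emits one (null-padded) row
  }.sum

val a = Seq(("1", "2016"), ("2", "2016"))
val b = Seq(("1", "2016"), ("2", "2016"))
```

With unique keys on both sides, `leftOuterCount(a, b)` equals `a.size`; duplicate keys on the right would inflate the count above it, and keys missing from the right would not shrink it.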
Now I'm trying to find the difference by randomly picking a row with the same id and year from the two datasets a and b:
val a1 = a.filter(i => i.get(0).equals("1") && i.get(1).equals("2016")).first()
val b1 = b.filter(i => i.get(0).equals("1") && i.get(1).equals("2016")).first()
Then I try to compare each column of a1 and b1:
(0 to (a1.length - 1)).foreach { i =>
  if (a1.getString(i) != null && !a1.getString(i).equals(b1.getString(i))) {
    System.out.println(i + " = " + a1.getString(i) + " = " + b1.getString(i))
  }
}
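One caveat worth noting: because the loop first checks `a1.getString(i) != null`, a cell that is null in a1 but non-null in b1 is silently skipped. A null-safe comparison can be sketched in plain Scala like this (cells modeled as `String`s; the helper name `differs` is my own):

```scala
// Null-safe cell comparison sketch (plain Scala, cells modeled as Strings).
// Unlike the loop above, this also flags the case where only a1's cell is null.
def differs(x: String, y: String): Boolean = (x, y) match {
  case (null, null)          => false // both missing: equal
  case (null, _) | (_, null) => true  // missing on one side only: different
  case _                     => x != y
}
```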
It didn't print anything. In other words, there is no difference. I can't tell why c.count() is returning 600,000 like that.
Sorry guys, I guess it was my fault. What I was actually after was a.subtract(b). My purpose is to find the difference between a and b, and I was confused about what a left_outer join does.
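For the record, subtract-style set difference keeps only the rows of a that do not appear in b, which is what a left outer join cannot tell you on its own. A plain-Scala stand-in for that behavior (the helper name `subtract` and the sample rows are illustrative, not Spark's implementation):

```scala
// Sketch of a.subtract(b) semantics: keep rows of `a` that do not appear in `b`.
// Plain-Scala stand-in for RDD.subtract / Dataset.except.
def subtract[T](a: Seq[T], b: Seq[T]): Seq[T] = {
  val inB = b.toSet
  a.filterNot(inB.contains)
}

val left  = Seq(("1", "2016", "x"), ("2", "2016", "y"))
val right = Seq(("1", "2016", "x"))
```

If both CSVs truly contain the same rows, `a.subtract(b)` and `b.subtract(a)` should both come back empty.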