Spark left_outer join is returning items that exist in both lists

I'm running spark-shell to compare 2 CSV files. Each file has the same number of columns and both have 600,000 rows. I'm expecting the two files to have all the same rows. Here is my script.

val a =
  spark
    .read
    .option("header", "true")
    .option("delimiter", "|")
    .csv("/tmp/1.csv")
    .drop("unwanted_column")
    .cache()

val b =
  spark
    .read
    .option("header", "true")
    .option("delimiter", "|")
    .csv("/tmp/2.csv")
    .drop("unwanted_column")
    .cache()

val c = a.join(b, Seq("id", "year"), "left_outer").cache()

c.count() // this is returning 600,000

Now I'm trying to find the difference by randomly picking a row with the same id and year from the two datasets a and b.

val a1 = a.filter(i => i.get(0).equals("1") && i.get(1).equals("2016")).first()

val b1 = b.filter(i => i.get(0).equals("1") && i.get(1).equals("2016")).first()

Then I try to compare each column of a1 and b1.

(0 until a1.length).foreach { i =>
  if (a1.getString(i) != null && !a1.getString(i).equals(b1.getString(i))) {
    println(i + " = " + a1.getString(i) + " = " + b1.getString(i))
  }
}

It didn't print anything. In other words, there is no difference.
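
One caveat I noticed with that loop: because of the a1.getString(i) != null guard, a row where a1 has a null but b1 has a value would be skipped silently. A null-safe variant of the same comparison (just a sketch):

// Wrap both sides in Option so that a null on either side still counts as a difference
(0 until a1.length).foreach { i =>
  val av = Option(a1.getString(i))
  val bv = Option(b1.getString(i))
  if (av != bv) {
    println(s"$i = $av = $bv")
  }
}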

I can't tell why c.count() is returning 600,000.
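
As I understand it now, the count by itself is not surprising: a left_outer join keeps every row of a whether or not it has a match in b, so as long as (id, year) is unique in both files the result has exactly a.count() rows no matter what b contains. A minimal sketch with toy data (the values and the column names v1 and v2 are made up purely for illustration):

import spark.implicits._  // already in scope in spark-shell

// Toy data: ("2", "2016") exists only on the left side
val left  = Seq(("1", "2016", "foo"), ("2", "2016", "bar")).toDF("id", "year", "v1")
val right = Seq(("1", "2016", "foo")).toDF("id", "year", "v2")

// Every row of `left` survives the join; the unmatched one just gets a null v2
val joined = left.join(right, Seq("id", "year"), "left_outer")
joined.count()  // 2, i.e. left.count(), regardless of what `right` contains
joined.show()   // the ("2", "2016", "bar") row appears with v2 = null

So a count of 600,000 only says that every row of a produced exactly one output row; it says nothing about whether the non-key columns actually match.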

Sorry guys, I guess it was my fault. What I was actually after was a.subtract(b). My purpose is to find out the difference between a and b. I was confused about the left_outer join.
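
For anyone landing here with the same goal: with the Scala Dataset API the set-difference method is except (or exceptAll in Spark 2.4+ if duplicate rows matter), which is the counterpart of the RDD-style subtract. A sketch of the comparison I was actually after, reusing the cached a and b from above:

// Rows present on one side but not the other; `except` drops duplicate rows
val onlyInA = a.except(b)   // rows in a that do not appear in b
val onlyInB = b.except(a)   // rows in b that do not appear in a

onlyInA.count()             // 0 means every row of a also appears in b
onlyInB.count()
onlyInA.show(10, truncate = false)  // inspect a few of the differing rows, if any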
