How to compare two RDDs using Scala + Spark when neither has a key?

Question

I want to compare the data in two RDDs. How can I iterate over the records and compare the field data in one RDD with the field data in the other? Example files are below:

File1 
 f1  f2       f3    f4    f5      f6  f7
 1 Nancyxyz 23456 12:30 NEWYORK 9000 xyz 
 2 ranboxys 12345 12:30 NEWYORK 9000 xyz

 File2
 f1  f2       f3    f4    f5      f6  f7
 2 ranboxys 12345 12:30 NEWYORK 9000 xyz
 1 markalan 23456 12:30 LONDON  7000 xyz 
 3 Loyleeie 45678 12:40 London  9001 abc

In the files above, the first two records are the same in both files, but they appear in a different order. I want to compare the two RDDs and print only the record that differs, i.e.:

 File2
 3 Loyleeie 45678 12:40 London  9001 abc

I don't want the first two records from either RDD, because they are the same and only the order differs. Can you please explain how to do that with RDDs in Scala?

I tried many options, such as subtract and a while loop, but had no luck.

I have just changed the 2nd record in File2. Now I want to print the 2nd and 3rd records of File2, along with the modified fields. I don't know in advance which field has changed; the comparison against File1 should print every record that does not match and, on a separate line, which fields were changed.

Answer 1

Assuming that File1 and File2 are of type RDD[String], the following operation returns all elements that are in File2 but not in File1:

scala> val File1 = spark.sparkContext.textFile("File1.txt")

scala> val File2 = spark.sparkContext.textFile("File2.txt")

scala> File2.subtract(File1).collect
res0: Array[String] = Array(" 3 Loyleeie 45678 12:40 London  9001 abc")
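
Note that subtract compares whole strings exactly, so two records that differ only in padding (a trailing space, or two spaces between columns instead of one) count as different. That is one possible reason subtract might seem not to work on files like these. A minimal sketch of a normalizing helper follows; the names LineNormalizer and normalize are illustrative and not part of Spark:

```scala
object LineNormalizer {
  // Collapse every run of whitespace into a single space and drop leading and
  // trailing whitespace, so records that differ only in padding compare equal.
  def normalize(line: String): String =
    line.trim.split("\\s+").mkString(" ")
}
```

With the RDDs above, this could be applied as `File2.map(LineNormalizer.normalize).subtract(File1.map(LineNormalizer.normalize))`. Since subtract is one-directional, records present only in File1 would need the reverse call, for example `a.subtract(b).union(b.subtract(a))` for both directions.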

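To extract only the name from the differing records: the name is the 2nd field of each line. Because each line starts with a space, split(" ") yields an empty first token, so the name sits at index 2 (if you trim the line first, it is at index 1).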

scala> File2.subtract(File1).map { x => x.split(" ")(2) }.collect
res1: Array[String] = Array(Loyleeie)

If tab is your separator, replace the delimiter accordingly.
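
For the follow-up about printing which fields changed, subtract alone is not enough, because it only reports whole lines. A minimal sketch of a field-by-field comparison is below. The names FieldDiff, fields and changedFields are hypothetical, and it assumes the fields are separated by whitespace:

```scala
object FieldDiff {
  // Split a record into fields on runs of whitespace, ignoring padding.
  def fields(line: String): Vector[String] =
    line.trim.split("\\s+").toVector

  // Return (fieldNumber, oldValue, newValue) for each 1-based position where
  // the two records differ; a field missing on one side is reported as "".
  def changedFields(oldLine: String, newLine: String): Vector[(Int, String, String)] = {
    val a = fields(oldLine)
    val b = fields(newLine)
    a.zipAll(b, "", "").zipWithIndex.collect {
      case ((x, y), i) if x != y => (i + 1, x, y)
    }
  }
}
```

Pairing an old record with its new version still needs some identifier. If f1 can be treated as one (an assumption, since the question says there is no key), the RDDs could be paired with `File1.keyBy(l => FieldDiff.fields(l).head).join(File2.keyBy(l => FieldDiff.fields(l).head))` and each pair passed to changedFields. Records that exist only in File2, such as record 3, do not appear in the join and would still come from subtract or subtractByKey.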
