Joining files using Apache Spark / Spark SQL

Question

I am trying to use Apache Spark to compare two different files on a common field, take the values from both files, and write the result to an output file.

I am using Spark SQL to join the two files (after registering each RDD as a table).

Is this the correct approach?

Can we compare / join the files without Spark SQL?

Please advise.

Answer 1

Try an inner join between the two DataFrames built from your datasets to get the matching records.
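As a rough sketch of that suggestion in PySpark: the helper names (`join_on_key`, `join_files`), the CSV-with-header file layout, and all paths are assumptions made for illustration, not something from the question. `spark` is assumed to be an existing SparkSession.

```python
def join_on_key(left_df, right_df, key):
    """Inner-join two Spark DataFrames on a shared column name.

    Passing the column name via `on=` keeps a single copy of the
    key column in the result.
    """
    return left_df.join(right_df, on=key, how="inner")


def join_files(spark, left_path, right_path, key, out_path):
    """Read two CSV files, keep only rows whose key matches, write the result.

    Assumes both files have a header row containing the `key` column;
    the paths are hypothetical.
    """
    left_df = spark.read.csv(left_path, header=True)
    right_df = spark.read.csv(right_path, header=True)
    joined = join_on_key(left_df, right_df, key)
    joined.write.mode("overwrite").csv(out_path, header=True)
    return joined
```

With a session in hand, a call might look like `join_files(spark, "a.csv", "b.csv", "id", "out")`. Changing `how="inner"` to `"left"`, `"right"`, or `"outer"` keeps unmatched rows as well.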

Answer 2

If you use plain Spark (without Spark SQL), you can join two RDDs directly.

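// pseudocode: a and b are pair RDDs keyed by the same type K
a : RDD<Tuple2<K, T>>
b : RDD<Tuple2<K, S>>
// the joined value holds a's value first, then b's
RDD<Tuple2<K, Tuple2<T, S>>> c = a.join(b)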

This produces an RDD containing, for each key K present in both inputs, every pair of values that share that key. There are also leftOuterJoin, rightOuterJoin, and fullOuterJoin methods on RDD.
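To make "every pair for key K" concrete, here is a small plain-Python model of what the inner and left outer joins return. It is not Spark code: the function names are made up for illustration, the inputs are ordinary lists of (key, value) tuples, and a real RDD gives no guarantee about result order.

```python
from collections import defaultdict


def _group_by_key(pairs):
    """Collect the values of (key, value) pairs into a list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def inner_join(a, b):
    """Model of a.join(b): one output per matching (a value, b value) pair."""
    b_by_key = _group_by_key(b)
    return [(k, (t, s)) for k, t in a for s in b_by_key.get(k, [])]


def left_outer_join(a, b):
    """Model of a.leftOuterJoin(b): unmatched keys from a are paired with None."""
    b_by_key = _group_by_key(b)
    return [(k, (t, s)) for k, t in a for s in b_by_key.get(k, [None])]


a = [("x", 1), ("x", 2), ("y", 3)]
b = [("x", "p"), ("x", "q"), ("z", "r")]

print(inner_join(a, b))
# [('x', (1, 'p')), ('x', (1, 'q')), ('x', (2, 'p')), ('x', (2, 'q'))]
print(left_outer_join(a, b)[-1])
# ('y', (3, None))
```

Key "x" appears twice on each side, so the inner join yields 2 x 2 = 4 records for it. Keys "y" and "z" have no partner and only show up in the outer variants.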

So you have to map both datasets to produce two RDDs keyed by your common field, then join them. Here is the documentation I'm referencing.
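The map-then-join step could look roughly like this in PySpark (Python rather than the Java-style notation above). The helper names are hypothetical. The sketch assumes comma-separated text files with no header row and the common field in the first column, and it expects `sc` to be an existing SparkContext.

```python
def to_pair(line, key_index=0, sep=","):
    """Split one text line into (key, remaining_fields)."""
    fields = line.split(sep)
    key = fields[key_index]
    rest = fields[:key_index] + fields[key_index + 1:]
    return (key, rest)


def format_joined(record, sep=","):
    """Flatten (key, (fields_a, fields_b)) back into one output line."""
    key, (fields_a, fields_b) = record
    return sep.join([key] + fields_a + fields_b)


def join_text_files(sc, path_a, path_b, out_path):
    """Join two delimited text files on their key column using plain RDDs.

    The paths are hypothetical; out_path must not exist yet, because
    saveAsTextFile refuses to overwrite an existing directory.
    """
    a = sc.textFile(path_a).map(to_pair)   # RDD of (key, fields_a)
    b = sc.textFile(path_b).map(to_pair)   # RDD of (key, fields_b)
    joined = a.join(b)                      # RDD of (key, (fields_a, fields_b))
    joined.map(format_joined).saveAsTextFile(out_path)
```

For example, `to_pair("1,alice,30")` gives `("1", ["alice", "30"])`, and a joined record `("1", (["alice", "30"], ["NY"]))` is written out as `1,alice,30,NY`. Swapping `join` for `leftOuterJoin` would need `format_joined` to handle a `None` on the right side.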

2 answers

Answer 1: score 1, posted 2017-03-07 11:57:31

Answer 2: score 0, accepted, posted 2015-06-22 07:18:37
