
Joining data sets in Spark

What are the different ways to join data in Spark?

Hadoop MapReduce provides the distributed cache, map-side joins, and reduce-side joins. What about Spark?

Also, it would be great if you could provide simple Scala and Python code to join data sets in Spark.

Spark has two fundamental distributed data objects: DataFrames and RDDs.

A special case of RDDs, pair RDDs (RDDs of key-value pairs), can be joined on their keys. This is available through PairRDDFunctions.join(). See: https://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
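
As a minimal Scala sketch (the sample data, keys, and app name below are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("pair-rdd-join").setMaster("local[*]"))

    // Two pair RDDs keyed by an integer id
    val names = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
    val ages  = sc.parallelize(Seq((1, 30), (2, 25)))

    // join() performs an inner join on the keys and yields (key, (leftValue, rightValue))
    val joined = names.join(ages)        // RDD[(Int, (String, Int))]
    joined.collect().foreach(println)    // (1,(alice,30)), (2,(bob,25)); key 3 is dropped

PairRDDFunctions also offers leftOuterJoin, rightOuterJoin, fullOuterJoin, and cogroup for the other join variants.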

DataFrames also allow a SQL-like join. See: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
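
A minimal DataFrame sketch, assuming a Spark 2.x SparkSession and made-up column names:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("df-join").master("local[*]").getOrCreate()
    import spark.implicits._

    val names = Seq((1, "alice"), (2, "bob"), (3, "carol")).toDF("id", "name")
    val ages  = Seq((1, 30), (2, 25)).toDF("id", "age")

    // Join on the "id" column; the third argument selects the join type
    // ("inner", "left_outer", "right_outer", "full_outer", ...)
    names.join(ages, Seq("id"), "inner").show()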
