
Joining data sets in Spark

What are the different ways to join data in Spark?

Hadoop MapReduce provides the distributed cache, map-side joins, and reduce-side joins. What about Spark?

Also, it would be great if you could provide simple Scala and Python code to join data sets in Spark.

Spark has two fundamental distributed data objects: DataFrames and RDDs.

A special case of RDDs, pair RDDs (RDDs of key-value pairs), can be joined on their keys. This is available through PairRDDFunctions.join(). See: https://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
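
As a minimal Scala sketch (the sample data, keys, and app name below are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("pair-rdd-join").setMaster("local[*]"))

    // Two pair RDDs keyed by an integer id
    val names = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
    val ages  = sc.parallelize(Seq((1, 30), (2, 25)))

    // join() performs an inner join on the keys and yields (key, (leftValue, rightValue))
    val joined = names.join(ages)        // RDD[(Int, (String, Int))]
    joined.collect().foreach(println)    // (1,(alice,30)), (2,(bob,25)); key 3 is dropped

PairRDDFunctions also offers leftOuterJoin, rightOuterJoin, fullOuterJoin, and cogroup for the other join variants.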

DataFrames also allow a SQL-like join. See: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
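
A minimal DataFrame sketch, assuming a Spark 2.x SparkSession and made-up column names:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("df-join").master("local[*]").getOrCreate()
    import spark.implicits._

    val names = Seq((1, "alice"), (2, "bob"), (3, "carol")).toDF("id", "name")
    val ages  = Seq((1, 30), (2, 25)).toDF("id", "age")

    // Join on the "id" column; the third argument selects the join type
    // ("inner", "left_outer", "right_outer", "full_outer", ...)
    names.join(ages, Seq("id"), "inner").show()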
