Spark - Scala - Join RDDs (CSV files)
I'm just getting started with Scala. A requirement came up, and I need to know how to join two datasets on two fields, like in a relational database.
Example: Table 1 (csv)

zip_type, primary_city, acceptable_cities, unacceptable_cities
Example: Table 2 (csv)

GEO.id, GEO.id2, GEO.display-label, VD01
Question:
I want to join column 1 of Table 1 (zip_type) with column 2 of Table 2 (GEO.id2).
Currently I:
What do I need to do next?
To perform a join you need pair RDDs sharing the same key column. Consider transforming RDD 1 into an RDD of (K, V) tuples with zip_type as the key, and similarly RDD 2 with GEO.id2 as the key.
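The keying-and-joining idea above can be sketched in plain Scala; the sample rows, file contents, and column positions are hypothetical, and the local `join` helper stands in for Spark's `PairRDDFunctions.join`, which behaves the same way on real RDDs:

```scala
object CsvJoinSketch {
  // Hypothetical sample lines standing in for table1.csv and table2.csv.
  val table1 = Seq(
    "10001,New York,NYC,Manhattanville",
    "60601,Chicago,Chi-town,The Loop"
  )
  val table2 = Seq(
    "8600000US10001,10001,ZCTA5 10001,45000",
    "8600000US60601,60601,ZCTA5 60601,52000"
  )

  // Key each CSV row by the join column, as rdd.map(...) would in Spark.
  def keyBy(lines: Seq[String], col: Int): Seq[(String, Array[String])] =
    lines.map { line =>
      val fields = line.split(",")
      (fields(col).trim, fields)
    }

  // Local stand-in for PairRDDFunctions.join: inner join on the key.
  def join[V, W](left: Seq[(String, V)],
                 right: Seq[(String, W)]): Seq[(String, (V, W))] = {
    val rightByKey = right.groupBy(_._1)
    for {
      (k, v)  <- left
      (_, w)  <- rightByKey.getOrElse(k, Seq.empty)
    } yield (k, (v, w))
  }

  def main(args: Array[String]): Unit = {
    val left   = keyBy(table1, 0) // zip_type is the first column of Table 1
    val right  = keyBy(table2, 1) // GEO.id2 is the second column of Table 2
    val joined = join(left, right)
    joined.foreach { case (zip, (t1, t2)) =>
      println(s"$zip -> ${t1(1)}, VD01=${t2(3)}")
    }
  }
}
```

With actual RDDs the same shape is, roughly, `rdd1.map(l => (l.split(",")(0), l)).join(rdd2.map(l => (l.split(",")(1), l)))`; Spark makes `join` available on `(K, V)` RDDs via the implicit conversion to `PairRDDFunctions`.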