
Spark - Scala - Join RDDs (CSV) files

I'm just getting started with Scala, and in these first steps a requirement has come up: I need to know how to join two datasets on a field, like a join between two tables in a relational database.

Example:

Table 1 (CSV)

zip_type, primary_city, acceptable_cities, unacceptable_cities

Example:

Table 2 (CSV)

GEO.id, GEO.id2, GEO.display-label, VD01

Question:

I want to join column 1 of Table 1 (zip_type) with column 2 of Table 2 (GEO.id2).

Currently I have:

  • Created an RDD from my CSV file
  • Processed each line using a CSV parser, but I'm having trouble making the join (a minimal sketch of these steps follows this list).
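A minimal sketch of those two steps, assuming sc.textFile loading and a naive split(",") in place of whatever CSV parser is actually in use; the path table1.csv is a placeholder, not the real location:

    import org.apache.spark.{SparkConf, SparkContext}

    object LoadCsv {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("LoadCsv").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // Placeholder path; split(",") stands in for a real CSV parser
        // and will break on quoted fields that contain commas.
        val lines  = sc.textFile("table1.csv")
        val header = lines.first()
        val rows   = lines.filter(_ != header).map(_.split(",").map(_.trim))

        rows.take(5).foreach(cols => println(cols.mkString(" | ")))
        sc.stop()
      }
    }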

What do I need to do next?

To make a join you need pair RDDs that share the same key column. Consider transforming RDD 1 into an RDD of (K, V) tuples keyed by zip_type, and similarly RDD 2 keyed by GEO.id2; you can then join the two on that key.
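A minimal sketch of that suggestion, assuming local CSV files and a naive split(",") instead of a proper CSV parser; the paths table1.csv and table2.csv are placeholders, not actual locations:

    import org.apache.spark.{SparkConf, SparkContext}

    object ZipJoin {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("ZipJoin").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // Placeholder paths; split(",") will break on quoted fields containing commas.
        val rdd1 = sc.textFile("table1.csv")
          .map(_.split(",").map(_.trim))
          .map(cols => (cols(0), cols))   // key Table 1 by zip_type (column 0)

        val rdd2 = sc.textFile("table2.csv")
          .map(_.split(",").map(_.trim))
          .map(cols => (cols(1), cols))   // key Table 2 by GEO.id2 (column 1)

        // join is an inner join: it yields (key, (table1Row, table2Row))
        // only for keys that appear in both RDDs.
        val joined = rdd1.join(rdd2)

        joined.take(10).foreach { case (key, (row1, row2)) =>
          println(s"$key -> ${row1.mkString("|")} ++ ${row2.mkString("|")}")
        }

        sc.stop()
      }
    }

If rows from Table 1 with no match in Table 2 should be kept, leftOuterJoin can be used instead of join.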
