Joining two RDDs when keys are not in the same place

I have two RDDs. Elements of RDD1 look like [123, 456, 789], and tuples in RDD2 look like [456, 999]. I need to join these two RDDs on 456, which is the second element in RDD1 and the first element in RDD2. The final output should look something like [123, 456, 789, 999]. Is there a way to do this, or do the keys need to be in the first position for the join? Thanks in advance for your time.

You could convert the RDDs to DataFrames and then do a simple join, as shown below.

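# Assumes an active SparkContext `sc` and SparkSession (e.g. the pyspark shell).
rdd1 = sc.parallelize([(123, 456, 789)])
rdd2 = sc.parallelize([(456, 999)])
# toDF() names the columns _1, _2, ... by position.
df1 = rdd1.toDF()
df2 = rdd2.toDF()
# Join on the second column of df1 and the first column of df2.
result = df1.join(df2, df1['_2'] == df2['_1'])
# Joined rows are (_1, _2, _3, _1, _2); index 3 repeats the key, so skip it.
result.rdd.map(lambda x: (x[0], x[1], x[2], x[4])).collect()
[(123, 456, 789, 999)]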
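
On the second part of the question: RDD.join works on pair RDDs and matches on the first element of each (key, value) pair, so the key does need to come first for the join itself. You don't have to restructure your data by hand, though, since keyBy can re-key RDD1 on the fly. Here is a minimal sketch that stays in the RDD API. The helper name join_on_field and its key_index parameter are made up for illustration, and it assumes RDD2 already holds (key, value) pairs as in the question.

```python
def join_on_field(left, right, key_index=1):
    """Join rows of `left` to (key, value) pairs of `right`.

    `left` is an RDD of tuples; `right` is an RDD of (key, value) pairs.
    The join key on the left side is row[key_index].
    """
    # keyBy turns each row into (key, row), e.g. (456, (123, 456, 789)).
    keyed = left.keyBy(lambda row: row[key_index])
    # join matches on the key: (456, ((123, 456, 789), 999)).
    joined = keyed.join(right)
    # Append the right-hand value to the original left-hand row.
    return joined.map(lambda kv: tuple(kv[1][0]) + (kv[1][1],))
```

In a pyspark shell, join_on_field(rdd1, rdd2).collect() on the two RDDs above should give the same [(123, 456, 789, 999)] as the DataFrame version. Like any inner join, rows in RDD1 with no matching key in RDD2 are dropped.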
