Joining two RDDs when keys are not in the same place

Question

I have two RDDs. Elements of RDD1 look like [123, 456, 789], and tuples in RDD2 look like [456, 999]. I need to join these two RDDs on 456, which is the second element in RDD1 and the first element in RDD2. The final output should look like [123, 456, 789, 999]. Can this be done, or do the keys need to be in the first position for the join? Thanks in advance for your time.

Answer 1

You could convert the RDDs to DataFrames and then do a simple join, as shown below.

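# Assumes a PySpark shell, where sc (SparkContext) and a SparkSession already exist.
rdd1 = sc.parallelize([(123, 456, 789)])
rdd2 = sc.parallelize([(456, 999)])
df1 = rdd1.toDF()  # columns _1, _2, _3
df2 = rdd2.toDF()  # columns _1, _2
# Join on the 2nd column of df1 and the 1st column of df2.
result = df1.join(df2, df1['_2'] == df2['_1'])
# A joined row is (_1, _2, _3, _1, _2); index 3 repeats the key, so skip it.
result.rdd.map(lambda x: (x[0], x[1], x[2], x[4])).collect()
[(123, 456, 789, 999)]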
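
If you would rather stay with plain RDDs, note that RDD join works on (key, value) pairs, so the key does have to come first there, but you can re-key each RDD on the fly instead of reshaping your data. Below is a minimal sketch of that idea, not a definitive implementation. The helper names merge_pair and join_on_second_field are made up for illustration, and the usage comment assumes the same sc as above.

```python
def merge_pair(kv):
    """Flatten (key, (left_row, right_row)) into a single tuple.

    The key is already present in left_row, so right_row[0] is dropped.
    """
    _key, (left_row, right_row) = kv
    return tuple(left_row) + tuple(right_row[1:])


def join_on_second_field(rdd1, rdd2):
    """Join rdd1 (key at index 1) with rdd2 (key at index 0)."""
    left = rdd1.keyBy(lambda row: row[1])   # e.g. (456, (123, 456, 789))
    right = rdd2.keyBy(lambda row: row[0])  # e.g. (456, (456, 999))
    # join yields (key, (left_row, right_row)) for every matching key
    return left.join(right).map(merge_pair)


# Usage, assuming an existing SparkContext named sc:
# join_on_second_field(sc.parallelize([(123, 456, 789)]),
#                      sc.parallelize([(456, 999)])).collect()
```

Like the DataFrame version, this is an inner join, so rows whose key has no match on the other side are dropped.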

1 answer

solution1
0 ACCEPTED 2017-02-25 18:18:26