How to perform a VLOOKUP-style join on Spark RDDs

I have two RDDs:

rdd1 =[('1', 3428), ('2', 2991), ('3', 2990), ('4', 2883), ('5', 2672), ('5', 2653)]
rdd2 = [['1', 'Toy Story (1995)'], ['2', 'Jumanji (1995)'], ['3', 'Grumpier Old Men (1995)']]

I want to replace the first element of each pair in the first RDD with the matching second element from the second RDD.

My final result should look like this:

[('Toy Story (1995)', 3428), ('Jumanji (1995)', 2991), ('Grumpier Old Men (1995)', 2990)]

Please suggest a way to do this.

Use join followed by map. Note that join is an inner join, so keys in rdd1 with no match in rdd2 (here '4' and '5') are dropped:

rdd1.join(rdd2).map(lambda x: (x[1][1], x[1][0])).collect()
#[('Toy Story (1995)', 3428),
# ('Jumanji (1995)', 2991),
# ('Grumpier Old Men (1995)', 2990)]
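To see why the map indexes x[1][1] and x[1][0], it helps to look at the shape join produces: each matched key yields (key, (left_value, right_value)). Below is a minimal pure-Python simulation of that pair-RDD join (no Spark required; pair_join is a hypothetical helper written for illustration):

```python
# Pure-Python sketch of what rdd1.join(rdd2) produces: an inner join
# on the key, emitting (key, (left_value, right_value)) for each match.
def pair_join(left, right):
    right_map = {}
    for k, v in right:
        right_map.setdefault(k, []).append(v)
    # Unmatched left keys (here '4' and '5') are dropped, as in an inner join.
    return [(k, (lv, rv)) for k, lv in left for rv in right_map.get(k, [])]

rdd1 = [('1', 3428), ('2', 2991), ('3', 2990), ('4', 2883), ('5', 2672), ('5', 2653)]
rdd2 = [['1', 'Toy Story (1995)'], ['2', 'Jumanji (1995)'], ['3', 'Grumpier Old Men (1995)']]

joined = pair_join(rdd1, rdd2)
# joined[0] is ('1', (3428, 'Toy Story (1995)')), so x[1][1] is the
# title and x[1][0] is the count -- exactly what the map swaps into place.
result = [(x[1][1], x[1][0]) for x in joined]
```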

If rdd1 and rdd2 are plain Python lists (as written above) rather than actual RDDs, you can use a list comprehension:

>>> [(y[1], x[1]) for x in rdd1 for y in rdd2 if x[0] == y[0]]
[('Toy Story (1995)', 3428),
 ('Jumanji (1995)', 2991),
 ('Grumpier Old Men (1995)', 2990)]
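The nested comprehension scans rdd2 once per element of rdd1, which is O(n * m). If the lists are large, building a dict from rdd2 first gives the same result in roughly O(n + m). A minimal sketch:

```python
rdd1 = [('1', 3428), ('2', 2991), ('3', 2990), ('4', 2883), ('5', 2672), ('5', 2653)]
rdd2 = [['1', 'Toy Story (1995)'], ['2', 'Jumanji (1995)'], ['3', 'Grumpier Old Men (1995)']]

# Build a key -> title lookup table once, then do O(1) lookups per row.
titles = dict(rdd2)
result = [(titles[k], v) for k, v in rdd1 if k in titles]
```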

If you are working with large data on a cluster, you can use a broadcast join with DataFrame operations for better performance:

from pyspark.sql.functions import broadcast

df_points = spark.createDataFrame(rdd1, schema=['index', 'points'])
df_movie = spark.createDataFrame(rdd2, schema=['index', 'Movie'])
df_join = df_points.join(broadcast(df_movie), on='index').select('Movie', 'points')

You can also convert back to an RDD if needed:

df_join.rdd.map(list).collect()
