
Complicated inner join in Spark or PySpark

I have table X with key (a,b) and table Y with key (a). I have searched through multiple API functions in Spark but cannot find anything that joins these two tables on only key (a).

I have two data structures (think of them as two tables with two different keys):

X.take(1) -> keyed by (a,b)

[((u'"1"', u'"B8"'), (u'"1"', u'"B8"', array([[  7.16677290e-01,   4.15236265e-01,   7.02316511e-02]])))]

Y.take(1) -> keyed by (a)

[(u'"5"', (u'"5"', array([[ 0.86596322,  0.29811589,  0.29083844,  0.51458565,  0.23767414]])))]

Now I want a structure something like a -> [a, b, array_1, array_2].

cogroup didn't serve my purpose, as it returns a Cartesian product of key (a,b) and key (a).

Any suggestions or hints on how I can get a data structure with rows like:

a -> [a, b, array_1, array_2]
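For reference, a minimal sketch of how two RDDs with those shapes could be built (sc is an existing SparkContext; the literal values are just placeholders mirroring the take(1) output above, and the keys in the two samples do not actually match):

import numpy as np

# X: keyed by (a, b), value is (a, b, array_1)
X = sc.parallelize([
    (('"1"', '"B8"'), ('"1"', '"B8"', np.array([[0.7167, 0.4152, 0.0702]]))),
])

# Y: keyed by a, value is (a, array_2)
Y = sc.parallelize([
    ('"5"', ('"5"', np.array([[0.8660, 0.2981, 0.2908, 0.5146, 0.2377]]))),
])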

Is there a reason you have to keep the key as (a,b) throughout the entire join? Seems like you could change the structure of your RDD slightly to make the join work.

Just change ((a,b), [value]) to (a, (b, [value])), then join with (a, [value]). A plain join gives you (a, ((b, [value]), [value])) for each matching pair; with cogroup you would instead end up with (a, (Iterable((b, [value])), Iterable([value]))).
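A minimal PySpark sketch of that re-keying, assuming X and Y have exactly the shapes shown in the question (the variable names and the final flattening step are assumptions, not the asker's code):

# Re-key X from ((a, b), (a, b, array_1)) to (a, (b, array_1)).
x_by_a = X.map(lambda kv: (kv[0][0], (kv[0][1], kv[1][2])))

# Re-key Y from (a, (a, array_2)) to (a, array_2).
y_by_a = Y.map(lambda kv: (kv[0], kv[1][1]))

# Inner join on a: each matching pair yields (a, ((b, array_1), array_2)).
joined = x_by_a.join(y_by_a)

# Flatten to the requested a -> [a, b, array_1, array_2] layout.
result = joined.map(lambda kv: (kv[0], [kv[0], kv[1][0][0], kv[1][0][1], kv[1][1]]))

result.take(1) then gives rows shaped like (a, [a, b, array_1, array_2]).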
