
Complicated inner join in Spark or PySpark

I have table X with key (a,b) and table Y with key (a). I have searched through multiple API functions in Spark but cannot find anything that joins these two tables on only key (a).

I have two data structures (think of them as two tables with two different keys):

X.take(1) -> keyed by (a,b)

[((u'"1"', u'"B8"'), (u'"1"', u'"B8"', array([[  7.16677290e-01,   4.15236265e-01,   7.02316511e-02]])))]

Y.take(1) -> keyed by (a)

[(u'"5"', (u'"5"', array([[ 0.86596322,  0.29811589,  0.29083844,  0.51458565,  0.23767414]])))]

Now I want a structure something like a -> [a, b, array_1, array_2].

cogroup didn't serve my purpose, as it returns a Cartesian product of key (a,b) and key (a).

Any suggestions or hints on how I can get a data structure with rows like:

a -> [a, b, array_1, array_2]
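For reference, a minimal sketch of how two RDDs with those shapes could be built (sc is an existing SparkContext; the literal values are just placeholders mirroring the take(1) output above, and the keys in the two samples do not actually match):

import numpy as np

# X: keyed by (a, b), value is (a, b, array_1)
X = sc.parallelize([
    (('"1"', '"B8"'), ('"1"', '"B8"', np.array([[0.7167, 0.4152, 0.0702]]))),
])

# Y: keyed by a, value is (a, array_2)
Y = sc.parallelize([
    ('"5"', ('"5"', np.array([[0.8660, 0.2981, 0.2908, 0.5146, 0.2377]]))),
])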

Is there a reason you have to keep the key as (a,b) throughout the entire join? Seems like you could change the structure of your RDD slightly to make the join work.

Just change ((a,b), [value]) to (a, (b, [value])), then join with (a, [value]). A plain join gives you (a, ((b, [value]), [value])) for each matching pair; with cogroup you would instead end up with (a, (Iterable((b, [value])), Iterable([value]))).
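A minimal PySpark sketch of that re-keying, assuming X and Y have exactly the shapes shown in the question (the variable names and the final flattening step are assumptions, not the asker's code):

# Re-key X from ((a, b), (a, b, array_1)) to (a, (b, array_1)).
x_by_a = X.map(lambda kv: (kv[0][0], (kv[0][1], kv[1][2])))

# Re-key Y from (a, (a, array_2)) to (a, array_2).
y_by_a = Y.map(lambda kv: (kv[0], kv[1][1]))

# Inner join on a: each matching pair yields (a, ((b, array_1), array_2)).
joined = x_by_a.join(y_by_a)

# Flatten to the requested a -> [a, b, array_1, array_2] layout.
result = joined.map(lambda kv: (kv[0], [kv[0], kv[1][0][0], kv[1][0][1], kv[1][1]]))

result.take(1) then gives rows shaped like (a, [a, b, array_1, array_2]).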
