
How to pass list of RDDs to groupWith in Pyspark

I am trying to pass a list of RDDs to groupWith instead of manually specifying them by index.

Here is the sample data:

w = sc.parallelize([("1", 5), ("3", 6)])
x = sc.parallelize([("1", 1), ("3", 4)])
y = sc.parallelize([("2", 2), ("4", 3)])
z = sc.parallelize([("2", 42), ("4", 43), ("5", 12)])

Now I have created a list of them:

m = [w,x,y,z]

The manual, hardcoded way is:

[(x, tuple(map(list, y))) for x, y in sorted(list(m[0].groupWith(m[1],m[2],m[3]).collect()))]

which prints the following result:

[('1', ([5], [1], [], [])),
 ('2', ([], [], [2], [42])),
 ('3', ([6], [4], [], [])),
 ('4', ([], [], [3], [43])),
 ('5', ([], [], [], [12]))]

But I would like to pass m[1:] instead of spelling out each RDD manually:

[(x, tuple(map(list, y))) for x, y in sorted(list(m[0].groupWith(m[1:]).collect()))]

I also tried removing the brackets, but that requires converting to a string, and I get the errors below:

AttributeError: 'list' object has no attribute 'mapValues'

AttributeError: 'str' object has no attribute 'mapValues'

Since groupWith accepts varargs, passing the list itself makes groupWith treat the whole list as if it were a single RDD, so PySpark tries to call .mapValues on it and raises the first AttributeError above. All you have to do is unpack the arguments:

w.groupWith(*m[1:])
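
Applied to the original list comprehension, the unpacked call reproduces the hardcoded output for any length of m (a minimal sketch reusing the m list defined above; the name result is just for illustration):

# Unpack m[1:] so each RDD becomes a separate positional argument to groupWith
result = [(k, tuple(map(list, v)))
          for k, v in sorted(m[0].groupWith(*m[1:]).collect())]
# result == [('1', ([5], [1], [], [])), ('2', ([], [], [2], [42])), ...]

This yields the same list as the hardcoded m[0].groupWith(m[1],m[2],m[3]) version.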
