
Combine two RDD filter operations in PySpark

I have to do an RDD-based operation. My operation is as follows:

test1 = rdd.filter(lambda y: (y[0] >= y[1])) # condition 1
test2 = rdd.filter(lambda y: (y[0] < y[1])) # condition 2
result1 = test1.collect()
result2 = test2.collect()
print('(',len(result1),',',len(result2),')')

Can I combine these two conditions into one RDD operation? I tried something like this:

test3 = test1.zip(test2).collect()

But it didn't work: zip requires both RDDs to have the same number of partitions and the same number of elements per partition, which the two filtered RDDs generally don't. To clarify what I'm after: if I apply collect() to the test1 RDD, I get a list and then take its length, and I do the same for the test2 RDD. The question is whether I can do this in one shot, i.e. find the lengths of both lists in one go.

IIUC, you can map the two conditions into a tuple, convert the resulting boolean values into integers, and then reduce with element-wise addition:

# create a sample RDD of 30 random pairs
import numpy as np
from operator import add

rdd = sc.parallelize([*map(tuple, np.random.randint(1,100,(30,2)))])

rdd.map(lambda y: (int(y[0] >= y[1]), int(y[0] < y[1]))) \
   .reduce(lambda x,y: tuple(map(add, x,y)))
#(19, 11)
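
If you only need the two counts (not the filtered rows themselves), a one-pass alternative is to count the boolean condition itself with countByValue. A minimal sketch, assuming the same sc and rdd as above; it produces the same pair of counts as the reduce:

counts = rdd.map(lambda y: y[0] >= y[1]).countByValue()  # dict: {True: count_ge, False: count_lt}
print('(', counts.get(True, 0), ',', counts.get(False, 0), ')')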

Do you mean you want to get just one result instead of two?

   test = rdd.filter(lambda y: (y[0] >= y[1]) and (y[0] < y[1]))
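
Note that the two conditions are complementary, so if the goal is just the two lengths, one of them can be derived from the total count. A small sketch under that assumption, reusing the rdd from the question:

   total = rdd.count()
   n_ge = rdd.filter(lambda y: y[0] >= y[1]).count()  # rows where y[0] >= y[1]
   print('(', n_ge, ',', total - n_ge, ')')           # remaining rows satisfy y[0] < y[1]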
