
Combine two RDD filter operations in PySpark

I have to do an RDD-based operation. My operation is as follows:

test1 = rdd.filter(lambda y: (y[0] >= y[1])) # condition 1
test2 = rdd.filter(lambda y: (y[0] < y[1])) # condition 2
result1 = test1.collect()
result2 = test2.collect()
print('(',len(result1),',',len(result2),')')

Can I combine these two conditions into one RDD operation? I tried something like this:

test3 = test1.zip(test2).collect()

But it didn't work: zip requires both RDDs to have the same number of partitions and the same number of elements per partition, which the two filtered RDDs generally don't. To clarify what I'm after: if I apply collect() to the test1 RDD, I get a list and then take its length, and I do the same for the test2 RDD. The question is whether I can do this in one shot, i.e. find the lengths of both lists in one go.

IIUC, you can map the two conditions into a tuple, convert the resulting boolean values into integers, and then reduce with element-wise addition:

# create a sample RDD of 30 random pairs
import numpy as np
from operator import add

rdd = sc.parallelize([*map(tuple, np.random.randint(1,100,(30,2)))])

rdd.map(lambda y: (int(y[0] >= y[1]), int(y[0] < y[1]))) \
   .reduce(lambda x,y: tuple(map(add, x,y)))
#(19, 11)
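
If you only need the two counts (not the filtered rows themselves), a one-pass alternative is to count the boolean condition itself with countByValue. A minimal sketch, assuming the same sc and rdd as above; it produces the same pair of counts as the reduce:

counts = rdd.map(lambda y: y[0] >= y[1]).countByValue()  # dict: {True: count_ge, False: count_lt}
print('(', counts.get(True, 0), ',', counts.get(False, 0), ')')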

Do you mean you want to get just one result instead of two?

   test = rdd.filter(lambda y: (y[0] >= y[1]) and (y[0] < y[1]))
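
Note that the two conditions are complementary, so if the goal is just the two lengths, one of them can be derived from the total count. A small sketch under that assumption, reusing the rdd from the question:

   total = rdd.count()
   n_ge = rdd.filter(lambda y: y[0] >= y[1]).count()  # rows where y[0] >= y[1]
   print('(', n_ge, ',', total - n_ge, ')')           # remaining rows satisfy y[0] < y[1]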
