I need to perform an RDD-based operation. It looks like this:
test1 = rdd.filter(lambda y: (y[0] >= y[1])) # condition 1
test2 = rdd.filter(lambda y: (y[0] < y[1])) # condition 2
result1 = test1.collect()
result2 = test2.collect()
print('(',len(result1),',',len(result2),')')
Can I combine these two conditions into a single RDD operation? I tried something like this:
test3 = test1.zip(test2).collect()
But it didn't work. For example, if I apply collect() to the test1 RDD, I get a list, and then I find the length of that list. I do the same for the test2 RDD. Can I do this in one shot, i.e. find the lengths of both lists in one go?
IIUC, you can map the two conditions into a tuple, convert the resulting boolean values into integers, and then reduce:
# create a sample RDD with 30 elements (assumes an existing SparkContext named sc)
import numpy as np
from operator import add
rdd = sc.parallelize([*map(tuple, np.random.randint(1,100,(30,2)))])
rdd.map(lambda y: (int(y[0] >= y[1]), int(y[0] < y[1]))) \
.reduce(lambda x,y: tuple(map(add, x,y)))
# e.g. (19, 11); the exact counts vary with the random sample
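The map-then-reduce logic can also be checked without a Spark cluster. The following is only a minimal sketch in plain Python, assuming a small hand-written list of pairs in place of the RDD; the names `pairs` and `count_conditions` are hypothetical and not part of the original answer.

```python
from functools import reduce
from operator import add

# Hypothetical sample pairs, standing in for the RDD's elements.
pairs = [(5, 3), (2, 7), (4, 4), (1, 9), (8, 6)]

def count_conditions(pairs):
    """Return (count of y[0] >= y[1], count of y[0] < y[1]) in one pass."""
    # Map each pair to a tuple of 0/1 flags, one flag per condition.
    flags = map(lambda y: (int(y[0] >= y[1]), int(y[0] < y[1])), pairs)
    # Sum the flag tuples element-wise, mirroring the reduce step.
    return reduce(lambda x, y: tuple(map(add, x, y)), flags)

print(count_conditions(pairs))  # (3, 2)
```

Both counts come out of a single pass over the data, so no intermediate lists need to be collected just to take their lengths.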
Do you mean you want just one result instead of two?
test = rdd.filter(lambda y: (y[0] >= y[1]) and ((y[0] < y[1])))