
How can I use a 'for' loop to apply transformations and output in Spark Streaming's DStream?

I am new to Spark. I generate 1000 different instances of a class I defined (the methods in those instances are the same, but each instance's parameters differ): sampler = generateClass(). Then I need to map those instances' methods over my stream (for the tests below, I use just 10 and then 2 instances).

s=[]
for i in range(10):        
    s.append(mappedStream.map(lambda x: sampler[i].insert(x)).reduce(min))

uStream=ssc.union(s[0],s[1],s[2],s[3],s[4],s[5],s[6],s[7],s[8],s[9])
uStream.pprint()

But the output is just the same key-value pair repeated 10 times; it seems this code maps my data through only one of the instances and then repeats that result 10 times.

(85829323L, [2, 1])
(85829323L, [2, 1])
(85829323L, [2, 1])
(85829323L, [2, 1])
....

Then I tried:

myStream1=mappedStream.map(lambda x: sampler[0].insert(x)).reduce(min)
myStream2=mappedStream.map(lambda x: sampler[1].insert(x)).reduce(min)
ssc.union(myStream1,myStream2).pprint()

The output is correct:

(85829323L, [2, 1])
(99580454L, [4, 1])

Why does this happen, and how can I fix it? Thank you very much.

This happens because Python closures are late-binding: a lambda captures the variable i itself, not its value at the moment the lambda is created. Spark transformations are also evaluated lazily, so by the time an action runs the lambdas in s, the loop has finished and every one of them sees the final value of i (9 in your case, the last loop value).
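
To see this pitfall in isolation, here is a minimal plain-Python sketch of the same behavior (no Spark needed):

fs = []
for i in range(3):
    fs.append(lambda: i)  # each lambda closes over the variable i, not its current value

print([f() for f in fs])  # prints [2, 2, 2], not [0, 1, 2]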

You can use a function-factory pattern to "force" the binding of the appropriate i, for example:

def call_sampler(i):
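    # i is evaluated right here, so each returned lambda is bound to its own value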
    return lambda x: sampler[i].insert(x)

s=[]
for i in range(10):        
    s.append(mappedStream.map(call_sampler(i)).reduce(min))

uStream=ssc.union(s[0],s[1],s[2],s[3],s[4],s[5],s[6],s[7],s[8],s[9])
uStream.pprint()
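
Two equivalent alternatives may be worth noting (these are common Python idioms and a PySpark convenience, not part of the original answer): binding i as a default argument evaluates it once per loop iteration, and ssc.union accepts a variable number of streams, so the whole list can be unpacked with *:

s = []
for i in range(10):
    # i=i freezes the current loop value as the lambda's default argument
    s.append(mappedStream.map(lambda x, i=i: sampler[i].insert(x)).reduce(min))

uStream = ssc.union(*s)  # equivalent to listing s[0] ... s[9] explicitly
uStream.pprint()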
