
How can I make my Python code run on the AWS slave nodes using Apache Spark?

I am learning Apache Spark as well as its interface with AWS. I have already created a master node on AWS with 6 slave nodes. I also have the following Python code written with Spark:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("print_num").setMaster("AWS_master_url")
sc = SparkContext(conf = conf)

# Make the list be distributed
rdd = sc.parallelize([1,2,3,4,5])

# Just want each of the 5 slave nodes to do the mapping work.
temp = rdd.map(lambda x: x + 1)

# Also want another slave node to do the reducing work.
for x in temp.sample(False, 1).collect(): 
    print(x)

My question is how I can set up the 6 slave nodes in AWS so that 5 of them do the mapping work, as in the code above, and the remaining node does the reducing work. I would really appreciate any help.

From what I understand, you cannot specify that five nodes serve as map nodes and one as a reduce node within a single Spark cluster.

You could have two clusters running, one with five nodes for the map tasks and one for the reduce tasks. You could then break your code into two separate jobs and submit them to the two clusters sequentially, writing the results to disk in between (a sketch of the two jobs follows below). However, this would likely be less efficient than letting Spark handle the shuffle communication itself.
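As a rough illustration, the two jobs might look something like this. This is only a sketch: the app names and the s3://my-bucket/mapped path are made up, and the intermediate location just needs to be storage that both clusters can reach.

from pyspark import SparkConf, SparkContext

# Job 1: submitted to the five-node cluster; does the mapping and writes the result out.
conf = SparkConf().setAppName("map_job")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])
mapped = rdd.map(lambda x: x + 1)
mapped.saveAsTextFile("s3://my-bucket/mapped")  # hypothetical shared path
sc.stop()

from pyspark import SparkConf, SparkContext

# Job 2: submitted to the other cluster; reads the intermediate data and reduces it.
conf = SparkConf().setAppName("reduce_job")
sc = SparkContext(conf=conf)

mapped = sc.textFile("s3://my-bucket/mapped").map(int)
total = mapped.reduce(lambda a, b: a + b)
print(total)
sc.stop()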

In Spark, the call to .map() is "lazy" in the sense that it does not execute until an "action" is called. In your code, that action is the call to .collect().
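For example, using the sc from your snippet, nothing runs on the executors until the action at the end:

rdd = sc.parallelize([1, 2, 3, 4, 5])

# No work happens yet; map() only records the transformation in the lineage.
temp = rdd.map(lambda x: x + 1)

# collect() is an action, so it triggers execution of the whole lineage.
print(temp.collect())  # [2, 3, 4, 5, 6]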

See https://spark.apache.org/docs/latest/programming-guide.html

Out of curiosity, is there a reason you want one node to handle all reductions?

Also, based on the documentation, the .sample() function takes three parameters: withReplacement, fraction, and an optional seed. Could you post the stderr and stdout from running this code?
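For reference, a call with all three arguments would look like this (reusing the rdd from your code; the fraction and seed values here are arbitrary):

# sample(withReplacement, fraction, seed) -- the seed is optional
sampled = rdd.sample(False, 0.5, 42)
print(sampled.collect())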
