
How can I make my python code run on the AWS slave nodes using Apache-Spark?

I am learning Apache-Spark and its interface with AWS. I've already created a master node with 6 slave nodes on AWS. I also have the following Python code written with Spark:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("print_num").setMaster("AWS_master_url")
sc = SparkContext(conf=conf)

# Distribute the list across the cluster
rdd = sc.parallelize([1, 2, 3, 4, 5])

# I just want each of the 5 slave nodes to do the mapping work.
temp = rdd.map(lambda x: x + 1)

# I also want another slave node to do the reducing work.
for x in temp.sample(False, 1).collect():
    print(x)

My question is how I can set up the 6 slave nodes in AWS such that 5 slave nodes do the mapping work, as in the code above, and the remaining slave node does the reducing work. I'd really appreciate it if anyone could help me.

From what I understand, you cannot designate five nodes as map nodes and one as a reduce node within a single Spark cluster.

You could run two clusters: one with five nodes for the map tasks and one for the reduce tasks. You could then break your code into two jobs, submit them to the two clusters sequentially, and write the intermediate results to disk in between. However, this would likely be less efficient than letting Spark handle the shuffle communication itself.
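As a rough illustration of that two-job pattern (not actual Spark code — in a real pipeline you would use something like rdd.saveAsTextFile(...) in the first job and sc.textFile(...) in the second), here is a minimal plain-Python sketch where the "map job" persists its output to disk and a separate "reduce job" reads it back. The file path and function names are hypothetical:

```python
import json
import os
import tempfile

def map_job(data, path):
    """Stage 1: apply the map step and persist the intermediate result."""
    mapped = [x + 1 for x in data]      # the map step from the question
    with open(path, "w") as f:
        json.dump(mapped, f)            # write intermediate results to disk

def reduce_job(path):
    """Stage 2: read the intermediate result back and reduce it."""
    with open(path) as f:
        mapped = json.load(f)
    return sum(mapped)                  # an example reduce step

path = os.path.join(tempfile.gettempdir(), "intermediate.json")
map_job([1, 2, 3, 4, 5], path)
print(reduce_job(path))  # 2 + 3 + 4 + 5 + 6 = 20
```

The disk write between the stages is exactly the overhead that Spark's own shuffle avoids, which is why the single-cluster approach is usually faster.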

In Spark, the call to .map() is "lazy" in the sense that it does not execute until an "action" is called. In your code, that action is the call to .collect().

See https://spark.apache.org/docs/latest/programming-guide.html

Out of curiosity, is there a reason you want one node to handle all of the reductions?

Also, based on the documentation, the .sample() function takes three parameters: withReplacement, fraction, and an optional seed. Could you post the stderr and stdout from this code?

