
Apache Spark Python to Scala translation

If I got it right, Apache YARN receives the Application Master and Node Manager as JAR files. They are executed as Java processes on the nodes of the YARN cluster. When I write a Spark program using Python, does it get compiled into a JAR somehow? If not, how is Spark able to execute Python logic on YARN cluster nodes?

The PySpark driver program uses Py4J ( http://py4j.sourceforge.net/ ) to launch a JVM and create a SparkContext. Spark RDD operations written in Python are mapped to operations on a PythonRDD.
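One way to see this mapping from a PySpark session (a quick sketch; `_jrdd` is an internal attribute, so the exact class name may vary between Spark versions):

```python
from pyspark import SparkContext

# Creating a SparkContext from Python launches a JVM through Py4J;
# sc._jvm is the Py4J gateway into that JVM.
sc = SparkContext(master="local[2]", appName="pythonrdd-demo")

rdd = sc.parallelize(range(10)).map(lambda x: x * 2)

# The Python-side RDD is backed by a Java-side RDD object. For transformations
# that run Python code, that backing object is a PythonRDD.
print(rdd._jrdd.getClass().getName())  # e.g. org.apache.spark.api.python.PythonRDD
print(rdd.collect())

sc.stop()
```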

On the remote workers, PythonRDD launches sub-processes which run Python. The data and code are passed from the remote worker's JVM to its Python sub-process using pipes.
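A toy sketch of that idea (this is not Spark's actual worker protocol, which also ships the pickled user function, batches records, and reuses worker processes): a parent process pipes pickled data to a Python sub-process and reads the pickled result back.

```python
import pickle
import subprocess
import sys

# Child process: read a pickled partition from stdin, apply a function,
# and write the pickled result to stdout.
worker_code = r"""
import pickle, sys
data = pickle.load(sys.stdin.buffer)
result = [x * 2 for x in data]
pickle.dump(result, sys.stdout.buffer)
"""

proc = subprocess.Popen(
    [sys.executable, "-c", worker_code],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
out, _ = proc.communicate(pickle.dumps(list(range(5))))
print(pickle.loads(out))  # [0, 2, 4, 6, 8]
```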

Therefore, your YARN nodes must have Python installed for this to work.
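If a node has several interpreters, you can tell Spark which one the executors should use for the Python worker sub-processes. A minimal sketch using the documented PYSPARK_PYTHON environment variable (the interpreter path is only an example and must exist on every YARN node):

```python
import os
from pyspark import SparkConf, SparkContext

# PYSPARK_PYTHON tells Spark which interpreter to launch for the Python
# worker sub-processes; /usr/bin/python3 is just an example path.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"

conf = SparkConf().setAppName("python-on-yarn").setMaster("yarn")
sc = SparkContext(conf=conf)
```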

The Python code is not compiled into a JAR; instead, it is distributed around the cluster by Spark. To make this possible, user functions written in Python are pickled using the following code: https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py
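The reason cloudpickle is used rather than the standard pickle module is that it can serialize lambdas and closures by value, so the resulting bytes can be sent to worker processes that never imported your code. A small illustration with the standalone cloudpickle package (PySpark bundles its own copy as pyspark.cloudpickle):

```python
import pickle
import cloudpickle  # standalone package: pip install cloudpickle

add_one = lambda x: x + 1

# The standard pickle module cannot serialize a lambda by value...
try:
    pickle.dumps(add_one)
except Exception as exc:
    print("pickle failed:", exc)

# ...but cloudpickle can, and the resulting bytes are ordinary pickle data
# that any worker can load and execute without the original module.
blob = cloudpickle.dumps(add_one)
restored = pickle.loads(blob)
print(restored(41))  # 42
```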

Source: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
