
How does Apache Spark handle Python multithreading issues?

Because of Python's GIL, threads can't run CPU-bound work in parallel within a single process, so my question is: how does Apache Spark use Python in a multi-core environment?
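For example, on CPython, adding threads doesn't speed up a CPU-bound loop (a minimal sketch; exact timings will vary by machine):

    # Minimal sketch: under CPython's GIL, four threads running a
    # CPU-bound loop take roughly as long as running the loop four
    # times in one thread, because only one thread executes Python
    # bytecode at a time.
    import threading
    import time

    def cpu_bound(n):
        total = 0
        for i in range(n):
            total += i * i
        return total

    start = time.time()
    threads = [threading.Thread(target=cpu_bound, args=(10_000_000,))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"4 threads: {time.time() - start:.1f}s")  # no real speedup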

Multithreading issues in Python are separate from Apache Spark's internals. Parallelism in Spark is managed inside the JVM.

[Figure: PySpark internals diagram showing the Python driver, the Py4J bridge to the JVM, and Python worker sub-processes]

The reason is that in the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
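For intuition, here is a minimal sketch of the bare Py4J pattern PySpark builds on. This is not Spark's actual bootstrap code; it assumes a JVM running py4j's GatewayServer is already listening on the default port:

    # Minimal Py4J sketch (assumption: a JVM with py4j's GatewayServer
    # is already listening on the default port; this is the pattern
    # PySpark builds on, not Spark's own bootstrap code).
    from py4j.java_gateway import JavaGateway

    gateway = JavaGateway()   # connect to the JVM-side GatewayServer
    jvm = gateway.jvm         # view of the JVM's class namespace
    # Call a Java method from Python over the local Py4J socket:
    print(jvm.java.lang.System.currentTimeMillis())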

Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism.

RDD transformations in Python are mapped to transformations on PythonRDD objects in Java. On remote worker machines, PythonRDD objects launch Python sub-processes and communicate with them using pipes, sending the user's code and the data to be processed.
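So a CPU-bound Python function can still use all cores, because each task runs in its own worker process rather than in a thread. A minimal sketch (running locally with local[4]; the function name cpu_bound is illustrative, not Spark API):

    # Minimal sketch: CPU-bound Python work parallelized by Spark
    # worker processes despite the GIL.
    from pyspark import SparkContext

    sc = SparkContext("local[4]", "gil-demo")

    def cpu_bound(n):
        # Pure-Python loop: inside one process, threads would serialize
        # on the GIL; here each partition gets its own Python worker.
        total = 0
        for i in range(n):
            total += i * i
        return total

    # Four partitions -> up to four Python worker sub-processes
    # running cpu_bound at the same time on four cores.
    print(sc.parallelize([10_000_000] * 4, 4).map(cpu_bound).collect())
    sc.stop()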

PS: I'm not sure if this actually answers your question completely.
