
Python multiprocessing tool vs Py(Spark)

A newbie question, as I get increasingly confused with pyspark. I want to scale an existing Python data preprocessing and data analysis pipeline. I realize that if I partition my data with pyspark, I can no longer treat each partition as a standalone pandas DataFrame; I have to learn to manipulate data with pyspark.sql row/column functions and change a lot of existing code, and I am also tied to Spark's MLlib libraries and can't take full advantage of the more mature scikit-learn package. So why would I ever need Spark, if I can use multiprocessing tools for cluster computing and parallelize tasks on the existing DataFrame?
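To make the comparison concrete, here is a minimal sketch of the kind of multiprocessing setup described above: split a pandas DataFrame into chunks and process each chunk in a separate worker process. The names (`preprocess_chunk`, the `value` column, the worker count) are illustrative assumptions, not taken from any real pipeline.

```python
import pandas as pd
from multiprocessing import Pool


def preprocess_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # Plain pandas (or scikit-learn) logic can live here unchanged.
    chunk = chunk.copy()
    chunk["value"] = chunk["value"] * 2
    return chunk


if __name__ == "__main__":
    df = pd.DataFrame({"value": range(1_000_000)})

    # You manage the chunking and worker count yourself.
    n_workers = 8
    chunks = [df.iloc[i::n_workers] for i in range(n_workers)]  # round-robin split

    with Pool(processes=n_workers) as pool:
        results = pool.map(preprocess_chunk, chunks)

    df_out = pd.concat(results, ignore_index=True)
    print(df_out.head())
```

This works well on a single machine, but the chunking, worker sizing, and any failure handling are all manual, which is the trade-off the answer below addresses.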

True, Spark does have the limitations you mention: you are confined to the functional Spark world (Spark MLlib, DataFrames, etc.). However, what it provides over other multiprocessing tools/libraries is the automatic distribution, partitioning and rescaling of parallel tasks. Scaling and scheduling Spark code becomes an easier task than having to program your custom multiprocessing code to respond to larger amounts of data and computation.
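As a rough illustration of that point, here is a minimal sketch (not from the original pipeline) of the same doubling step expressed with pyspark.sql column functions. Spark decides how to partition, distribute and schedule the work; the identical code scales from a laptop to a cluster by changing only the cluster configuration, whereas the multiprocessing version above manages chunking and worker counts itself.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("preprocess-demo").getOrCreate()

# spark.range creates a distributed DataFrame; partitioning and task
# scheduling are handled by Spark, not by hand-written worker code.
sdf = spark.range(1_000_000).withColumnRenamed("id", "value")

# The same transformation, expressed as a column expression.
out = sdf.withColumn("value", F.col("value") * 2)

out.show(5)
```

The cost, as noted in the question, is rewriting pandas logic in Spark's API; the benefit is that distribution, fault tolerance and rescaling come for free.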
