
How to process multiple pyspark dataframes in parallel

I have a pyspark dataframe with millions of records and hundreds of columns (an example below):

clm1, clm2, clm3
code1,xyz,123
code2,abc,345
code1,qwe,456
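
For reference, a minimal sketch that recreates this sample dataframe (assuming a SparkSession; the real dataframe of course has millions of rows):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small sample matching the example above
df = spark.createDataFrame(
    [("code1", "xyz", 123), ("code2", "abc", 345), ("code1", "qwe", 456)],
    ["clm1", "clm2", "clm3"],
)
df.show()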

I want to divide it into multiple dataframes based on clm1, i.e. a separate dataframe for clm1=code1, a separate dataframe for clm1=code2, and so on, then process each one and write the results to separate files. I want to perform these operations in parallel to speed up the process. I am using the code below:

S1 = myclass("code1")
S2 = myclass("code2")


t1 = multiprocessing.Process(target=S1.processdata,args=(df,))
t2 = multiprocessing.Process(target=S2.processdata,args=(df,))
t1.start()
t2.start()

t1.join()
t2.join()

but I am getting the error below:

Method __getstate__([]) does not exist

If I use threading.Thread instead of multiprocessing.Process it works fine, but that does not seem to reduce the overall time.

About the error

Method __getstate__([]) does not exist

It's a py4j.Py4JException. You get this error with multiprocessing.Process because that module uses separate processes, so the dataframe object has to be serialized and sent to them. threading.Thread, on the other hand, uses threads that share the same memory, so they can share the dataframe object.
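
The behaviour is easy to reproduce: multiprocessing has to pickle the arguments it sends to a child process, and a pyspark DataFrame cannot be pickled because it only wraps a py4j reference to a JVM object. A minimal sketch (assuming df is the dataframe from the question):

import pickle

# multiprocessing pickles the arguments passed to a child process;
# a pyspark DataFrame wraps a reference to a JVM object, so this fails.
try:
    pickle.dumps(df)
except Exception as e:
    print(type(e).__name__, e)  # typically py4j's "Method __getstate__([]) does not exist"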

Also take a look at this SO question and answer: Multiprocessing vs Threading Python


General advice

I understand that you may be new to the Spark world, so I suggest a solution for your problem. You asked how to do multiprocessing, but if you have Spark, that may not be the best practice.

You have Spark, a framework for parallel processing; you don't need to parallelize your task manually.

Spark is designed for parallel computing in a cluster, but it also works extremely well on a large single node. The multiprocessing library is useful for Python computation tasks, whereas in Spark/PySpark all the computations already run in parallel in the JVM.

In python_code.py:

import pyspark.sql.functions as f

# JOB 1
df1 = df.filter(f.col('clm1') == 'code1')
# ... many transformations
df1.write.format('..')..

# JOB 2
df2 = df.filter(f.col('clm1') == 'code2')
# ... many transformations
df2.write.format('..')..

And then run this code with spark-submit, using all your cores (* = all cores):

# Run application locally on all cores
./bin/spark-submit --master local[*] python_code.py

With this approach, you use the power of Spark. The jobs will be executed sequentially, BUT you will have: full CPU utilization all the time <=> parallel processing <=> lower computation time.
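
As a quick sanity check (a minimal sketch, assuming a SparkSession named spark has been created inside python_code.py), you can confirm that a single job already spreads its tasks over all local cores:

# With --master local[*], defaultParallelism equals the number of cores,
# and each stage runs one task per partition in parallel across them.
print(spark.sparkContext.defaultParallelism)
print(df.rdd.getNumPartitions())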
