
Running tasks in parallel - pyspark

I have a PySpark DataFrame that I use to create several new DataFrames, which I then join back together at the end.

For example:

source_dataframe = spark.createDataFrame(rdd, schema).cache()

df1 = function1(source_dataframe)
df2 = function2(source_dataframe)
df3 = function3(source_dataframe)
df4 = function4(source_dataframe)

Each function is independent of the others, and I join the results back together to create my final DataFrame.

final_df = df1.join(df2, ["id"]).join(df3, ["id"]).join(df4, ["id"])

Is there a way in PySpark to run all of the above functions in parallel, since they are independent of each other? Or does Spark automatically run them in parallel because they are independent?

Any help will be appreciated. Thanks.

Spark is lazily evaluated and will not compute anything until you apply an action. Each transformation you apply is only added to the DAG, and everything is evaluated when you trigger an action on the final DataFrame.

So there is no need to execute these transformations concurrently yourself; Spark handles that for you, because execution is distributed.
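As an illustration, here is a minimal sketch of that lazy behavior. The function1 through function4 bodies below are placeholders (the originals are not shown in the question); the point is that each call returns immediately with only a logical plan, and nothing runs until the action at the end.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder source; in the question this comes from spark.createDataFrame(rdd, schema)
source_dataframe = spark.range(100).cache()

# Placeholder transformations -- each returns immediately, only adding nodes to the DAG
def function1(df): return df.withColumn("a", F.col("id") * 2).select("id", "a")
def function2(df): return df.withColumn("b", F.col("id") + 1).select("id", "b")
def function3(df): return df.withColumn("c", F.col("id") % 3).select("id", "c")
def function4(df): return df.withColumn("d", F.col("id") - 5).select("id", "d")

df1 = function1(source_dataframe)   # no job is launched here
df2 = function2(source_dataframe)   # nor here
df3 = function3(source_dataframe)
df4 = function4(source_dataframe)

final_df = df1.join(df2, ["id"]).join(df3, ["id"]).join(df4, ["id"])

final_df.count()   # the action: only now does Spark plan and execute the whole DAG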

Another point: since Spark is distributed, the workload is already divided across multiple executors. If you try to layer Python's multiprocessing on top of that, you only increase the load on your driver node, which can lead to OOM issues or slow execution.
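You can confirm this by looking at the plan Spark builds: the independent branches all end up in one optimized plan, and the parallelism comes from partitions being processed across executors, not from driver-side threads or processes. A short sketch, reusing the hypothetical final_df from above (the output path is just an example):

# Print the physical plan: the four branches and the joins are
# optimized and scheduled together as a single query.
final_df.explain()

# One action runs that plan; the work is split across the source's
# partitions and executed on the executors in parallel.
final_df.write.mode("overwrite").parquet("/tmp/final_df")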
