
Running tasks in parallel - pyspark

I have a PySpark DataFrame that I use to create several new DataFrames, which I then join back together at the end.

For example:

source_dataframe = spark.createDataFrame(rdd, schema).cache()

df1 = function1(source_dataframe)
df2 = function2(source_dataframe)
df3 = function3(source_dataframe)
df4 = function4(source_dataframe)

Each function is independent of the others, and I join the results back together to create my final DataFrame.

final_df = df1.join(df2, ["id"]).join(df3, ["id"]).join(df4, ["id"])

Is there a way in PySpark to run all of the above functions in parallel, since they are independent of each other? Or does Spark automatically run them in parallel because they are independent?

Any help will be appreciated. Thanks.

Spark is lazily evaluated and will not compute anything until you apply an action. Each transformation you apply is only added to the DAG, and everything is evaluated when you trigger an action on the final DataFrame.

So there is no need to execute these transformations concurrently yourself; Spark handles that for you, because execution is distributed.
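As an illustration, here is a minimal sketch of that lazy behavior. The function1 through function4 bodies below are placeholders (the originals are not shown in the question); the point is that each call returns immediately with only a logical plan, and nothing runs until the action at the end.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder source; in the question this comes from spark.createDataFrame(rdd, schema)
source_dataframe = spark.range(100).cache()

# Placeholder transformations -- each returns immediately, only adding nodes to the DAG
def function1(df): return df.withColumn("a", F.col("id") * 2).select("id", "a")
def function2(df): return df.withColumn("b", F.col("id") + 1).select("id", "b")
def function3(df): return df.withColumn("c", F.col("id") % 3).select("id", "c")
def function4(df): return df.withColumn("d", F.col("id") - 5).select("id", "d")

df1 = function1(source_dataframe)   # no job is launched here
df2 = function2(source_dataframe)   # nor here
df3 = function3(source_dataframe)
df4 = function4(source_dataframe)

final_df = df1.join(df2, ["id"]).join(df3, ["id"]).join(df4, ["id"])

final_df.count()   # the action: only now does Spark plan and execute the whole DAG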

Another point: since Spark is distributed, the workload is already divided across multiple executors. If you try to layer Python's multiprocessing on top of that, you only increase the load on your driver node, which can lead to OOM issues or slow execution.
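You can confirm this by looking at the plan Spark builds: the independent branches all end up in one optimized plan, and the parallelism comes from partitions being processed across executors, not from driver-side threads or processes. A short sketch, reusing the hypothetical final_df from above (the output path is just an example):

# Print the physical plan: the four branches and the joins are
# optimized and scheduled together as a single query.
final_df.explain()

# One action runs that plan; the work is split across the source's
# partitions and executed on the executors in parallel.
final_df.write.mode("overwrite").parquet("/tmp/final_df")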
