
Python Spark Job Optimization

I'm running PySpark (2.3) on a Dataproc cluster with

  • 3 nodes (4 CPUs)
  • 8 GB Memory each.

The data has close to 1.3 million rows and 4 columns:

Date, unique_id (alphanumeric), category (10 distinct values) and Prediction (0 or 1)

PS: this is time series data.

We are using Facebook's Prophet model for predictive modelling. Since Prophet only accepts Pandas dataframes as input, below is what I am doing to convert the Spark dataframe to a Pandas dataframe.

def prediction_func(spark_df):

    import pandas as pd 
    # Lines of code to convert spark df to pandas df 
    # Calling prophet model with the converted pandas df 
    return pandas_df 

predictions = spark_df.groupby('category').apply(prediction_func)

The entire process is taking around 1.5 hours on Dataproc.

I am sure there is a better way of grouping and partitioning the data before applying prediction_func.

Any advice would be much appreciated.

Since your code doesn't depend on the grouping variable, you should drop groupBy completely and use a scalar UDF in place of a Grouped Map UDF.

This way you won't need a shuffle, and you'll be able to utilize data locality and the available resources.

You'll have to redefine your function to take all the required columns and return a pandas.Series:

import pandas

def prediction_func(*cols: pandas.Series) -> pandas.Series:
    ...  # Combine cols into a single pandas.DataFrame and apply the model
    return ...  # Convert result to pandas.Series and return
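
To make the signature above a bit more concrete, here is a minimal plain-pandas sketch of such a function. The column order (date, unique_id, category) and the weekend rule are assumptions made up for illustration; the rule only stands in for the real model call. The one property that matters is that the returned Series has exactly one value per input row, which is what a scalar pandas UDF requires.

```python
import pandas as pd


def prediction_func(*cols: pd.Series) -> pd.Series:
    # Hypothetical column order: the caller passes date, unique_id, category.
    names = ["date", "unique_id", "category"]
    pdf = pd.DataFrame(dict(zip(names, cols)))

    # Placeholder rule standing in for the real model: flag weekend dates.
    is_weekend = pd.to_datetime(pdf["date"]).dt.dayofweek >= 5

    # One output value per input row, in the same order as the input.
    return is_weekend.astype("int64")


dates = pd.Series(["2024-01-05", "2024-01-06", "2024-01-07"])  # Fri, Sat, Sun
ids = pd.Series(["a1", "a1", "b2"])
cats = pd.Series(["c1", "c1", "c2"])
print(prediction_func(dates, ids, cats).tolist())  # [0, 1, 1]
```

Because this version has no Spark dependency, it can be checked locally on a few rows before it is wrapped in a UDF.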

Example usage:

from pyspark.sql.functions import PandasUDFType, pandas_udf, rand
import pandas as pd

df = spark.range(100).select(rand(1), rand(2), rand(3)).toDF("x", "y", "z")

@pandas_udf("double", PandasUDFType.SCALAR)
def dummy_prediction_function(x, y, z):
    pdf = pd.DataFrame({"x": x, "y": y, "z": z})
    pdf["prediction"] = 1.0
    return pdf["prediction"]

df.withColumn("prediction", dummy_prediction_function("x", "y", "z")).show(3)
+-------------------+-------------------+--------------------+----------+       
|                  x|                  y|                   z|prediction|
+-------------------+-------------------+--------------------+----------+
|0.13385709732307427| 0.2630967864682161| 0.11641995793557336|       1.0|
| 0.5897562959687032|0.19795734254405561|   0.605595773295928|       1.0|
|0.01540012100242305|0.25419718814653214|0.006007018601722036|       1.0|
+-------------------+-------------------+--------------------+----------+
only showing top 3 rows
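
One thing worth checking before switching: a scalar pandas UDF is called on batches of rows (the batch size is governed by spark.sql.execution.arrow.maxRecordsPerBatch, 10,000 by default), not on the whole column or a whole group. The plain-pandas sketch below imitates that batching with a hypothetical helper, apply_in_batches, and two made-up functions. It is only meant to show that a function using just the values in each row gives the same answer however the rows are split, while a function that aggregates over its input does not.

```python
import pandas as pd


def row_wise_score(x: pd.Series, y: pd.Series) -> pd.Series:
    # Uses only values from the same row, so batching cannot change the result.
    return (x + y > 1.0).astype("int64")


def batch_dependent(x: pd.Series, y: pd.Series) -> pd.Series:
    # Depends on the mean of the rows it is given, so batch boundaries matter.
    return (x > x.mean()).astype("int64")


def apply_in_batches(pdf, func, batch_size):
    # Imitate a scalar pandas UDF: call func on consecutive slices, then concatenate.
    parts = [
        func(pdf["x"].iloc[i:i + batch_size], pdf["y"].iloc[i:i + batch_size])
        for i in range(0, len(pdf), batch_size)
    ]
    return pd.concat(parts)


pdf = pd.DataFrame({"x": [0.1, 0.2, 0.8, 0.9], "y": [0.95, 0.1, 0.3, 0.05]})

print(row_wise_score(pdf["x"], pdf["y"]).tolist())                # [1, 0, 1, 0]
print(apply_in_batches(pdf, row_wise_score, 2).tolist())          # [1, 0, 1, 0]
print(batch_dependent(pdf["x"], pdf["y"]).tolist())               # [0, 0, 1, 1]
print(apply_in_batches(pdf, batch_dependent, 2).tolist())         # [0, 1, 0, 1]
```

If the real prediction function behaves like batch_dependent, for instance because it needs every row of a series at once, the scalar approach would change its results.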
