I'm running PySpark (2.3) on a Dataproc cluster with
The data has close to 1.3 million rows and 4 columns:
Date, unique_id (alphanumeric), category (10 distinct values) and Prediction (0 or 1).
PS: This is time series data.
We are using Facebook's Prophet model for predictive modelling, and since Prophet only accepts pandas DataFrames as input, below is what I am doing to convert the Spark DataFrame to a pandas DataFrame.
def prediction_func(spark_df):
    import pandas as pd
    # Lines of code to convert spark df to pandas df
    # Calling prophet model with the converted pandas df
    return pandas_df
predictions = spark_df.groupby('category').apply(prediction_func)
The entire process takes around 1.5 hours on Dataproc.
I am sure there is a better way of grouping and partitioning the data before applying prediction_func.
Any advice would be much appreciated.
Since your code doesn't depend on the grouping variable, you should drop groupBy completely and use a scalar UDF in place of the Grouped Map.
This way you won't need a shuffle, and you'll be able to utilize data locality and available resources.
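A scalar UDF receives the column data in batches, so this swap only works if the function gives the same answer however the rows are split. A rough way to check that locally, without Spark, is to apply the function to the whole frame and to arbitrary chunks, then compare. The `score` function, the column names and the chunk size below are made up for the illustration; the chunks only imitate the batches Spark would pass.

```python
import pandas as pd


def score(pdf: pd.DataFrame) -> pd.Series:
    # Hypothetical row-wise "model": uses only values from the same row.
    return (pdf["x"] + pdf["y"] > 1.0).astype("int64")


pdf = pd.DataFrame({"x": [0.2, 0.9, 0.5, 0.7, 0.1],
                    "y": [0.3, 0.4, 0.6, 0.1, 0.95]})

# Result on the whole frame at once.
whole = score(pdf)

# Result when the rows arrive in chunks of 2, imitating batches.
chunks = [pdf.iloc[i:i + 2] for i in range(0, len(pdf), 2)]
batched = pd.concat([score(chunk) for chunk in chunks])

# True when the function does not depend on how the rows are split.
print(whole.equals(batched))
```

If the two results differ, for example because the model is fitted on whatever rows it receives, the scalar UDF is not a drop-in replacement for the grouped one.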
You'll have to redefine your function to take all the required columns and return a pandas.Series:
import pandas

def prediction_func(*cols: pandas.Series) -> pandas.Series:
    ...  # Combine cols into a single pandas.DataFrame and apply the model
    return ...  # Convert result to pandas.Series and return
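To make the skeleton concrete, here is a minimal sketch that runs locally with pandas alone, before the function is wrapped with pandas_udf. The column names (`x`, `y`, `z`) and the thresholding "model" are placeholders chosen for illustration; they are not Prophet and not part of your code.

```python
import pandas as pd


def prediction_func(*cols: pd.Series) -> pd.Series:
    # Combine the incoming columns into a single DataFrame.
    # The names x, y, z are hypothetical; use your own column names.
    pdf = pd.concat(cols, axis=1, keys=["x", "y", "z"])
    # Stand-in for the real model: predict 1 when the row mean exceeds 0.5.
    prediction = (pdf.mean(axis=1) > 0.5).astype("int64")
    return prediction.rename("prediction")


# Local check with plain pandas, no Spark session needed.
x = pd.Series([0.1, 0.9, 0.4])
y = pd.Series([0.2, 0.8, 0.6])
z = pd.Series([0.3, 0.7, 0.8])
result = prediction_func(x, y, z)
print(result.tolist())
```

When a function like this is wrapped with pandas_udf, the declared return type has to match the Series dtype, which would be "long" for the int64 values used here.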
Example usage:
from pyspark.sql.functions import PandasUDFType, pandas_udf, rand
import pandas as pd
import numpy as np
df = spark.range(100).select(rand(1), rand(2), rand(3)).toDF("x", "y", "z")
@pandas_udf("double", PandasUDFType.SCALAR)
def dummy_prediction_function(x, y, z):
    pdf = pd.DataFrame({"x": x, "y": y, "z": z})
    pdf["prediction"] = 1.0
    return pdf["prediction"]
df.withColumn("prediction", dummy_prediction_function("x", "y", "z")).show(3)
+-------------------+-------------------+--------------------+----------+
|                  x|                  y|                   z|prediction|
+-------------------+-------------------+--------------------+----------+
|0.13385709732307427| 0.2630967864682161| 0.11641995793557336|       1.0|
| 0.5897562959687032|0.19795734254405561|   0.605595773295928|       1.0|
|0.01540012100242305|0.25419718814653214|0.006007018601722036|       1.0|
+-------------------+-------------------+--------------------+----------+
only showing top 3 rows