
Python Spark Job Optimization

I'm running PySpark (2.3) on a Dataproc cluster with:

  • 3 nodes (4 CPUs)
  • 8 GB memory each.

The data has close to 1.3 million rows with 4 columns, namely:

Date, unique_id (alphanumeric), category (10 distinct values) and Prediction (0 or 1)

PS - This is time series data.

We are using Facebook's Prophet model for predictive modelling, and since Prophet only accepts Pandas DataFrames as input, below is what I am doing in order to convert the Spark DataFrame to a Pandas DataFrame.

def prediction_func(spark_df):

    import pandas as pd 
    # Lines of code to convert spark df to pandas df 
    # Calling prophet model with the converted pandas df 
    return pandas_df 

predictions = spark_df.groupby('category').apply(prediction_func)
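
For context, in PySpark 2.3 GroupedData.apply expects a grouped map pandas_udf rather than a plain Python function, so the un-abbreviated version of the snippet above presumably looks something like the sketch below; the output schema and the constant placeholder prediction are assumptions for illustration only.

from pyspark.sql.functions import PandasUDFType, pandas_udf

# Assumed output schema for illustration; adjust the column types to match the actual data
result_schema = "Date timestamp, unique_id string, category string, Prediction double"

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def prediction_func(pdf):
    # Each category's rows arrive as a single pandas DataFrame,
    # which is where the Prophet model would be fitted and applied
    pdf = pdf.copy()
    pdf["Prediction"] = 1.0  # placeholder instead of the real model output
    return pdf

predictions = spark_df.groupby("category").apply(prediction_func)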

The entire process is taking around 1.5 hours on Dataproc.

I am sure there is a better way of grouping and partitioning the data before applying prediction_func.

Any advice would be much appreciated.

Since your code doesn't depend on the grouping variable, you should drop groupBy completely and use a scalar UDF in place of the Grouped Map.

This way you won't need a shuffle, and you'll be able to utilize data locality and the available resources.

You'll have to redefine your function to take all the required columns and return a pandas.Series:

def prediction_func(*cols: pandas.Series) -> pandas.Series:
    ...  # Combine cols into a single pandas.DataFrame and apply the model
    return ...  # Convert result to pandas.Series and return

Example usage:

from pyspark.sql.functions import PandasUDFType, pandas_udf, rand
import pandas as pd

df = spark.range(100).select(rand(1), rand(2), rand(3)).toDF("x", "y", "z")

@pandas_udf("double", PandasUDFType.SCALAR)
def dummy_prediction_function(x, y, z):
    # Each argument arrives as a pandas.Series; combine them into a single DataFrame
    pdf = pd.DataFrame({"x": x, "y": y, "z": z})
    # Stand-in for the real model call: emit a constant prediction per row
    pdf["prediction"] = 1.0
    return pdf["prediction"]

df.withColumn("prediction", dummy_prediction_function("x", "y", "z")).show(3)
+-------------------+-------------------+--------------------+----------+       
|                  x|                  y|                   z|prediction|
+-------------------+-------------------+--------------------+----------+
|0.13385709732307427| 0.2630967864682161| 0.11641995793557336|       1.0|
| 0.5897562959687032|0.19795734254405561|   0.605595773295928|       1.0|
|0.01540012100242305|0.25419718814653214|0.006007018601722036|       1.0|
+-------------------+-------------------+--------------------+----------+
only showing top 3 rows
