
Applying cache() and count() to a Spark Dataframe in Databricks is very slow [pyspark]

I have a Spark dataframe with 5 million rows in a Databricks cluster. What I want is to cache this Spark dataframe and then apply .count(), so that the next operations run extremely fast. I have done this in the past with 20,000 rows and it worked. However, in my attempt to do it here I ran into the following paradox:
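A minimal sketch of the pattern I mean (illustrative only; df stands in for any Spark dataframe): cache() is lazy and only marks the dataframe for caching, while the count() action is what actually materializes it in memory.

# Illustrative sketch only: `df` stands in for any Spark dataframe.
df.cache()   # lazy: marks the dataframe for caching, nothing is computed yet
df.count()   # action: triggers the computation and populates the cache
# Subsequent operations on `df` should then read from the cached data.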

Dataframe creation

Step 1: Read 8 million rows from an Azure Data Lake storage account

# Note: sql_function is pyspark.sql.functions (imported in Step 3 below); `schema` and
# `columns_selected` are defined elsewhere, and `datalake_spark_dataframe` is accumulated
# by union across the Avro batches that are read.
read_avro_data=spark.read.format("avro").load(list_of_paths) #list_of_paths[0]='abfss://storage_container_name@storage_account_name.dfs.core.windows.net/folder_1/folder_2/0/2020/06/02/00/00/27.avro'
avro_decoded=read_avro_data.withColumn('Body_decoded', sql_function.decode(read_avro_data.Body, charset="UTF-8")).select("Body_decoded")
datalake_spark_dataframe=datalake_spark_dataframe.union(avro_decoded.withColumn("Body_decoded", sql_function.from_json("Body_decoded", schema)).select(*['Body_decoded.{}'.format(x) for x in columns_selected]))

datalake_spark_dataframe.printSchema()
"root
 |-- id: string (nullable = true)
 |-- BatteryPercentage: float (nullable = true)
 |-- SensorConnected: integer (nullable = false)
 |-- TemperatureOutside: float (nullable = true)
 |-- ReceivedOn: string (nullable = true)"

datalake_spark_dataframe.rdd.getNumPartitions() # 635 partitions

This dataframe has 8 million rows. With 8 million rows my application runs pretty well, but I wanted to stress-test it in a big-data environment, because 8 million rows is not big data. So I replicated my 8-million-row Spark dataframe 287 times, giving roughly 2.2 billion rows. To do the replication I did the following:

Step 2: Replicate the 8-million-row dataframe

datalake_spark_dataframe_new=datalake_spark_dataframe
for i in range(287):
  print(i)
  datalake_spark_dataframe_new=datalake_spark_dataframe_new.union(datalake_spark_dataframe)
  print("done on iteration: {0}".format(i))

datalake_spark_dataframe_new.rdd.getNumPartitions() #182880

With the final 2.2-billion-row dataframe, I applied a time-window GroupBy to my data, ending up with a few million rows. That is the roughly 5 million rows of the grouped dataset I mentioned at the top of my question.

Step 3: GroupBy the 2.2-billion-row dataframe by a time window of 6 hours & apply .cache() and .count()

%sql set spark.sql.shuffle.partitions=100
import pyspark.sql.functions as sql_function
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, BooleanType, DateType, DoubleType, ArrayType

datalake_spark_dataframe_downsampled=datalake_spark_dataframe_new.withColumn(timestamp_column, sql_function.to_timestamp(timestamp_column, "yyyy-MM-dd HH:mm"))
datalake_spark_dataframe_downsampled=datalake_spark_dataframe_downsampled.groupBy("id", sql_function.window("ReceivedOn","{0} minutes".format(time_interval)))\
                                                                         .agg(
                                                                              sql_function.mean("BatteryPercentage").alias("BatteryPercentage"),
                                                                              sql_function.mean("SensorConnected").alias("OuterSensorConnected"),
                                                                              sql_function.mean("TemperatureOutside").alias("averageTemperatureOutside"))

columns_to_drop=['window']
datalake_spark_dataframe_downsampled=datalake_spark_dataframe_downsampled.drop(*columns_to_drop)

# From 2.2 billion rows down to 5 million rows after the GroupBy...
datalake_spark_dataframe_downsampled=datalake_spark_dataframe_downsampled.repartition(100) # repartition returns a new DataFrame, so the result must be assigned
datalake_spark_dataframe_downsampled.cache()
datalake_spark_dataframe_downsampled.count() # job execution takes for ever

datalake_spark_dataframe_downsampled.rdd.getNumPartitions() #100 after re-partition

Spark UI before the .count(): [screenshot]

Spark UI during the count execution: [screenshot]

When I apply the following commands to my Spark dataframe, the task takes more than 3 hours to complete and in the end fails.

I want to add that the job showed the same execution-time behaviour before and after the repartitioning. I did the repartitioning in case the default values were making the job run very slowly, and I kept adding partitions in the hope that the job would execute faster.

%sql set spark.sql.shuffle.partitions=1000000
datalake_spark_dataframe_downsampled=datalake_spark_dataframe_downsampled.repartition(1000000)

datalake_spark_dataframe_downsampled.cache()
datalake_spark_dataframe_downsampled.count()

Below is the output of the Spark job: [screenshot]

The error I get: [screenshot]

My cluster resources: [screenshot]

As you can see, it is not a matter of RAM or CPU cores, as I have plenty of both. Why does the job split into only 5 stages even after I apply repartitioning? And how can I split the jobs so that the .cache() and .count() commands run faster given my 48 vCPU cores?


Screenshots provided per job execution. Execution on 80 million rows (8m * 10 iterations = 80m rows): [screenshot]

I had a similar issue in the past while iterating through a for loop, since my number of iterations is dynamic and depends on the input combination.

I resolved the performance issue by persisting the data on each iteration (you can try persisting to ADLS2, or to HDFS / Hive tables if you are on-prem). In the next iteration, read again from that location, union, and overwrite the same location again. There is network lag and it is not efficient, but it still brought the execution time down by a factor of 10.

A possible reason could be Spark lineage (I believe that on every iteration it re-executes all previous iterations again and again). Persisting the data with overwrite avoids that. I tried cache() and other options as well, but they did not help me.
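(As a side note, a related way to cut the lineage without writing to external storage is Spark's checkpointing. This is not what I used here, just a minimal sketch of the idea; the checkpoint directory is a placeholder path and the checkpoint frequency is arbitrary.)

# Hypothetical sketch only: lineage truncation via checkpointing, not the
# persist-to-storage approach used in this answer. Placeholder path below.
spark.sparkContext.setCheckpointDir("abfss://<ADLS_PATH>/checkpoints")
df = datalake_spark_dataframe
for i in range(287):
  df = df.union(datalake_spark_dataframe)
  if (i + 1) % 50 == 0:
    df = df.checkpoint(eager=True)  # materializes df and drops the lineage built so far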

Edit #1: Try something like this

datalake_spark_dataframe_new=datalake_spark_dataframe
datalake_spark_dataframe.write.mode("overwrite").option("header", "true").format("parquet").save("abfss://<ADLS_PATH>")
for i in range(287):
  print(i)
  datalake_spark_dataframe_new=spark.read.parquet("abfss://<ADLS_PATH>")
  datalake_spark_dataframe_new.union(datalake_spark_dataframe).write.mode("overwrite").option("header", "true").format("parquet").save("abfss://<ADLS_PATH>")
  print("done on iteration: {0}".format(i))

Edit #2: This should be more efficient than the previous version,

for i in range(287):
  print(i)
  datalake_spark_dataframe.write.mode("append").option("header", "true").format("parquet").save("abfss://<ADLS_PATH>")
  print("done on iteration: {0}".format(i))

datalake_spark_dataframe_new=spark.read.parquet("abfss://<ADLS_PATH>")

I think you have used a very large shuffle partition number, 1000000, which is why the job is taking so much time to complete.

I would follow the logic below to calculate the shuffle partition count based on the data size. For example:

Say 5 million rows comes to around 20 GB of data.

shuffle stage input = 20 GB

So the total number of shuffle partitions is 20000 MB / 200 MB = 100.

Assume there are only 50 cores in the cluster; in that case the shuffle partition value is 50. With 200 cores in the cluster, the shuffle partition value would be 200.

Choosing a high value for the shuffle partitions means a lot of data gets shuffled, so tasks will take more time to complete, or sometimes the job might fail.

spark.sql.shuffle.partitions=50 // 50 or 100 for a better option.
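For reference, a minimal sketch of this sizing arithmetic applied from PySpark (the 20 GB input and the 200 MB per-partition target are the example figures above, and spark.conf.set is the programmatic equivalent of the %sql set used in the question):

# Sizing arithmetic from the example above; the figures are illustrative only.
shuffle_input_mb = 20 * 1000          # ~20 GB of shuffle-stage input
target_partition_mb = 200             # rough target size per shuffle partition
num_partitions = shuffle_input_mb // target_partition_mb   # 20000 / 200 = 100
# If the cluster has fewer cores than this, the suggestion above is to use the core count instead.
spark.conf.set("spark.sql.shuffle.partitions", num_partitions)  # instead of 1000000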
