
Pyspark monitoring metrics not making sense

I am trying to understand the Spark UI and the HDFS UI while using pyspark. These are the properties of the session I am running:

pyspark --master yarn --num-executors 4 --executor-memory 6G --executor-cores 3 --conf spark.dynamicAllocation.enabled=false --conf spark.executor.memoryOverhead=2G --conf spark.memory.offHeap.size=2G --conf spark.pyspark.memory=2G

I ran a simple piece of code that reads a file (~9 GB on disk) into memory twice, then joins the two DataFrames, persists the result, and runs a count action.

#Reading the same file twice (header/inferSchema are CSV options and have no effect on parquet)
df_sales = spark.read.format("parquet").load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_sales_copy = spark.read.format("parquet").load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
#caching one
from pyspark import StorageLevel
df_sales = df_sales.persist(StorageLevel.MEMORY_AND_DISK)

#merging the two read files
df_merged = df_sales.join(df_sales_copy,df_sales.order_id==df_sales_copy.order_id,'inner')
df_merged = df_merged.persist(StorageLevel.MEMORY_AND_DISK)
#calling an action to trigger the transformations
df_merged.count()
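For context on what I expected: as far as I understand, `StorageLevel.MEMORY_AND_DISK` caches partitions in memory first and spills to disk only the partitions that no longer fit. A toy model of that behaviour (plain Python, not Spark's actual implementation; the partition sizes and memory budget below are made up for illustration):

```python
# Toy model of StorageLevel.MEMORY_AND_DISK semantics (illustration only,
# not Spark's real code): partitions go to memory first, and only the ones
# that no longer fit in the memory budget are spilled to disk.

def cache_partitions(partition_sizes_mb, memory_budget_mb):
    """Return (in_memory, on_disk) lists of partition sizes in MB."""
    in_memory, on_disk = [], []
    used = 0
    for size in partition_sizes_mb:
        if used + size <= memory_budget_mb:
            in_memory.append(size)  # fits: keep the block in memory
            used += size
        else:
            on_disk.append(size)    # memory full: spill this block to disk
    return in_memory, on_disk

# Ten 500 MB partitions against a 1800 MB storage budget (made-up numbers)
mem, disk = cache_partitions([500] * 10, memory_budget_mb=1800)
print(len(mem), "partitions in memory,", len(disk), "spilled to disk")
```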

I expect:

  1. The data to be persisted in memory first and then, where it doesn't fit, on disk
  2. The HDFS capacity to be used at least to the extent that the persisted data spilled to disk

Both of these expectations fail in the monitoring that follows:

Expectation 1: Failed. Actually, the data appears to be persisted on disk first and then, perhaps, in memory; I'm not sure. The following image should help. It is definitely not going to memory first, unless I'm missing something.

[Spark UI Storage screenshot]

Expectation 2: Failed. The HDFS capacity is barely used at all (only 1.97 GB).

[HDFS UI screenshot]

Can you please help me reconcile my understanding: where am I wrong in expecting the behaviour described, and what am I actually looking at in those images?

Don't persist to disk until you have a really good reason. (You should only performance-tune once you have identified a bottleneck.) It takes far longer to write to disk than to just proceed with processing the data, so persisting to disk should only be used when you have a concrete reason. Reading data in for a count is not one of those reasons.

I humbly suggest that you don't alter the Spark launch parameters unless you have a reason (and you understand them). Here, your data is not going to fit into memory because of your launch configuration. (You split the space into 2 GB allocations, which means you'll never fit 9 GB into the 6 GB you have.) I think you should consider removing all of your configuration and seeing how that changes what's used in memory. Playing with these launch parameters will help you learn what each one does, and that might help you learn more.
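To put rough numbers on "you'll never fit 9 GB into the 6 GB you have": Spark's unified memory model sets aside a fixed reserved amount and then splits the rest by `spark.memory.fraction` (default 0.6) and `spark.memory.storageFraction` (default 0.5, per the Spark configuration docs). A back-of-the-envelope sketch in plain Python, treating the results as estimates rather than measured values:

```python
# Rough sketch of Spark's unified memory model for --executor-memory 6G.
# Defaults are taken from the Spark configuration docs; actual usage varies.
executor_memory_mb = 6 * 1024   # --executor-memory 6G
reserved_mb = 300               # fixed reserved memory per executor
memory_fraction = 0.6           # spark.memory.fraction (default)
storage_fraction = 0.5          # spark.memory.storageFraction (default)

unified_mb = (executor_memory_mb - reserved_mb) * memory_fraction
storage_mb = unified_mb * storage_fraction

print(f"unified execution+storage region per executor: {unified_mb:.0f} MB")
print(f"storage share protected from eviction:         {storage_mb:.0f} MB")

# With 4 executors that is roughly 4 * 3.4 GB of unified memory in total,
# shared between execution (the join, shuffles) and storage (persisted
# blocks), so two copies of a ~9 GB-on-disk parquet dataset (larger still
# once decompressed) cannot be held in memory.
```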

It is hard to provide much more advice here, because there is a lot to learn and explain. Perhaps you'll get lucky and someone else will answer your question.
