
Use PySpark at full capacity

I have a Spark cluster on Google Dataproc using Compute Engine. The cluster has 1 master node with 4 cores and 16GB RAM, and 5 worker nodes with 8 cores and 32GB RAM each.

When running SparkConf().getAll() I get this result:

[('spark.eventLog.enabled', 'true'),
 ('spark.dynamicAllocation.minExecutors', '1'),
 ('spark.driver.maxResultSize', '2048m'),
 ('spark.executor.memory', '12859m'),
 ('spark.yarn.am.memory', '640m'),
 ('spark.executor.cores', '4'),
 ('spark.eventLog.dir',
  'gs://dataproc-temp-europe-west1-907569830041-jsgvqmyn/0255e376-31c9-4b52-8e63-a4fe6188eba3/spark-job-history'),
 ('spark.executor.instances', '2'),
 ('spark.yarn.unmanagedAM.enabled', 'true'),
 ('spark.submit.deployMode', 'client'),
 ('spark.extraListeners',
  'com.google.cloud.spark.performance.DataprocMetricsListener'),
 ('spark.driver.memory', '4096m'),
 ('spark.sql.cbo.joinReorder.enabled', 'true'),
 ('spark.sql.autoBroadcastJoinThreshold', '96m'),
 ('spark.shuffle.service.enabled', 'true'),
 ('spark.metrics.namespace',
  'app_name:${spark.app.name}.app_id:${spark.app.id}'),
 ('spark.scheduler.mode', 'FAIR'),
 ('spark.yarn.historyServer.address', 'congenial-sturdy-bassoon-m:18080'),
 ('spark.sql.adaptive.enabled', 'true'),
 ('spark.yarn.jars', 'local:/usr/lib/spark/jars/*'),
 ('spark.scheduler.minRegisteredResourcesRatio', '0.0'),
 ('spark.hadoop.hive.execution.engine', 'mr'),
 ('spark.app.name', 'PySparkShell'),
 ('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2'),
 ('spark.dynamicAllocation.maxExecutors', '10000'),
 ('spark.ui.proxyBase', '/proxy/application_1663842742689_0013'),
 ('spark.master', 'yarn'),
 ('spark.ui.port', '0'),
 ('spark.sql.catalogImplementation', 'hive'),
 ('spark.rpc.message.maxSize', '512'),
 ('spark.executorEnv.OPENBLAS_NUM_THREADS', '1'),
 ('spark.submit.pyFiles', ''),
 ('spark.yarn.isPython', 'true'),
 ('spark.dynamicAllocation.enabled', 'true'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.history.fs.logDirectory',
  'gs://dataproc-temp-europe-west1-907569830041-jsgvqmyn/0255e376-31c9-4b52-8e63-a4fe6188eba3/spark-job-history'),
 ('spark.sql.cbo.enabled', 'true')]

I don't understand why the parameter spark.executor.memory is set to 12859m when I have 32GB PER WORKER. The same goes for spark.executor.cores, which is set to 4 when each of my workers has 8 cores.

Is it normal to use so few resources, or should I set these values when starting my SparkSession? The code I use for now is:

spark = SparkSession \
    .builder \
    .appName('my_app') \
    .getOrCreate()

I read something about yarn.nodemanager.resource.memory-mb, but I am not sure if it applies to PySpark clusters.

Thank you in advance for your help.

Edit: To add more context, I am trying to read 10M+ JSON files from Google Cloud Storage, and whatever I try I end up with an OOM error from the JVM. Is there something I can set specifically to solve that kind of problem?

Ideally you can use up to 75 to 80 percent of your resources in a single executor. Let's say you have an executor with 8 cores and 16GB RAM - you can use around 6 cores and 12GB RAM for Spark (leaving the remaining resources for other operations like the OS, memory allocation, etc. on that VM or pod).
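Applied to the worker shape in the question (8 cores and 32GB RAM each), a rough back-of-the-envelope sketch of that rule might look like this (the exact amounts reserved for the OS and YARN overhead are illustrative assumptions, not Dataproc defaults):

# Rough per-worker sizing sketch; reserved amounts are illustrative assumptions
worker_cores = 8
worker_mem_gb = 32

usable_cores = worker_cores - 1            # leave ~1 core for the OS and node daemons
usable_mem_gb = int(worker_mem_gb * 0.8)   # keep ~20% for OS, YARN, and off-heap overhead

print(usable_cores, usable_mem_gb)         # roughly 7 cores and ~25 GB per worker left for executors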

This doc has details around how to size executors for Spark: https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html

You can use those params in your Spark config - --num-executors, --executor-cores, and --executor-memory - and play around with your Spark job to see which config and infrastructure suit your use case.
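For a session created in code, as in the question, the equivalent settings can also be passed through the SparkSession builder. A minimal sketch, where the instance count, cores, and memory values are illustrative assumptions rather than tuned numbers:

from pyspark.sql import SparkSession

# Example values only - tune them against your workers' actual cores/RAM
spark = SparkSession \
    .builder \
    .appName('my_app') \
    .config('spark.executor.instances', '5') \
    .config('spark.executor.cores', '7') \
    .config('spark.executor.memory', '24g') \
    .getOrCreate()

Note that with dynamic allocation enabled (as in the config dump above), spark.executor.instances roughly acts as the initial number of executors; YARN can still scale the executor count up and down on its own.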

