
Unable to overwrite default value of "spark.sql.shuffle.partitions" with Spark Structured Streaming

I want to overwrite the spark.sql.shuffle.partitions parameter directly within the code:

val sparkSession = SparkSession
  .builder()
  .appName("SPARK")
  .getOrCreate()

sparkSession.conf.set("spark.sql.shuffle.partitions", 2)

But this setting does not take effect, since I get the following warning message in the logs:

WARN  OffsetSeqMetadata:66 - Updating the value of conf 'spark.sql.shuffle.partitions' in current session from '2' to '200'.

While the same parameter works when passed in a spark-submit shell script:

#!/bin/bash

/app/spark-2/bin/spark-submit \
--queue root.dev \
--master yarn \
--deploy-mode cluster \
--driver-memory 5G \
--executor-memory 4G \
--executor-cores 2 \
--num-executors 4 \
--conf spark.app.name=SPARK \
--conf spark.executor.memoryOverhead=2048 \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.sql.shuffle.partitions=2 \
--class com.dev.MainClass

Any ideas?

In the checkpoint files of your Spark Structured Streaming job, some of the sparkSession configurations are stored.

For example, in the folder "offsets" the content for the latest batch could look like:

v1
{"batchWatermarkMs":0,"batchTimestampMs":1619782960476,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.join.stateFormatVersion":"2","spark.sql.streaming.stateStore.compression.codec":"lz4","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}
4

Among others, it stores the value of the configuration spark.sql.shuffle.partitions, which in my example is set to the default value of 200.
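
If you want to verify which value is stored, you can simply print the latest offsets file. Below is a minimal sketch, assuming the checkpoint directory lives on the local filesystem under the hypothetical path /tmp/checkpoints/my-query (for a checkpoint on HDFS you would read the file via the Hadoop FileSystem API or hdfs dfs -cat instead):

import java.io.File
import scala.io.Source

// Hypothetical checkpoint location; offset files live in its "offsets" subfolder.
val offsetsDir = new File("/tmp/checkpoints/my-query/offsets")

// Offset files are named after the batch id, so the highest number is the latest batch.
val latestBatchFile = offsetsDir.listFiles()
  .filter(f => f.getName.forall(_.isDigit))
  .maxBy(_.getName.toLong)

// Prints the version line, the metadata JSON (including spark.sql.shuffle.partitions)
// and the offsets of the latest batch.
val source = Source.fromFile(latestBatchFile)
try source.getLines().foreach(println) finally source.close()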

In the Spark source code you will see that this configuration value gets replaced if it is available in the metadata of your checkpoint files.
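
To make the behaviour a bit more concrete, here is an illustrative sketch of the idea (not the actual Spark source): on restart, the engine takes the conf map stored in the latest offsets file and overrides the corresponding session values, which is exactly what the warning above reports.

import org.apache.spark.sql.RuntimeConfig

// Illustrative only: mimics the effect of restoring checkpointed confs into the session.
def restoreCheckpointedConf(checkpointedConf: Map[String, String],
                            sessionConf: RuntimeConfig): Unit = {
  checkpointedConf.foreach { case (key, checkpointedValue) =>
    sessionConf.getOption(key) match {
      case Some(sessionValue) if sessionValue != checkpointedValue =>
        // This is where the "Updating the value of conf ..." warning comes from.
        sessionConf.set(key, checkpointedValue)
      case _ => // value missing or already consistent, nothing to do
    }
  }
}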

If you really have to change the number of partitions, either remove all your checkpoint files or manually change the value to 2 in the latest checkpoint file.
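
For the first option, here is a minimal sketch of restarting the query against a fresh checkpoint location, so that the session value of 2 gets written into the new checkpoint metadata (the path /tmp/checkpoints/my-query-v2 and the rate source / console sink are just placeholders for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK")
  .config("spark.sql.shuffle.partitions", "2")
  .getOrCreate()

// Placeholder streaming source, just for illustration.
val input = spark.readStream
  .format("rate")
  .load()

// Pointing the query at a fresh checkpoint location means no old metadata
// can overwrite spark.sql.shuffle.partitions anymore.
val query = input.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/my-query-v2")
  .start()

query.awaitTermination()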
