
How to avoid DiskPressure condition of pods and their eventual eviction while running Spark job on Kubernetes?

I want to re-partition a dataset and then write it to the destination path. However, my pods are getting evicted due to DiskPressure. Spark only shows that it lost a worker, but when I look at the events in my OpenShift console, I see that the pod (worker) was evicted.

Here is how I am re-partitioning:

df = df.repartition("created_year", "created_month", "created_day")
df.write.partitionBy("created_year", "created_month", "created_day").mode("overwrite").parquet(dest_path)
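
For reference, repartition also accepts an explicit partition count together with the key columns, i.e. repartition(numPartitions, *cols). As a hedged sketch (not part of the original question, and the count of 200 is an arbitrary example), the same write with a capped shuffle would look like:

# 200 is only an example value; choose it based on data volume per partition
df = df.repartition(200, "created_year", "created_month", "created_day")
df.write.partitionBy("created_year", "created_month", "created_day").mode("overwrite").parquet(dest_path)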

There are around 38k partitions:

Job Id | Description                                        | Submitted           | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total
1      | parquet at NativeMethodAccessorImpl.java:0 (kill)  | 2020/08/11 21:35:46 | 1.5 h    | 0/2                     | 2166/38281 (5633 failed)

The Spark configuration is as follows:

import pyspark

# HOSTNAME is assumed to be defined elsewhere (the driver pod's hostname)

def create_spark_config(spark_cluster, executor_memory='16g', executor_cores='4', max_cores='16'):
    print('Spark cluster is: {}'.format(spark_cluster))
    sc_conf = (
        pyspark.SparkConf().setMaster(spark_cluster)
        .set('spark.driver.host', HOSTNAME)
        .set('spark.driver.port', 42000)
        .set('spark.driver.bindAddress', '0.0.0.0')
        .set('spark.driver.blockManager.port', 42100)
        .set('spark.executor.memory', '5G')
        .set('spark.driver.memory', '3G')
        .set('spark.sql.parquet.enableVectorizedReader', True)
        .set('spark.sql.files.ignoreCorruptFiles', True)
    )
    return sc_conf
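
Shuffle files and spilled data are written under spark.local.dir, which defaults to /tmp (and, in a standalone cluster, can be overridden by SPARK_LOCAL_DIRS on the workers), so they land on the node's local disk unless that path is backed by a larger volume. As a minimal sketch, assuming a mount such as /mnt/spark-scratch exists in the worker pods (it is not part of the original setup), the relevant settings would be:

sc_conf = (
    pyspark.SparkConf().setMaster(spark_cluster)
    # Assumed mount point with enough room for shuffle/spill files
    .set('spark.local.dir', '/mnt/spark-scratch')
    # Fewer, larger shuffle partitions mean fewer intermediate files (default is 200)
    .set('spark.sql.shuffle.partitions', '400')
)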

I am not able to figure out what is causing the DiskPressure, or how I can stop it.

I have read some answers and articles about DiskPressure and how it is handled, but they were fairly generic and not specific to Spark.

Spark has 6 workers, each with 5GB of memory and 6 cores.

DiskPressure is a condition that arises when the containers' disk usage grows by a large margin, so that the node on which the pod is running faces a crunch in disk availability. This crunch would be something like less than 5-10% of total capacity remaining.

In such an event, the kubelet sets the DiskPressure condition on the node (which in turn is marked not ready for scheduling), so newer pods are not scheduled there and existing pods are evicted (and re-scheduled to other nodes with availability) to maintain pod uptime.

The most common cause of disk pressure is missing log rotation (debug logs); another is large amounts of data being written on a node with limited disk.
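
To confirm whether a node is actually reporting the condition, here is a minimal sketch using the official kubernetes Python client (an assumption on my part, not something the original setup necessarily includes):

from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    for cond in node.status.conditions:
        if cond.type == "DiskPressure":
            print(node.metadata.name, cond.status, cond.message)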

Edit: my answer is generic and not specific to the Spark scenario.
