
Apache Spark: Spark executor pod isn't able to pull Docker image from a registry/repo

I'm new to Apache Spark.

I'm trying to run a Spark session using PySpark, configured with 2 executors. Both executor pods need to pull my custom-built Spark image, which is in a repository.

Below is the Python configuration for my Spark session/job:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sparkpi-test1")
    .master("k8s://https://kubernetes.default:443")
    .config("spark.kubernetes.container.image", "<repo>")
    .config("spark.kubernetes.authenticate.caCertFile", "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
    .config("spark.kubernetes.authenticate.oauthTokenFile", "/var/run/secrets/kubernetes.io/serviceaccount/token")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-driver-0")
    .config("spark.executor.instances", "2")
    .config("spark.driver.host", "test")
    .config("spark.driver.port", "20020")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.1,org.apache.spark:spark-avro_2.12:3.1.2")
    .config("spark.kubernetes.node.selector.testNodeCategory", "ondemand")
    .getOrCreate()
)

sparkpi-test1-2341a185c8144b60-exec-1   0/1   ImagePullBackOff   0   5h17m
sparkpi-test1-2341a185c8144b60-exec-2   0/1   ImagePullBackOff   0   5h17m

So, correct me if I'm doing anything wrong. I'm trying to set up Spark in my existing Kubernetes cluster using my custom-built Spark image in a repository, and I referenced that image in the configuration in my Python file:

.config("spark.kubernetes.container.image", "<repo>")

According to the docs:

Container image to use for the Spark application. This is usually of the form example.com/repo/spark:v1.0.0. This configuration is required and must be provided by the user, unless explicit images are provided for each different container type.
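In other words, the value should be a fully qualified image reference of the form the docs describe (registry host, repository, name, and tag), not just a bare repo name. A hypothetical example of such a reference:

```shell
# Hypothetical fully qualified image reference, following the docs' form
# example.com/repo/spark:v1.0.0 (registry host / repository / name:tag):
IMAGE="example.com/myrepo/spark:v1.0.0"
# It would then be passed to Spark as the value of
# spark.kubernetes.container.image in the session config.
echo "$IMAGE"
```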

Why are my executor pods failing to pull the image from the registry? How do I pull it manually onto the executor nodes for the time being?
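The usual way to find the concrete pull error, and to work around it by hand, is with kubectl and the node's container runtime. This is only a sketch, not verified against this cluster; the image reference and the secret name below are placeholders:

```shell
# 1) Read the Events section of a failing executor pod: it names the exact
#    pull error (unauthorized, manifest unknown, DNS failure, ...).
kubectl describe pod sparkpi-test1-2341a185c8144b60-exec-1 | tail -n 20

# 2) To pull the image manually, run this on the node hosting the pod
#    (use `crictl pull` instead when the node runs containerd):
docker pull example.com/myrepo/spark:v1.0.0   # placeholder image reference

# 3) If the error is an authentication failure, create a registry secret
#    (name is a placeholder) and point Spark at it:
kubectl create secret docker-registry my-registry-secret \
  --docker-server=example.com \
  --docker-username="$REGISTRY_USER" \
  --docker-password="$REGISTRY_PASS"
# then add to the session config:
#   .config("spark.kubernetes.container.image.pullSecrets", "my-registry-secret")
```

These commands depend on cluster access and credentials, so treat them as a starting point rather than a drop-in fix.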

Just for reference, here is the related error message:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I guess the above error message appears because my executor pods weren't created successfully.

I've figured it out. I was using Terraform to build all the resources. The .tfstate file had changed, which was causing the pods to throw these errors.

Clearing the Terraform cache solved my problem.

To clear the Terraform cache, run

rm -rf .terraform 

in your Terraform directory.
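A fuller sketch of the cleanup, assuming the standard Terraform workflow (after clearing the cache you need to re-initialize before applying anything again):

```shell
# Run from the Terraform directory that manages the cluster resources.
rm -rf .terraform   # remove the local provider/module cache
terraform init      # re-download providers and re-link the backend/state
terraform plan      # confirm the state now matches the real cluster
```

`terraform init` and `terraform plan` here are the standard CLI commands; the exact re-initialization steps may differ if you use a remote backend.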
