
Apache Spark: Spark executor pod isn't able to pull Docker image from a registry/repo

I'm new to Apache Spark.

I'm trying to run a Spark session using PySpark, configured with 2 executors. Both executor pods need to pull my custom-built Spark image, which is in a repository.

Below is the Python configuration for my Spark session/job:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sparkpi-test1")
    .master("k8s://https://kubernetes.default:443")
    .config("spark.kubernetes.container.image", "<repo>")
    .config("spark.kubernetes.authenticate.caCertFile", "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
    .config("spark.kubernetes.authenticate.oauthTokenFile", "/var/run/secrets/kubernetes.io/serviceaccount/token")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-driver-0")
    .config("spark.executor.instances", "2")
    .config("spark.driver.host", "test")
    .config("spark.driver.port", "20020")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.1,org.apache.spark:spark-avro_2.12:3.1.2")
    .config("spark.kubernetes.node.selector.testNodeCategory", "ondemand")
    .getOrCreate()
)

sparkpi-test1-2341a185c8144b60-exec-1   0/1   ImagePullBackOff   0   5h17m
sparkpi-test1-2341a185c8144b60-exec-2   0/1   ImagePullBackOff   0   5h17m

So, correct me if I'm doing anything wrong. I'm trying to set up Spark in my existing Kubernetes cluster using my custom-built Spark image in a repository, and I referenced that image in the configuration in my Python file:

.config("spark.kubernetes.container.image", "<repo>")

According to the docs:

Container image to use for the Spark application. This is usually of the form example.com/repo/spark:v1.0.0. This configuration is required and must be provided by the user, unless explicit images are provided for each different container type.
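In other words, the value should be a fully qualified image reference of the form the docs describe (registry host, repository, name, and tag), not just a bare repo name. A hypothetical example of such a reference:

```shell
# Hypothetical fully qualified image reference, following the docs' form
# example.com/repo/spark:v1.0.0 (registry host / repository / name:tag):
IMAGE="example.com/myrepo/spark:v1.0.0"
# It would then be passed to Spark as the value of
# spark.kubernetes.container.image in the session config.
echo "$IMAGE"
```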

Why are my executor pods failing to pull the image from the registry? How do I pull it manually onto the executor nodes for the time being?
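The usual way to find the concrete pull error, and to work around it by hand, is with kubectl and the node's container runtime. This is only a sketch, not verified against this cluster; the image reference and the secret name below are placeholders:

```shell
# 1) Read the Events section of a failing executor pod: it names the exact
#    pull error (unauthorized, manifest unknown, DNS failure, ...).
kubectl describe pod sparkpi-test1-2341a185c8144b60-exec-1 | tail -n 20

# 2) To pull the image manually, run this on the node hosting the pod
#    (use `crictl pull` instead when the node runs containerd):
docker pull example.com/myrepo/spark:v1.0.0   # placeholder image reference

# 3) If the error is an authentication failure, create a registry secret
#    (name is a placeholder) and point Spark at it:
kubectl create secret docker-registry my-registry-secret \
  --docker-server=example.com \
  --docker-username="$REGISTRY_USER" \
  --docker-password="$REGISTRY_PASS"
# then add to the session config:
#   .config("spark.kubernetes.container.image.pullSecrets", "my-registry-secret")
```

These commands depend on cluster access and credentials, so treat them as a starting point rather than a drop-in fix.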

Just for reference, here is the related error message:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I guess the above error message appears because my executor pods weren't created successfully.

I've figured it out. I was using Terraform to build all the resources. The .tfstate file had changed, which was causing the pods to throw these errors.

Clearing the Terraform cache solved my problem.

To clear the Terraform cache, run

rm -rf .terraform 

in your Terraform directory.
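A fuller sketch of the cleanup, assuming the standard Terraform workflow (after clearing the cache you need to re-initialize before applying anything again):

```shell
# Run from the Terraform directory that manages the cluster resources.
rm -rf .terraform   # remove the local provider/module cache
terraform init      # re-download providers and re-link the backend/state
terraform plan      # confirm the state now matches the real cluster
```

`terraform init` and `terraform plan` here are the standard CLI commands; the exact re-initialization steps may differ if you use a remote backend.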
