I have a Spark job running on a Dataproc cluster. How do I configure the environment to debug it on my local machine with my IDE?
After some attempts, I've figured out how to debug a Dataproc Spark job running on a cluster from your local machine.
As you may know, you can submit a Spark job using the web UI, by sending a request to the Dataproc API, or with the gcloud dataproc jobs submit spark command. Whichever way you choose, you start by adding the following key-value pair to the properties field of the SparkJob:

spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=REMOTE_PORT

where REMOTE_PORT is the port on the worker where the driver will listen. Note that suspend=y makes the driver JVM wait until a debugger attaches before running your code.
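For example, a submission via gcloud might look like this (the cluster name, region, main class, jar path, and port 9094 are placeholders for illustration):

gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=com.example.MyJob \
    --jars=gs://my-bucket/my-job.jar \
    --properties='spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=9094'

One caveat: on Java 9 or newer, the JDWP agent binds to localhost by default, so you may need address=*:9094 for it to accept connections from other machines.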
Chances are your cluster is on a private network and you need to create an SSH tunnel to the REMOTE_PORT. If that's not the case, you're lucky, and you just need to point your IDE at the worker's public IP and the specified REMOTE_PORT.
Using IntelliJ, you create a Remote JVM Debug run configuration pointing to host=worker-ip and port=REMOTE_PORT,
where worker-ip is the address of the worker that is listening (I've used 9094 as the port this time). After a few attempts, I realized it's always worker number 0, but you can connect to the worker and check whether there is a process listening on the port:

netstat -tulnp | grep REMOTE_PORT
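If you'd rather not open an interactive session, you can run the same check through gcloud directly (cluster name, PROJECT, and ZONE are placeholders; 9094 is the example port from above):

gcloud compute ssh my-cluster-w-0 --project=$PROJECT --zone=$ZONE \
    --command='netstat -tulnp | grep 9094'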
If for whatever reason your cluster does not have a public IP, you need to set up an SSH tunnel from your local machine to the worker. After specifying your ZONE and PROJECT, you create a tunnel to REMOTE_PORT:
gcloud compute ssh CLUSTER_NAME-w-0 --project=$PROJECT --zone=$ZONE -- -4 -N -L LOCAL_PORT:CLUSTER_NAME-w-0:REMOTE_PORT
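For example, forwarding local port 9094 to the same port on worker 0 of a hypothetical cluster named my-cluster:

ZONE=us-central1-a
PROJECT=my-project
gcloud compute ssh my-cluster-w-0 --project=$PROJECT --zone=$ZONE -- -4 -N -L 9094:my-cluster-w-0:9094

Everything after -- is passed to the underlying SSH client: -4 forces IPv4, -N skips running a remote command (tunnel only), and -L binds the local port to the remote one. Leave this command running while you debug.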
Finally, you set the debug configuration in your IDE pointing to host=localhost (127.0.0.1) and port=LOCAL_PORT, and you're ready to step through the driver code.