
Running an Apache Spark job from a Spring web application using YARN client or any alternate way

I have recently started using Spark, and I want to run a Spark job from a Spring web application.

I have a situation where I am running a web application on a Tomcat server using Spring Boot. My web application receives a REST web service request, and based on that it needs to trigger a Spark calculation job on a YARN cluster. Since my job can take a long time to run and accesses data from HDFS, I want to run the Spark job in yarn-cluster mode, and I don't want to keep a Spark context alive in my web layer. Another reason is that my application is multi-tenant, so each tenant can run its own job; in yarn-cluster mode each tenant's job can start its own driver and run in its own Spark cluster. I assume I can't run multiple Spark contexts in the single web app JVM.

I want to trigger Spark jobs in yarn-cluster mode from a Java program in my web application. What is the best way to achieve this? I am exploring various options and would like your guidance on which one is best.

1) I can use the spark-submit command-line shell to submit my jobs. But to trigger it from my web application I need to use either the Java ProcessBuilder API or some package built on top of ProcessBuilder. This has two issues. First, it doesn't sound like a clean way of doing it; I should have a programmatic way of triggering my Spark applications. Second, I will lose the ability to monitor the submitted application and get its status. The only crude way of doing that is reading the output stream of the spark-submit shell, which again doesn't sound like a good approach (a rough sketch follows below).
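For illustration, here is a minimal sketch of that ProcessBuilder approach; the jar path, main class and application argument are placeholders rather than my actual setup, spark-submit is assumed to be on the PATH, and exception handling is omitted:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch of option 1: shell out to spark-submit via ProcessBuilder.
ProcessBuilder pb = new ProcessBuilder(
        "spark-submit",
        "--master", "yarn-cluster",
        "--class", "com.example.MySparkJob",   // hypothetical main class
        "/path/to/my-spark-job.jar",           // hypothetical application jar
        "tenant-42");                          // example application argument
pb.redirectErrorStream(true);                  // merge stdout and stderr
Process process = pb.start();

// The only feedback available is the process output and its exit code.
try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);              // crude monitoring via the output stream
    }
}
int exitCode = process.waitFor();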

2) I tried using the YARN Client API to submit the job from the Spring application. Following is the code that I use to submit a Spark job using the YARN Client:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;

Configuration config = new Configuration();    // Hadoop config; empty unless *-site.xml files are found on the classpath
System.setProperty("SPARK_YARN_MODE", "true");
SparkConf conf = new SparkConf();
ClientArguments cArgs = new ClientArguments(sparkArgs, conf);   // sparkArgs: spark-submit style String[] (--jar, --class, --arg, ...)
Client client = new Client(cArgs, config, conf);
client.run();                                  // submits the application to YARN
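For context, sparkArgs here is a String[] in the style that the YARN ClientArguments parser expects; a hypothetical example for a Spark 1.x setup (the jar path, class name and argument are placeholders, and the exact flags depend on the Spark version):

// Hypothetical arguments for ClientArguments; paths and class names are placeholders.
String[] sparkArgs = new String[] {
        "--jar", "/path/to/my-spark-job.jar",
        "--class", "com.example.MySparkJob",
        "--arg", "tenant-42"
};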

But when I run the above code, it tries to connect to localhost only. I get this error:

15/08/05 14:06:10 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
15/08/05 14:06:12 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

So I don't think it can connect to the remote machine.

Please suggest what the best way of doing this is with the latest version of Spark. Later I have plans to deploy this entire application on Amazon EMR, so the approach should work there as well.

Thanks in advance

Spark JobServer might help: https://github.com/spark-jobserver/spark-jobserver . This project receives RESTful web requests and starts a Spark job; results are returned as a JSON response.
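As a rough illustration of that flow, here is a minimal sketch of submitting a job from Java over HTTP; it assumes a job server is already running on localhost:8090 with an application jar uploaded under the name "test", the classPath value is a placeholder, and exception handling is omitted:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch only: POST a job request to an assumed job server instance and print the JSON response.
URL url = new URL("http://localhost:8090/jobs?appName=test&classPath=com.example.MySparkJob");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("POST");
conn.setDoOutput(true);
conn.getOutputStream().write("input.string = a b c".getBytes(StandardCharsets.UTF_8)); // job config payload
try (InputStream in = conn.getInputStream()) {
    System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));         // JSON response with the job status
}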

I also had similar issues trying to run a Spark app that connects to a YARN cluster: with no cluster configuration available, it was trying to connect to the local machine as the master node of the cluster, which obviously failed.

It worked for me once I placed core-site.xml and yarn-site.xml on the classpath (src/main/resources in a typical sbt or Maven project structure); the application then connected to the cluster correctly.

When using spark-submit, the location of those files is typically specified by the HADOOP_CONF_DIR environment variable, but for a stand-alone application it had no effect.
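If placing the files on the classpath is not convenient, the same effect can presumably be achieved programmatically by loading them into the Hadoop Configuration before building the YARN client; a minimal sketch, assuming the cluster configuration lives under /etc/hadoop/conf:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Sketch: load the cluster's config files explicitly instead of relying on the classpath
// or HADOOP_CONF_DIR; the /etc/hadoop/conf location is an assumption.
Configuration config = new Configuration();
config.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
config.addResource(new Path("/etc/hadoop/conf/yarn-site.xml"));
// Pass this Configuration to the YARN Client, as in the question's snippet.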
