
Deploy Apache Spark application from another application in Java, best practice

I am a new user of Spark. I have a web service that allows a user to request the server to perform a complex data analysis by reading from a database and pushing the results back to the database. I have moved those analyses into various Spark applications. Currently I use spark-submit to deploy these applications.

However, I am curious: when my web server (written in Java) receives a user request, what is considered the "best practice" way to launch the corresponding Spark application? Spark's documentation seems to suggest using spark-submit, but I would rather not pipe the command out to a terminal to perform this action. I saw an alternative, Spark-JobServer, which provides a RESTful interface for exactly this, but my Spark applications are written in either Java or R, which do not seem to interface well with Spark-JobServer.

Is there another best practice for kicking off a Spark application from a web server (in Java) and waiting for a status result indicating whether the job succeeded or failed?

Any ideas of what other people are doing to accomplish this would be very helpful! Thanks!

I've had a similar requirement. Here's what I did:

  1. To submit apps, I use the hidden Spark REST Submission API: http://arturmkrtchyan.com/apache-spark-hidden-rest-api

  2. Using this same API you can also query the status of a driver or kill the job later

  3. There's also another hidden UI JSON API: http://[master-node]:[master-ui-port]/json/ which exposes all the information available on the master UI in JSON format.

Using "Submission API" I submit a driver and using the "Master UI API" I wait until my Driver and App state are RUNNING

The web server can also act as the Spark driver. So it would have a SparkContext instance and contain the code for working with RDDs.
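
A minimal sketch of that setup, assuming a standalone master at a placeholder URL; the class, app name, and runAnalysis method are illustrative only, and in practice the single context would be shared by all request handlers:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class WebServerDriver {

        // One long-lived context per web-server process; its executors stay allocated
        // for the lifetime of the server, so there is no per-request startup cost.
        private static final JavaSparkContext sc = new JavaSparkContext(
                new SparkConf()
                        .setAppName("WebServerDriver")            // placeholder app name
                        .setMaster("spark://spark-master:7077")); // placeholder master URL

        // Called from a request handler: runs a small analysis and returns the result.
        public long runAnalysis(List<Integer> ids) {
            JavaRDD<Integer> rdd = sc.parallelize(ids);
            rdd.cache(); // cached RDDs can be reused by later requests while the driver lives
            return rdd.filter(id -> id % 2 == 0).count();
        }

        public static void main(String[] args) {
            System.out.println(new WebServerDriver().runAnalysis(Arrays.asList(1, 2, 3, 4, 5)));
        }
    }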

The advantage of this is that the Spark executors are long-lived. You save time by not having to start/stop them all the time. You can cache RDDs between operations.

A disadvantage is that since the executors are running all the time, they take up memory that other processes in the cluster could otherwise use. Another is that you cannot have more than one instance of the web server, since you cannot have more than one SparkContext for the same Spark application.

We are using Spark Job-server and it works well with Java too; we just build a jar of the Java code and wrap it with Scala so it can be used with Spark Job-Server.
