
How to implement SparkContext in Play for Scala

I have the following Play for Scala controller that wraps Spark. At the end of the method I close the context to avoid the problem of having more than one context active in the same JVM:

import org.apache.spark.{SparkConf, SparkContext}
import play.api.libs.concurrent.Execution.Implicits.defaultContext
import play.api.mvc._
import scala.concurrent.Future

class Test4 extends Controller {

    def test4 = Action.async { request =>

        val conf = new SparkConf().setAppName("AppTest").setMaster("local[2]").
                                   set("spark.executor.memory", "1g")

        val sc = new SparkContext(conf)

        val rawData = sc.textFile("c:\\spark\\data.csv")

        val data = rawData.map(line => line.split(',').map(_.toDouble))

        val str = "count: " + data.count()

        // SparkContext is shut down with stop()
        sc.stop()

        Future { Ok(str) }
    }
}

The problem I have is that I don't know how to make this code multi-threaded, as two users may access the same controller method at the same time.

UPDATE

What I'm thinking is to have N Scala programs receive messages through JMS (using ActiveMQ). Each Scala program would have a Spark session and receive messages from Play. The Scala programs would process requests sequentially as they read the queues. Does this make sense? Are there any other best practices for integrating Play and Spark?
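For reference, a rough sketch of what one of those N worker programs might look like, using the plain JMS API with ActiveMQ. The broker URL, the queue name "spark-requests" and the convention that each message carries a CSV path are assumptions made up for illustration:

import javax.jms.{Message, MessageListener, Session, TextMessage}
import org.apache.activemq.ActiveMQConnectionFactory
import org.apache.spark.sql.SparkSession

object SparkWorker {
    def main(args: Array[String]): Unit = {
        // One long-lived Spark session per worker process
        val spark = SparkSession.builder()
            .appName("AppTestWorker")
            .master("local[2]")
            .config("spark.executor.memory", "1g")
            .getOrCreate()

        // Consume requests sequentially from an ActiveMQ queue (assumed broker URL and queue name)
        val connection = new ActiveMQConnectionFactory("tcp://localhost:61616").createConnection()
        connection.start()
        val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
        val consumer = session.createConsumer(session.createQueue("spark-requests"))

        consumer.setMessageListener(new MessageListener {
            override def onMessage(message: Message): Unit = message match {
                case text: TextMessage =>
                    // Assumed protocol: the message body is the path of the CSV file to count
                    val data = spark.sparkContext.textFile(text.getText)
                        .map(_.split(',').map(_.toDouble))
                    println("count: " + data.count())
                case _ => // ignore non-text messages
            }
        })

        // Keep the process alive so the listener keeps consuming
        Thread.currentThread().join()
    }
}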

It's better to just move the Spark context into a separate object:

import org.apache.spark.{SparkConf, SparkContext}

// Single shared context, created once when the object is first referenced
object SparkContext {
    val conf = new SparkConf().setAppName("AppTest").setMaster("local[2]").
                               set("spark.executor.memory", "1g")

    val sc = new SparkContext(conf)
}

Otherwise, with your design, a new Spark context is created for every request and a new JVM is started for each new Spark context.
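A minimal sketch of how the controller could then reuse that shared object (the defaultContext import assumes an older Play version, matching the `extends Controller` style above):

import play.api.libs.concurrent.Execution.Implicits.defaultContext
import play.api.mvc._
import scala.concurrent.Future

class Test4 extends Controller {

    def test4 = Action.async { request =>
        Future {
            // Reuse the single shared context instead of creating one per request
            val rawData = SparkContext.sc.textFile("c:\\spark\\data.csv")
            val data = rawData.map(line => line.split(',').map(_.toDouble))
            Ok("count: " + data.count())
        }
    }
}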

As for best practices, it is really not a good idea to run Spark inside a Play project. A better approach is to create a microservice that hosts the Spark application and have the Play application call this microservice. This type of architecture is more flexible, scalable and robust.
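A rough sketch of the Play side of that split, assuming a hypothetical Spark service exposing an HTTP endpoint at http://spark-service:9000/count (WSClient is Play's standard HTTP client; the URL and query parameter are made up for illustration):

import javax.inject.Inject
import play.api.libs.ws.WSClient
import play.api.mvc._
import scala.concurrent.ExecutionContext

class CountController @Inject()(ws: WSClient)(implicit ec: ExecutionContext) extends Controller {

    // Delegate the heavy Spark work to the separate microservice
    def count = Action.async {
        ws.url("http://spark-service:9000/count")
            .withQueryString("path" -> "c:\\spark\\data.csv")
            .get()
            .map(response => Ok(response.body))
    }
}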

I don't think it is a good idea to execute Spark jobs from a REST API. If you just want to parallelize within your local JVM, it doesn't make sense to use Spark, since it is designed for distributed computing. Spark is also not designed to be an operational database, and it won't scale well when you execute several concurrent queries in the same cluster.

Anyway, if you still want to execute concurrent Spark queries from the same JVM, you should probably use client mode to run the queries in an external cluster. It is not possible to launch more than one session per JVM, so I would suggest that you share the session in your service and close it only when you are shutting the service down.
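One way to share a single session across the whole Play application and close it only on shutdown is a singleton provider registered with ApplicationLifecycle; a minimal sketch (the class name is illustrative):

import javax.inject.{Inject, Singleton}
import org.apache.spark.sql.SparkSession
import play.api.inject.ApplicationLifecycle
import scala.concurrent.Future

@Singleton
class SparkSessionProvider @Inject()(lifecycle: ApplicationLifecycle) {

    // One session shared by every controller that injects this provider
    val spark: SparkSession = SparkSession.builder()
        .appName("AppTest")
        .master("local[2]")
        .config("spark.executor.memory", "1g")
        .getOrCreate()

    // Stop the session only when the Play application itself shuts down
    lifecycle.addStopHook { () =>
        Future.successful(spark.stop())
    }
}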
