How to serialise elastic4s ElasticSearch Client to run with Spark RDD?

Currently I am running Spark MLlib ALS on millions of users and products. With the following code, the collect step takes more time than the recommendProductsForUsers step because of heavy shuffle to disk. So if I could somehow remove the collect step and feed the data directly from the executors to Elasticsearch, a lot of time and computing resources would be saved.

import com.sksamuel.elastic4s.ElasticClient
import com.sksamuel.elastic4s.ElasticDsl._
import org.elasticsearch.common.settings.ImmutableSettings

// Elasticsearch client created on the driver
val settings = ImmutableSettings.settingsBuilder().put("cluster.name", "MYCLUSTER").build()
val client = ElasticClient.remote(settings, "11.11.11.11", 9300)
var ESMap = Map[String, List[String]]()
val topKReco = bestModel.get
  // below step takes 3 hours
  .recommendProductsForUsers(30)
  // below step takes 6 hours
  .collect()
  .foreach { r =>
    var i = 1
    val curr_user = r._1
    r._2.foreach { r2 =>
      ESMap += i.toString -> List(r2.product.toString, item_ids(r2.product))
      i += 1
    }
    // index this user's recommendations from the driver
    client.execute {
      index into "recommendations1" / "items" id curr_user fields ESMap
    }.await
  }

So now when I run this code without the collect step, I get the following error:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:869)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:868)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.foreach(RDD.scala:868)
    at CatalogALS2$.main(CatalogALS2.scala:157)
    at CatalogALS2.main(CatalogALS2.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.io.NotSerializableException:      com.sksamuel.elastic4s.ElasticClient
Serialization stack:
    - object not serializable (class: com.sksamuel.elastic4s.ElasticClient,     value: com.sksamuel.elastic4s.ElasticClient@e4c4af)
    - field (class: CatalogALS2$$anonfun$2, name: client$1, type: class    com.sksamuel.elastic4s.ElasticClient)
    - object (class CatalogALS2$$anonfun$2, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)

So what I understand from this is: if I can somehow serialise the com.sksamuel.elastic4s.ElasticClient class, then I can run this task in parallel without collecting the data to the driver. Generalising the problem, how can I serialise any class or function in Scala so that it can be used inside operations on an RDD?

Found an answer for the same problem by using serialisation like this:

object ESConnection extends Serializable {

  // Elasticsearch client initialisation
  val settings = ImmutableSettings.settingsBuilder().put("cluster.name", "MyCluster").build()
  lazy val client = ElasticClient.remote(settings, "11.11.11.11", 9300)

}
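This works because ESConnection is a singleton object: the closure only needs a reference to it, and since client is a lazy val it is constructed on first use inside each executor's JVM rather than being serialised and shipped from the driver.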

Then you can use it over the RDD on the executors, without actually collecting the data to the driver:

val topKReco = bestModel.get
  .recommendProductsForUsers(30)
  // no collect required now
  .foreach { r =>
    var i = 1
    val curr_user = r._1
    // build the fields map fresh for each user
    var ESMap = Map[String, List[String]]()
    r._2.foreach { r2 =>
      ESMap += i.toString -> List(r2.product.toString, item_ids(r2.product))
      i += 1
    }
    ESConnection.client.execute {
      index into "recommendation1" / "items" id curr_user fields ESMap
    }.await
  }
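As a variant of the same idea, the connection can also be opened once per partition with foreachPartition instead of going through a shared object. This is only a minimal sketch, assuming the same elastic4s 0.90.x imports as above and the item_ids lookup already used in the code:

bestModel.get
  .recommendProductsForUsers(30)
  .foreachPartition { users =>
    // build the client inside the closure, so nothing non-serialisable is captured
    val settings = ImmutableSettings.settingsBuilder().put("cluster.name", "MyCluster").build()
    val client = ElasticClient.remote(settings, "11.11.11.11", 9300)
    users.foreach { case (user, recs) =>
      var i = 1
      var esMap = Map[String, List[String]]()
      recs.foreach { r2 =>
        esMap += i.toString -> List(r2.product.toString, item_ids(r2.product))
        i += 1
      }
      client.execute {
        index into "recommendation1" / "items" id user fields esMap
      }.await
    }
    // release the client once the partition is done
    client.close()
  }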

In continuation to Suraj's answer:

You should add the dependency below to the classpath in order to use the ElasticClient class:

// https://mvnrepository.com/artifact/com.sksamuel.elastic4s/elastic4s
libraryDependencies += "com.sksamuel.elastic4s" % "elastic4s" % "0.90.2.8"
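Note that elastic4s versions track the Elasticsearch version they are built against (the 0.90.2.x artifacts target an Elasticsearch 0.90.2 cluster), so the dependency version should match the cluster you connect to; for example, the ImmutableSettings class used above only exists in the older Elasticsearch client API.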
