
Where to define the Object to Broadcast in Spark Java

I have a database object which is used to insert data from all Spark executors. When I define this object as static, it has a null value in those executors. So I declare it in the driver, broadcast it, and then get its value in each executor. When I run the application, the following exception is thrown:

Exception in thread "main" java.io.NotSerializableException: database.Database

Notes:

  • The executor class is Serializable
  • The broadcast object is defined as transient in that class
  • I removed the transient modifier, but that didn't work either

I interpret your question this way:

I want to insert data from my RDD from all Spark executors. I tried to create one DB connection on the driver and pass it somehow as a Broadcast to the executors, but Spark keeps throwing NotSerializableException. How can I achieve my goal?

The short answer is:

You should create a new connection on every executor node separately.
You should not pass database connection handles, file handles and the like to other processes, and especially not to remote machines.
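
To see why the attempts described in the question fail, here is a minimal sketch that uses only plain Java serialization and no Spark. It roughly mimics what happens when a task object is shipped to another JVM. The class names (SerializationDemo, Database, TaskWithField, TaskWithTransientField) are hypothetical stand-ins, not the asker's actual code. A non-serializable field makes serialization fail, while a transient field is skipped and comes back as null on the receiving side. Static fields are not part of the serialized state either, which is consistent with the null value observed on the executors.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationDemo {

    // Hypothetical stand-in for database.Database: it represents a live
    // resource and does not implement Serializable.
    static class Database {
        void insert(String row) { /* would write to the DB */ }
    }

    // Serializable task that carries the Database as an ordinary field.
    static class TaskWithField implements Serializable {
        Database db = new Database();
    }

    // Same task, but the field is transient, so serialization skips it.
    static class TaskWithTransientField implements Serializable {
        transient Database db = new Database();
    }

    // Serialize and deserialize an object in memory.
    static Object roundTrip(Object o) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    // True if serializing the object fails with NotSerializableException.
    static boolean failsToSerialize(Object o) throws Exception {
        try {
            roundTrip(o);
            return false;
        } catch (NotSerializableException e) {
            return true;
        }
    }

    // True if the transient field is null in the deserialized copy.
    static boolean transientFieldIsNullAfterRoundTrip() throws Exception {
        TaskWithTransientField copy =
            (TaskWithTransientField) roundTrip(new TaskWithTransientField());
        return copy.db == null;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("ordinary field fails to serialize: "
            + failsToSerialize(new TaskWithField()));
        System.out.println("transient field is null after round trip: "
            + transientFieldIsNullAfterRoundTrip());
    }
}
```

Both checks print true. Neither modifier gets a live connection onto another machine, which is why the connection has to be created where it is used.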

The problem here is where exactly to create the database connections, because with a large number of executors one can easily exceed the connection pool size of the DB.

What you can actually do is use foreachPartition, like here:

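  // numPartitions == number of simultaneous DB connections you can afford
  yourRdd.repartition(numPartitions)
    .foreachPartition { iter =>
      // Runs on the executor, so the connection is created there and never serialized.
      // createConnection() is a placeholder for your own connection factory.
      val connection = createConnection()
      try {
        while (iter.hasNext) {
          val record = iter.next() // advance the iterator, otherwise the loop never terminates
          connection.execute("INSERT ...") // build the statement from record
        }
        connection.commit()
      } finally {
        connection.close() // release the connection even if an insert fails
      }
    }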

Here the code inside .foreachPartition is executed on the executors, once per partition. The connection objects are created there and are never sent over the network, so you won't get serialization exceptions and the data will be inserted.
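
Since the question is about the Java API, here is a rough Java sketch of the same per-partition pattern. It is a simulation in plain Java rather than a Spark job: FakeConnection and the in-memory "partitions" are hypothetical stand-ins so the example runs without a cluster or a database. In a real job, the body of insertPartition would be passed to JavaRDD.foreachPartition, which hands your function a java.util.Iterator over the partition's records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class PerPartitionInsertDemo {

    // Hypothetical stand-in for a real DB connection; it only records what happened.
    static class FakeConnection implements AutoCloseable {
        static int opened = 0;                                   // connections created so far
        static final List<String> committed = new ArrayList<>(); // rows that were committed
        private final List<String> pending = new ArrayList<>();

        FakeConnection() { opened++; }

        void insert(String row) { pending.add(row); }

        void commit() {
            committed.addAll(pending);
            pending.clear();
        }

        @Override
        public void close() { pending.clear(); } // uncommitted rows are dropped
    }

    // The per-partition body: one connection for all rows of the partition.
    static void insertPartition(Iterator<String> rows) {
        try (FakeConnection connection = new FakeConnection()) {
            while (rows.hasNext()) {
                connection.insert(rows.next()); // next() advances the iterator
            }
            connection.commit();
        }
    }

    public static void main(String[] args) {
        // Three in-memory "partitions", standing in for yourRdd.repartition(3).
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "b"),
            Arrays.asList("c"),
            Arrays.asList("d", "e", "f"));

        for (List<String> partition : partitions) {
            insertPartition(partition.iterator());
        }

        System.out.println(FakeConnection.opened + " connections, "
            + FakeConnection.committed.size() + " rows");
    }
}
```

Running main prints "3 connections, 6 rows": the number of connections follows the number of partitions, not the number of records, which is what makes repartition(numPartitions) a usable knob for limiting load on the database.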

The same reasoning about the use of foreachPartition is also mentioned in the answers to this question.
