
Spark JDBC reuse connection

In my Spark application, I use the following code to retrieve data from a SQL Server database using the JDBC driver.

 Dataset<Row> dfResult = sparksession.read().jdbc(
         "jdbc:sqlserver://server\\dbname", tableName,
         partitionColumn, lowerBound, upperBound, numberOfPartitions, properties);

and then apply a map operation on the dfResult dataset.
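
For context, the map step is just a row-level transformation along these lines (the column index and the string type here are only illustrative, not my actual logic):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Illustrative only: map each row of dfResult to one of its string columns.
Dataset<String> mapped = dfResult.map(
        (MapFunction<Row, String>) row -> row.getString(0),
        Encoders.STRING());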

While running the application in standalone mode, I see that Spark creates a separate connection for each RDD partition. From the API description, I understand that Spark takes care of closing the connection.

May I know whether there is a way to reuse the connection, instead of opening and closing a JDBC connection for each RDD partition?

Thanks

Even when you're pushing data into a database manually over an API, the recommendation I often see is to create one connection per partition:

// pseudo-code: open one connection per partition, on whichever executor runs it
rdd.foreachPartition { iterator =>
  val connection = SomeAPI.connect()   // placeholder client/driver
  try {
    iterator.foreach(record => connection.insert(record))
  } finally {
    connection.close()                 // release the connection once the partition is done
  }
}

So if the jdbc method is already doing that, that would seem to confirm that this is the intended pattern.
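
Applied to the Java API used in the question, a minimal sketch of the same pattern could look like this (the connection URL, credentials, table, and column positions are placeholders, not taken from the question):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class JdbcPerPartitionWrite {
    public static void write(Dataset<Row> dfResult) {
        dfResult.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
            // One connection per partition, opened on the executor that evaluates it.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:sqlserver://server;databaseName=db", "user", "password");
                 PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO target_table (id, name) VALUES (?, ?)")) {
                while (rows.hasNext()) {
                    Row row = rows.next();
                    stmt.setInt(1, row.getInt(0));       // assumes column 0 is an int id
                    stmt.setString(2, row.getString(1)); // assumes column 1 is a string
                    stmt.addBatch();
                }
                stmt.executeBatch();
            } // try-with-resources closes the statement and connection here
        });
    }
}

The important detail is that the connection is created inside the partition function, so it is opened on whichever executor evaluates that partition and released as soon as the iterator is exhausted.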

Here's another example of this pattern being recommended:


http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs (slide 27)

I presume this is the recommended pattern because, when you're working in a multi-node cluster, you never know on which node a particular partition will be evaluated, and thus you want the code that processes it to open its own DB connection on that node.
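
A driver-side connection can't simply be captured in the closure and shipped along, because JDBC connections aren't serializable; the code that touches the database has to run on the executor. A quick way to see that partitions really do land on arbitrary hosts is a sketch like this (hostname logging only, no real work):

import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class WherePartitionsRun {
    public static void show(Dataset<Row> dfResult) {
        dfResult.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
            // Printed on the executor, not the driver: each partition may be
            // evaluated on a different host, so any DB connection it needs
            // must be opened right here.
            String host = java.net.InetAddress.getLocalHost().getHostName();
            long count = 0;
            while (rows.hasNext()) { rows.next(); count++; }
            System.out.println("Partition with " + count + " rows processed on " + host);
        });
    }
}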
