
Spark JDBC reuse connection

In my Spark application, I use the following code to retrieve data from a SQL Server database using the JDBC driver.

 Dataset<Row> dfResult = sparksession.read().jdbc(
         "jdbc:sqlserver://server\\dbname", tableName,
         partitionColumn, lowerBound, upperBound, numberOfPartitions, properties);

and then apply a map operation on the dfResult dataset.
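
For context, the map step is just a row-level transformation along these lines (the column index and the string type here are only illustrative, not my actual logic):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Illustrative only: map each row of dfResult to one of its string columns.
Dataset<String> mapped = dfResult.map(
        (MapFunction<Row, String>) row -> row.getString(0),
        Encoders.STRING());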

While running the application in standalone mode, I see that Spark creates a separate connection for each RDD partition. From the API description, I understand that Spark takes care of closing the connection.

May I know whether there is a way to reuse the connection, instead of opening and closing a JDBC connection for each RDD partition?

Thanks

Even when you're pushing data into a database manually over an API, the recommendation I often see is to create one connection per partition:

// pseudo-code: open one connection per partition, on whichever executor runs it
rdd.foreachPartition { iterator =>
  val connection = SomeAPI.connect()   // placeholder client/driver
  try {
    iterator.foreach(record => connection.insert(record))
  } finally {
    connection.close()                 // release the connection once the partition is done
  }
}

So if the jdbc method is already doing that, that would seem to confirm that this is the intended pattern.
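
Applied to the Java API used in the question, a minimal sketch of the same pattern could look like this (the connection URL, credentials, table, and column positions are placeholders, not taken from the question):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class JdbcPerPartitionWrite {
    public static void write(Dataset<Row> dfResult) {
        dfResult.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
            // One connection per partition, opened on the executor that evaluates it.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:sqlserver://server;databaseName=db", "user", "password");
                 PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO target_table (id, name) VALUES (?, ?)")) {
                while (rows.hasNext()) {
                    Row row = rows.next();
                    stmt.setInt(1, row.getInt(0));       // assumes column 0 is an int id
                    stmt.setString(2, row.getString(1)); // assumes column 1 is a string
                    stmt.addBatch();
                }
                stmt.executeBatch();
            } // try-with-resources closes the statement and connection here
        });
    }
}

The important detail is that the connection is created inside the partition function, so it is opened on whichever executor evaluates that partition and released as soon as the iterator is exhausted.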

Here's another example of this pattern being recommended:


http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs (slide 27)

I presume this is the recommended pattern because, when you're working in a multi-node cluster, you never know on which node a particular partition will be evaluated, and thus you want the code that processes it to open its own DB connection on that node.
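
A driver-side connection can't simply be captured in the closure and shipped along, because JDBC connections aren't serializable; the code that touches the database has to run on the executor. A quick way to see that partitions really do land on arbitrary hosts is a sketch like this (hostname logging only, no real work):

import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class WherePartitionsRun {
    public static void show(Dataset<Row> dfResult) {
        dfResult.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
            // Printed on the executor, not the driver: each partition may be
            // evaluated on a different host, so any DB connection it needs
            // must be opened right here.
            String host = java.net.InetAddress.getLocalHost().getHostName();
            long count = 0;
            while (rows.hasNext()) { rows.next(); count++; }
            System.out.println("Partition with " + count + " rows processed on " + host);
        });
    }
}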
