
How can I connect to a PostgreSQL database from Apache Spark using Scala?

I want to know how I can do the following things in Scala:

  1. Connect to a PostgreSQL database from Spark using Scala.
  2. Write SQL queries like SELECT, UPDATE etc. to modify a table in that database.

I know how to do this in plain Scala, but how do I pull the PostgreSQL connector jar into sbt when packaging the project?

Our goal is to run parallel SQL queries from the Spark workers.

Build setup

Add the connector and JDBC to the libraryDependencies in build.sbt . I've only tried this with MySQL, so I'll use that in my examples, but Postgres should be much the same.

libraryDependencies ++= Seq(
  jdbc,
  "mysql" % "mysql-connector-java" % "5.1.29",
  "org.apache.spark" %% "spark-core" % "1.0.1",
  // etc
)
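
For PostgreSQL the dependency should be analogous; a minimal sketch, assuming the org.postgresql artifact (the version here is an assumption, pick one that matches your server):

libraryDependencies ++= Seq(
  jdbc,
  "org.postgresql" % "postgresql" % "9.4-1201-jdbc41",  // version is an assumption
  "org.apache.spark" %% "spark-core" % "1.0.1",
  // etc
)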

Code

When you create the SparkContext you tell it which jars to copy to the executors. Include the connector jar. A clean way to do this:

import org.apache.spark.SparkConf

val classes = Seq(
  getClass,                       // To get the jar with our own code.
  classOf[com.mysql.jdbc.Driver]  // To get the connector.
)
val jars = classes.map(_.getProtectionDomain().getCodeSource().getLocation().getPath())
val conf = new SparkConf().setJars(jars)
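
With PostgreSQL the same trick works, just with the Postgres driver class; a sketch, assuming the org.postgresql dependency above:

val classes = Seq(
  getClass,                       // To get the jar with our own code.
  classOf[org.postgresql.Driver]  // To get the PostgreSQL connector.
)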

Now Spark is ready to connect to the database. Each executor will run part of the query, so that the results are ready for distributed computation.

There are two options for this. The older approach is to use org.apache.spark.rdd.JdbcRDD :

val rdd = new org.apache.spark.rdd.JdbcRDD(
  sc,
  () => {
    // This runs on each worker to open its own connection.
    java.sql.DriverManager.getConnection("jdbc:mysql://mysql.example.com/?user=batman&password=alfred")
  },
  "SELECT * FROM BOOKS WHERE ? <= KEY AND KEY <= ?",
  0, 1000, 10,                       // key range and number of partitions
  row => row.getString("BOOK_TITLE") // ResultSet => String
)

Check out the documentation for the parameters. Briefly:

  • You have the SparkContext .
  • Then a function that creates the connection. This will be called on each worker to connect to the database.
  • Then the SQL query. It has to be similar to the example, and contain placeholders for the starting and ending key.
  • Then you specify the range of keys (0 to 1000 in my example) and the number of partitions. The range will be divided among the partitions. So one executor thread will end up executing SELECT * FROM BOOKS WHERE 0 <= KEY AND KEY <= 100 in the example.
  • And at last we have a function that converts the ResultSet into something. In the example we convert it into a String , so you end up with an RDD[String] .
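
The same JdbcRDD works against PostgreSQL with only the connection string changed; a minimal sketch, where the host, database, table and credentials are placeholders:

val rdd = new org.apache.spark.rdd.JdbcRDD(
  sc,
  () => {
    // postgres.example.com/library and the credentials are placeholders.
    java.sql.DriverManager.getConnection("jdbc:postgresql://postgres.example.com/library?user=batman&password=alfred")
  },
  "SELECT * FROM BOOKS WHERE ? <= KEY AND KEY <= ?",
  0, 1000, 10,                       // key range and number of partitions
  row => row.getString("BOOK_TITLE") // ResultSet => String
)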

Since Apache Spark version 1.3.0 another method is available through the DataFrame API. Instead of the JdbcRDD you would create an org.apache.spark.sql.DataFrame :

val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://mysql.example.com/?user=batman&password=alfred",
  "dbtable" -> "BOOKS"))

See https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases for the full list of options (the key range and number of partitions can be set just like with JdbcRDD ); a sketch follows.
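
A sketch of a partitioned PostgreSQL load using those options ( partitionColumn , lowerBound , upperBound and numPartitions are from the linked documentation; the URL and table are placeholders):

val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://postgres.example.com/library?user=batman&password=alfred",
  "dbtable" -> "BOOKS",
  "partitionColumn" -> "KEY",  // column used to split the table across partitions
  "lowerBound" -> "0",
  "upperBound" -> "1000",
  "numPartitions" -> "10"))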

Updates

JdbcRDD does not support updates. But you can simply do them in a foreachPartition .

rdd.foreachPartition { it =>
  // One connection per partition; this block runs on the executors.
  val conn = java.sql.DriverManager.getConnection("jdbc:mysql://mysql.example.com/?user=batman&password=alfred")
  val del = conn.prepareStatement("DELETE FROM BOOKS WHERE BOOK_TITLE = ?")
  for (bookTitle <- it) {
    del.setString(1, bookTitle)
    del.executeUpdate()
  }
  conn.close()
}

(This creates one connection per partition. If that is a concern, use a connection pool!)

DataFrame s support updates through the createJDBCTable and insertIntoJDBC methods, as sketched below.
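
A minimal sketch of those two methods, assuming the df from above and a hypothetical BOOKS_COPY target table (the URL is a placeholder):

val url = "jdbc:postgresql://postgres.example.com/library?user=batman&password=alfred"
// Create a new table from the DataFrame's contents (false = fail if BOOKS_COPY already exists):
df.createJDBCTable(url, "BOOKS_COPY", false)
// Or insert into an existing table (false = append rather than overwrite):
df.insertIntoJDBC(url, "BOOKS_COPY", false)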
