How can I connect to a PostgreSQL database from Apache Spark using Scala?
I want to know how I can do the following in Scala. I know how to write the code itself, but how do I add the PostgreSQL connector jar to sbt so that it is included when packaging? Our goal is to run parallel SQL queries from the Spark workers.
Add the connector and JDBC to the libraryDependencies in build.sbt. I've only tried this with MySQL, so I'll use that in my examples, but Postgres should be much the same.
libraryDependencies ++= Seq(
  jdbc,
  "mysql" % "mysql-connector-java" % "5.1.29",
  "org.apache.spark" %% "spark-core" % "1.0.1",
  // etc.
)
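Since the question is about PostgreSQL: the analogous dependency would use the official Postgres JDBC driver artifact instead of the MySQL connector. A sketch (the version string is an assumption; check Maven Central for one matching your server and JDK):

```scala
// build.sbt fragment -- swap the MySQL connector for the Postgres driver.
// The driver version here is illustrative, not a recommendation.
libraryDependencies ++= Seq(
  jdbc,
  "org.postgresql" % "postgresql" % "9.4-1201-jdbc41",
  "org.apache.spark" %% "spark-core" % "1.0.1"
)
```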
When you create the SparkContext you tell it which jars to copy to the executors. Include the connector jar. A good-looking way to do this:
val classes = Seq(
  getClass,                      // To get the jar with our own code.
  classOf[com.mysql.jdbc.Driver] // To get the connector.
)
val jars = classes.map(_.getProtectionDomain().getCodeSource().getLocation().getPath())
val conf = new SparkConf().setJars(jars)
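For Postgres the same trick applies; the driver class in the official driver is org.postgresql.Driver. A sketch, assuming the Postgres dependency above is on the classpath:

```scala
// Same jar-locating trick, pointed at the Postgres JDBC driver.
val classes = Seq(
  getClass,                      // To get the jar with our own code.
  classOf[org.postgresql.Driver] // To get the Postgres connector.
)
val jars = classes.map(_.getProtectionDomain().getCodeSource().getLocation().getPath())
val conf = new SparkConf().setJars(jars)
```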
Now Spark is ready to connect to the database. Each executor will run part of the query, so that the results are ready for distributed computation.
There are two options for this. The older approach is to use org.apache.spark.rdd.JdbcRDD:
val rdd = new org.apache.spark.rdd.JdbcRDD(
  sc,
  () =>
    java.sql.DriverManager.getConnection(
      "jdbc:mysql://mysql.example.com/?user=batman&password=alfred"),
  "SELECT * FROM BOOKS WHERE ? <= KEY AND KEY <= ?",
  0, 1000, 10,
  row => row.getString("BOOK_TITLE")
)
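To see how the three numeric arguments (0, 1000, 10) drive the parallelism, here is a sketch of how JdbcRDD splits the key range into per-partition bounds that get bound to the two ? placeholders. This is a simplified illustration of the partitioning logic, not Spark's exact code:

```scala
// Split the inclusive range [lower, upper] into `parts` contiguous
// key ranges, one per partition, the way JdbcRDD assigns each
// partition its own slice of the query.
def keyRanges(lower: Long, upper: Long, parts: Int): Seq[(Long, Long)] = {
  val length = upper - lower + 1
  (0 until parts).map { i =>
    val start = lower + (i * length) / parts
    val end = lower + ((i + 1) * length) / parts - 1
    (start, end)
  }
}

// keyRanges(0, 1000, 10) yields (0,99), (100,199), ..., (900,1000):
// each executor runs the query with its own pair of bounds.
```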
Check out the documentation for the parameters. Briefly:

- The SparkContext you have.
- A function that creates the connection. (In the example it opens a java.sql connection via DriverManager.)
- The query, with two ? placeholders for the key range. With the placeholders filled in it becomes something like SELECT * FROM FOO WHERE 0 <= KEY AND KEY <= 100.
- The lowest key, the highest key, and the number of partitions to split the key range into.
- A function that converts a row of the ResultSet into something. In the example we convert it into a String, so you end up with an RDD[String].

Since Apache Spark version 1.3.0 another method is available through the DataFrame API. Instead of the JdbcRDD you would create an org.apache.spark.sql.DataFrame:
val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://mysql.example.com/?user=batman&password=alfred",
  "dbtable" -> "BOOKS"))
See https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases for the full list of options (the key range and number of partitions can be set just like with JdbcRDD).
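A sketch of that partitioned load, using the option names from the linked guide; the column and bounds mirror the JdbcRDD example above, so adapt them to your own table:

```scala
// Partitioned JDBC load with the DataFrame API (Spark 1.3 syntax).
// partitionColumn/lowerBound/upperBound/numPartitions play the same
// role as KEY, 0, 1000 and 10 did in the JdbcRDD example.
val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://mysql.example.com/?user=batman&password=alfred",
  "dbtable" -> "BOOKS",
  "partitionColumn" -> "KEY",
  "lowerBound" -> "0",
  "upperBound" -> "1000",
  "numPartitions" -> "10"))
```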
JdbcRDD does not support updates. But you can simply do them in a foreachPartition.
rdd.foreachPartition { it =>
  val conn = java.sql.DriverManager.getConnection(
    "jdbc:mysql://mysql.example.com/?user=batman&password=alfred")
  val del = conn.prepareStatement("DELETE FROM BOOKS WHERE BOOK_TITLE = ?")
  for (bookTitle <- it) {
    del.setString(1, bookTitle)
    del.executeUpdate()
  }
  conn.close()
}
(This creates one connection per partition. If that is a concern, use a connection pool!)
DataFrames support updates through the createJDBCTable and insertIntoJDBC methods.
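A sketch of those two calls as they looked in the Spark 1.3 API (both were later deprecated in favour of the DataFrameWriter, and the BOOKS_BACKUP table name is made up for the example):

```scala
// Writing a DataFrame back over JDBC (Spark 1.3 API).
val url = "jdbc:mysql://mysql.example.com/?user=batman&password=alfred"
df.createJDBCTable(url, "BOOKS_BACKUP", false) // create the table; false = fail if it already exists
df.insertIntoJDBC(url, "BOOKS_BACKUP", false)  // append rows; true would overwrite the table contents
```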