
Using pyspark to connect to PostgreSQL

I am trying to connect to a database with pyspark and I am using the following code:

sqlctx = SQLContext(sc)
df = sqlctx.load(
    url = "jdbc:postgresql://[hostname]/[database]",
    dbtable = "(SELECT * FROM talent LIMIT 1000) as blah",
    password = "MichaelJordan",
    user =  "ScottyPippen",
    source = "jdbc",
    driver = "org.postgresql.Driver"
)

and I am getting the following error:

[error screenshot]

Any idea why this is happening?

Edit: I am trying to run the code locally on my computer.

Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html

Then replace the database configuration values with your own.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/databasename") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.printSchema()
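The `url` option above follows the standard JDBC form `jdbc:postgresql://host:port/database` (the port defaults to 5432). As a quick illustration in plain Python — the helper name is hypothetical, not part of any Spark API:

```python
def postgres_jdbc_url(host, database, port=5432):
    # Assemble the JDBC URL string expected by the "url" option above.
    return "jdbc:postgresql://{}:{}/{}".format(host, port, database)

print(postgres_jdbc_url("localhost", "databasename"))
# jdbc:postgresql://localhost:5432/databasename
```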

More info: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

The following worked for me with Postgres on localhost:

Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html .

For the pyspark shell, use the SPARK_CLASSPATH environment variable:

$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark

To submit a script via spark-submit, use the --driver-class-path flag:

$ spark-submit --driver-class-path /path/to/downloaded/jar script.py

In the Python script, load the table as a DataFrame as follows:

from pyspark.sql import DataFrameReader

url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tablename', properties=properties
)

or alternatively:

df = sqlContext.read.format('jdbc').\
    options(url='jdbc:%s' % url, dbtable='tablename').\
    load()

Note that when submitting the script via spark-submit, you need to define the sqlContext yourself.

It is necessary to copy postgresql-42.1.4.jar to all nodes; in my case, I copied it to /opt/spark-2.2.0-bin-hadoop2.7/jars.

I also set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"),

and it works fine in the pyspark console and Jupyter.

You normally need one of the following:

  1. to install the Postgres driver on your cluster,
  2. to provide the Postgres driver jar from your client with the --jars option,
  3. or to provide the Maven coordinates of the Postgres driver with the --packages option.
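Options 2 and 3 can be sketched as follows (the paths and driver version are assumptions; substitute your own):

```shell
# Option 2: ship a local driver jar with the job
spark-submit --jars /path/to/postgresql-42.2.5.jar script.py

# Option 3: let Spark resolve the driver from Maven Central
spark-submit --packages org.postgresql:postgresql:42.2.5 script.py
```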

If you detail how you are launching pyspark, we may be able to give you more specific advice.

Some clues/ideas:

spark-cannot-find-the-postgres-jdbc-driver

Not able to connect to postgres using jdbc in pyspark shell

One approach, building on the example in the quick start guide, is this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.

This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command:

/usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit\
 --packages org.postgresql:postgresql:9.4.1211\
 --driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar\
 --master local[4] main.py

And in main.py:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dataframe = spark.read.format('jdbc').options(
        url = "jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
        database='my_db',
        dbtable='my_table'
    ).load()

dataframe.show()

To use pyspark with a Jupyter notebook, first open pyspark with:

pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar  --jars /spark_drivers/postgresql-42.2.12.jar

Then, in the Jupyter notebook:

import os

from pyspark.sql import SparkSession

# Expand "~" explicitly: the JVM does not expand it in classpath entries.
jardrv = os.path.expanduser("~/spark_drivers/postgresql-42.2.12.jar")

spark = SparkSession.builder.config('spark.driver.extraClassPath', jardrv).getOrCreate()
url = 'jdbc:postgresql://127.0.0.1/dbname'
properties = {'user': 'usr', 'password': 'pswd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)

I had trouble connecting to the Postgres database with the jars I had on my computer. This code solved my problem with the driver:

from pyspark.sql import SparkSession
import os

# Set PYSPARK_SUBMIT_ARGS before the session is created so Spark fetches
# the driver via --packages; the string must end with "pyspark-shell".
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'

spark = SparkSession \
    .builder \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/yourDBname") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "yourtablename") \
    .option("user", "postgres") \
    .option("password", "***") \
    .load()

df.show()

This exception means the JDBC driver is not on the driver classpath. You can pass JDBC jars to spark-submit with the --jars parameter, or add them to the driver classpath using spark.driver.extraClassPath.

I also got this error:

java.sql.SQLException: No suitable driver
 at java.sql.DriverManager.getDriver(Unknown Source)

Adding one item, .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar'), to the SparkSession builder worked.

For example:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('Python Spark Postgresql') \
    .config("spark.jars", "./postgresql-42.2.18.jar") \
    .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') \
    .getOrCreate()


df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/abc") \
    .option("dbtable", 'tablename') \
    .option("user", "postgres") \
    .option("password", "1") \
    .load()

df.printSchema()

Just initialize pyspark with --jars <path/to/your/jdbc.jar>

For example: pyspark --jars /path/Downloads/postgresql-42.2.16.jar

then create a DataFrame as suggested in the other answers above.

For example:

df2 = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/db") \
    .option("dbtable", "yourTableHere") \
    .option("user", "postgres") \
    .option("password", "postgres") \
    .option("driver", "org.postgresql.Driver") \
    .load()
  1. Download the postgresql jar from here:
  2. Add it to the ~Spark/jars/ folder.
  3. Restart your kernel. It should work.
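Steps 1 and 2 above can be sketched as shell commands (the download URL and the $SPARK_HOME layout are assumptions; adjust the version and paths to your setup):

```shell
# 1. Download the PostgreSQL JDBC driver jar
wget https://jdbc.postgresql.org/download/postgresql-42.2.18.jar

# 2. Copy it into Spark's jars folder
cp postgresql-42.2.18.jar "$SPARK_HOME/jars/"
```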
