Using pyspark to connect to PostgreSQL
I am trying to connect to a database with pyspark and I am using the following code:
sqlctx = SQLContext(sc)
df = sqlctx.load(
url = "jdbc:postgresql://[hostname]/[database]",
dbtable = "(SELECT * FROM talent LIMIT 1000) as blah",
password = "MichaelJordan",
user = "ScottyPippen",
source = "jdbc",
driver = "org.postgresql.Driver"
)
and I am getting the following error:
Any idea why this is happening?
Edit: I am trying to run the code locally on my computer.
Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html
Then replace the database configuration values with your own.
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/databasename") \
.option("dbtable", "tablename") \
.option("user", "username") \
.option("password", "password") \
.option("driver", "org.postgresql.Driver") \
.load()
df.printSchema()
More info: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
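As a small sketch of the same idea, the connection settings can be collected in one dict and passed to the reader in a single call. The helper names `postgres_jdbc_options` and `read_table` are made up for illustration, and the host, database and credentials are the placeholders from the answer above, not real values.

```python
def postgres_jdbc_options(host, port, database, table, user, password):
    """Collect the JDBC options used in the answer above into one dict."""
    return {
        "url": "jdbc:postgresql://%s:%d/%s" % (host, port, database),
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def read_table(spark, options):
    """Load the table as a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Placeholder values, replace them with your own configuration
opts = postgres_jdbc_options(
    "localhost", 5432, "databasename", "tablename", "username", "password"
)
print(opts["url"])  # jdbc:postgresql://localhost:5432/databasename
```

With a session created as shown above, `read_table(spark, opts).printSchema()` should behave like the chained `.option(...)` calls. The `dbtable` value can also be a parenthesized subquery with an alias, as in the question: "(SELECT * FROM talent LIMIT 1000) as blah".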
The following worked for me with postgres on localhost:
Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.
For the pyspark shell you use the SPARK_CLASSPATH environment variable:
$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark
For submitting a script via spark-submit use the --driver-class-path flag:
$ spark-submit --driver-class-path /path/to/downloaded/jar script.py
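If you launch the job from another Python program rather than from a terminal, the same command can be assembled as an argument list. This is only a sketch: `spark_submit_command` and `run_script` are hypothetical helper names, the jar path is the placeholder from the command above, and it assumes `spark-submit` is on your PATH.

```python
import os
import subprocess


def spark_submit_command(jar_path, script, spark_submit="spark-submit"):
    """Return the argv list that puts the JDBC jar on the driver classpath."""
    return [spark_submit, "--driver-class-path", jar_path, script]


def run_script(jar_path, script):
    """Fail early if the jar is missing, then run spark-submit."""
    if not os.path.isfile(jar_path):
        raise FileNotFoundError("JDBC driver jar not found: %s" % jar_path)
    return subprocess.run(spark_submit_command(jar_path, script), check=True)


# Placeholder path, as in the shell command above
cmd = spark_submit_command("/path/to/downloaded/jar", "script.py")
print(" ".join(cmd))
```

Checking that the jar exists first gives a clearer message than the driver error you would otherwise only see once the job tries to connect.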
In the python script load the tables as a DataFrame as follows:
from pyspark.sql import DataFrameReader
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(
url='jdbc:%s' % url, table='tablename', properties=properties
)
or alternatively:
df = sqlContext.read.format('jdbc').\
options(url='jdbc:%s' % url, dbtable='tablename').\
load()
Note that when submitting the script via spark-submit, you need to define the sqlContext.
It is necessary to copy postgresql-42.1.4.jar to all nodes. In my case, I copied it to /opt/spark-2.2.0-bin-hadoop2.7/jars
I also set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars")
and it works fine in the pyspark console and in jupyter.
You normally need either:
If you detail how you are launching pyspark, we may be able to give you more details.
Some clues/ideas:
spark-cannot-find-the-postgres-jdbc-driver
Not able to connect to postgres using jdbc in pyspark shell
One approach, building on the example in the quick start guide, is this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.
This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:
/usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit \
--packages org.postgresql:postgresql:9.4.1211 \
--driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar \
--master local[4] main.py
And in main.py:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
dataframe = spark.read.format('jdbc').options(
url = "jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
database='my_db',
dbtable='my_table'
).load()
dataframe.show()
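To avoid hard-coding the downloaded jar's location, the path can be derived from the package coordinate. This sketch assumes the file is named group_artifact-version.jar under ~/.ivy2/jars, which is inferred only from the single path shown above; the cache location may be different on your machine (for example if spark.jars.ivy is set). `ivy_jar_path` is a hypothetical helper name.

```python
import os


def ivy_jar_path(coordinate, ivy_home="~/.ivy2"):
    """Guess where --packages stored the jar for 'group:artifact:version'."""
    group, artifact, version = coordinate.split(":")
    filename = "%s_%s-%s.jar" % (group, artifact, version)
    return os.path.join(os.path.expanduser(ivy_home), "jars", filename)


path = ivy_jar_path("org.postgresql:postgresql:9.4.1211")
print(os.path.basename(path))  # org.postgresql_postgresql-9.4.1211.jar
```

The returned path is what you would pass to --driver-class-path. It is worth checking with os.path.isfile(path) that the file really exists before relying on it.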
To use pyspark with a jupyter notebook, first open pyspark with:
pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar --jars /spark_drivers/postgresql-42.2.12.jar
Then in the jupyter notebook:
import os
jardrv = os.path.expanduser("~/spark_drivers/postgresql-42.2.12.jar")  # the JVM does not expand "~" itself
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', jardrv).getOrCreate()
url = 'jdbc:postgresql://127.0.0.1/dbname'
properties = {'user': 'usr', 'password': 'pswd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
I had trouble getting a connection to the Postgres DB with the jars I had on my computer. This code solved my problem with the driver:
from pyspark.sql import SparkSession
import os
# PYSPARK_SUBMIT_ARGS has to be set before the SparkSession (and its JVM) is created
sparkClassPath = os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'
spark = SparkSession \
.builder \
.config("spark.driver.extraClassPath", sparkClassPath) \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/yourDBname") \
.option("driver", "org.postgresql.Driver") \
.option("dbtable", "yourtablename") \
.option("user", "postgres") \
.option("password", "***") \
.load()
df.show()
This exception means the JDBC driver is not in the driver classpath. You can pass the JDBC jars to spark-submit with the --jars parameter, and also add them to the driver classpath using spark.driver.extraClassPath.
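Before changing the classpath settings, it can help to confirm that the jar you are pointing at really contains the driver class. A jar is a zip archive, so a quick check is possible with the standard library. This is only a sanity check on the file: it does not tell you whether Spark actually has the jar on the driver classpath. `jar_contains_class` is a hypothetical helper name.

```python
import zipfile


def jar_contains_class(jar_path, class_name="org.postgresql.Driver"):
    """Return True if the jar has a .class entry for the given class name."""
    # org.postgresql.Driver is stored as org/postgresql/Driver.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, `jar_contains_class("/path/to/downloaded/jar")` should return True for a valid PostgreSQL driver jar (the path here is a placeholder).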
I also got this error:
java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(Unknown Source)
and adding one item, .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar'), to the SparkSession builder worked.
e.g.:
from pyspark import SparkContext, SparkConf
import os
from pyspark.sql.session import SparkSession
spark = SparkSession \
.builder \
.appName('Python Spark Postgresql') \
.config("spark.jars", "./postgresql-42.2.18.jar") \
.config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/abc") \
.option("dbtable", 'tablename') \
.option("user", "postgres") \
.option("password", "1") \
.load()
df.printSchema()
Just initialize pyspark with --jars <path/to/your/jdbc.jar>
E.g.: pyspark --jars /path/Downloads/postgresql-42.2.16.jar
Then create a dataframe as suggested in the other answers above.
E.g.:
df2 = spark.read.format("jdbc").option("url", "jdbc:postgresql://localhost:5432/db").option("dbtable", "yourTableHere").option("user", "postgres").option("password", "postgres").option("driver", "org.postgresql.Driver").load()