
Not able to connect to postgres using jdbc in pyspark shell

I am using a standalone cluster on my local Windows machine and trying to load data from one of our servers using the following code:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc", url="jdbc:postgresql://host/dbname", dbtable="schema.tablename")

I have set SPARK_CLASSPATH as:

os.environ['SPARK_CLASSPATH'] = r"C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\postgresql-9.2-1002.jdbc3.jar"

While executing sqlContext.load, it throws an error: "No suitable driver found for jdbc:postgresql". I have tried searching the web, but have not been able to find a solution.

Maybe this will be helpful.

In my environment, SPARK_CLASSPATH contains the path to the postgresql connector:

from pyspark import SparkContext, SparkConf
from pyspark.sql import DataFrameReader, SQLContext
import os

sparkClassPath = os.getenv('SPARK_CLASSPATH', '/path/to/connector/postgresql-42.1.4.jar')

# Populate configuration
conf = SparkConf()
conf.setAppName('application')
conf.set('spark.jars', 'file:%s' % sparkClassPath)
conf.set('spark.executor.extraClassPath', sparkClassPath)
conf.set('spark.driver.extraClassPath', sparkClassPath)
# Uncomment line below and modify ip address if you need to use cluster on different IP address
#conf.set('spark.master', 'spark://127.0.0.1:7077')

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

url = 'postgresql://127.0.0.1:5432/postgresql'
properties = {'user':'username', 'password':'password'}

df = DataFrameReader(sqlContext).jdbc(url='jdbc:%s' % url, table='tablename', properties=properties)

df.printSchema()
df.show()

This piece of code lets you use pyspark wherever you need it. For example, I've used it in a Django project.
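For reference, the same read can also be expressed through the reader returned by sqlContext.read (available since Spark 1.4) instead of constructing DataFrameReader directly; this is a minimal sketch that reuses the url, properties, and placeholder table name from the snippet above:

# Equivalent read via sqlContext.read; 'tablename' is a placeholder as above
df = sqlContext.read.jdbc(url='jdbc:%s' % url, table='tablename', properties=properties)
df.printSchema()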

I had the same problem with mysql, and was never able to get it to work with the SPARK_CLASSPATH approach. However, I did get it to work with extra command line arguments; see the answer to this question.

To avoid having to click through to get it working, here's what you have to do:

pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
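For example, if the PostgreSQL driver jar sits in the current directory and the standalone master runs locally, the filled-in command might look like this (the jar name and master URL are placeholders for your own values):

pyspark --conf spark.executor.extraClassPath=postgresql-42.1.4.jar --driver-class-path postgresql-42.1.4.jar --jars postgresql-42.1.4.jar --master spark://127.0.0.1:7077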
