
Using PostgreSQL JDBC source with Apache Spark on EMR

I have an existing EMR cluster running and wish to create a DataFrame from a PostgreSQL database source.

To do this, it seems you need to modify spark-defaults.conf so that spark.driver.extraClassPath points to the relevant PostgreSQL JAR already downloaded onto the master and slave nodes, or pass these settings as arguments to a spark-submit job.

Since I want to use the existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?

I tried the following:

  1. Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL JAR (postgresql-9.4.1207.jre6.jar) into it.

  2. Edited spark-defaults.conf to include the wildcard location:

     spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$ 
  3. Tried to create a DataFrame in a Jupyter cell using the following code:

     SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
     spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver": 'com.postgresql.jdbc.Driver'})

I get a Java error as per below:

Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver

Help appreciated.

Check the GitHub repo of the driver. The driver class name is actually org.postgresql.Driver, not com.postgresql.jdbc.Driver. Try using that instead.
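
For example, here is a minimal sketch of the corrected read from the question (assuming the PostgreSQL JAR really is on the driver classpath):

SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
df = spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver": "org.postgresql.Driver"})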

I think you don't need to copy the Postgres JAR to the slaves, as the driver program and cluster manager take care of everything. I've created a DataFrame from a Postgres external source in the following way:

Download the Postgres driver JAR:

cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar

Create the DataFrame:

# <host>, <port>, <db>, <user>, <password> are placeholders for your own connection details
attribute = {'url': 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}'
                 .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
             'driver': 'org.postgresql.Driver',
             'dbtable': '(select * from table) as tmp'}  # a table name, or a subquery in parentheses with an alias
df = spark.read.format('jdbc').options(**attribute).load()
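
To sanity-check the load, you can print the schema and a few rows:

df.printSchema()  # column names and types read from the Postgres table
df.show(5)        # fetches and displays the first five rows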

Submit the Spark job: add the downloaded JAR to the driver classpath while submitting the Spark job.

--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5 
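
With a plain spark-submit, an equivalent sketch would be the following, where your_script.py is a hypothetical entry point:

spark-submit \
  --conf spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar \
  --conf spark.jars.packages=org.postgresql:postgresql:42.2.5 \
  your_script.py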
