
Using PostgreSQL JDBC source with Apache Spark on EMR

I have an existing EMR cluster running and wish to create a DataFrame from a PostgreSQL database source.

To do this, it seems you need to modify spark-defaults.conf so that spark.driver.extraClassPath points to the relevant PostgreSQL JAR already downloaded onto the master and slave nodes, or pass these settings as arguments to a spark-submit job.

Since I want to use the existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?

I tried the following:

  1. Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL JAR (postgresql-9.4.1207.jre6.jar) into it.

  2. Edited spark-defaults.conf to include the wildcard location:

     spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$ 
  3. Tried to create a DataFrame in a Jupyter cell using the following code:

     SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
     spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver": 'com.postgresql.jdbc.Driver'})

I get a Java error as per below:

Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver

Help appreciated.

Check the GitHub repo of the driver. The driver class name is actually org.postgresql.Driver, not com.postgresql.jdbc.Driver. Try using that instead.
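
For example, here is a minimal sketch of the corrected read from the question (assuming the PostgreSQL JAR really is on the driver classpath):

SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
df = spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver": "org.postgresql.Driver"})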

I think you don't need to copy the Postgres JAR to the slaves, as the driver program and cluster manager take care of everything. I've created a DataFrame from a Postgres external source in the following way:

Download the Postgres driver JAR:

cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar

Create the DataFrame:

# <host>, <port>, <db>, <user>, <password> are placeholders for your own connection details
attribute = {'url': 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}'
                 .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
             'driver': 'org.postgresql.Driver',
             'dbtable': '(select * from table) as tmp'}  # a table name, or a subquery in parentheses with an alias
df = spark.read.format('jdbc').options(**attribute).load()
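
To sanity-check the load, you can print the schema and a few rows:

df.printSchema()  # column names and types read from the Postgres table
df.show(5)        # fetches and displays the first five rows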

Submit the Spark job: add the downloaded JAR to the driver classpath while submitting the Spark job.

--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5 
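
With a plain spark-submit, an equivalent sketch would be the following, where your_script.py is a hypothetical entry point:

spark-submit \
  --conf spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar \
  --conf spark.jars.packages=org.postgresql:postgresql:42.2.5 \
  your_script.py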
