
PySpark jdbc predicates error: Py4JError: An error occurred while calling o108.jdbc

I'm trying to use predicates in my DataFrameReader.jdbc() method:

df = sqlContext.read.jdbc(
    url="jdbc:db2://bluemix05.bluforcloud.com:50001/BLUDB:user=****;password=****;sslConnection=true;",  
    table="GOSALES.BRANCH",
    predicates=['WHERE BRANCH_CODE=5']
).cache()

However, I'm hitting the following error:

---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
...

Py4JError: An error occurred while calling o108.jdbc. Trace:
py4j.Py4JException: Method jdbc([class java.lang.String, class java.lang.String, class [Ljava.lang.Object;, class java.util.Properties]) does not exist

How should I be adding predicates to the jdbc method call?

There are at least two problems here. One looks like a PySpark bug which, as far as I can tell, is already fixed in the current master.

The other problem is the condition you use. Each predicate should be a bare condition, so it should be simply 'BRANCH_CODE = 5', not 'WHERE BRANCH_CODE = 5'.
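A minimal sketch of the call with the condition corrected. The read_branch helper and its reader parameter are hypothetical names used for illustration only; reader is assumed to be a DataFrameReader such as sqlContext.read, and the call still requires a PySpark build in which the predicates bug is fixed.

```python
# Each predicate is a bare boolean SQL expression, without the WHERE keyword.
PREDICATES = ["BRANCH_CODE = 5"]


def read_branch(reader, url, properties=None):
    """Read GOSALES.BRANCH through JDBC, one partition per predicate.

    `reader` is assumed to be a DataFrameReader, e.g. sqlContext.read.
    (Hypothetical helper, not part of PySpark.)
    """
    return reader.jdbc(
        url=url,
        table="GOSALES.BRANCH",
        predicates=PREDICATES,
        # Fall back to an empty dict so versions that iterate over
        # `properties` do not receive None.
        properties=properties or {},
    )
```

It would be called as read_branch(sqlContext.read, url), with url being the same JDBC URL as in the question.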

Finally, if you use only a single predicate, it makes more sense to pass it as a subquery, like this:

df = sqlContext.read.jdbc(
    url=url,
    table="(SELECT * FROM GOSALES.BRANCH WHERE BRANCH_CODE=5) AS tmp")

A JDBC read with predicates creates a single JDBC partition per predicate, so it is much harder to tune. You also have to watch out for possible duplicates: a row that matches more than one predicate is read once for each of them.
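If several predicates are really needed, one way to avoid duplicates is to generate conditions that cannot overlap. The sketch below is only an illustration: range_predicates is a hypothetical helper, the boundaries 10 and 20 are made-up split points, and it assumes a numeric column.

```python
def range_predicates(column, boundaries, include_null=True):
    """Build non-overlapping predicates that together cover every row.

    N distinct numeric boundaries give N + 1 half-open ranges, plus an
    optional IS NULL predicate so rows with a NULL key are not dropped.
    (Hypothetical helper, not part of PySpark.)
    """
    bounds = sorted(set(boundaries))
    if not bounds:
        raise ValueError("need at least one boundary")
    # Everything below the first split point.
    preds = ["{c} < {b}".format(c=column, b=bounds[0])]
    # Half-open ranges [lo, hi) between consecutive split points.
    for lo, hi in zip(bounds, bounds[1:]):
        preds.append(
            "{c} >= {lo} AND {c} < {hi}".format(c=column, lo=lo, hi=hi)
        )
    # Everything from the last split point upwards.
    preds.append("{c} >= {b}".format(c=column, b=bounds[-1]))
    if include_null:
        preds.append("{c} IS NULL".format(c=column))
    return preds


predicates = range_predicates("BRANCH_CODE", [10, 20])
# df = sqlContext.read.jdbc(url=url, table="GOSALES.BRANCH", predicates=predicates)
```

Each generated condition becomes its own partition, so the number of boundaries also controls the parallelism of the read.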
