
Connect to BigQuery from pyspark using simba JDBC

Update (6/21):

Background about Simba: The Simba Google BigQuery JDBC Connector is delivered in a ZIP archive named SimbaBigQueryJDBC42-[Version].zip, where [Version] is the version number of the connector. The archive contains the connector supporting the JDBC API version indicated in the archive name, as well as release notes and third-party license information.

I'm trying to connect to BigQuery from pyspark (Docker) using the Simba JDBC driver, with no success. I have reviewed many posts here but couldn't find a clue.

My code, which I just submit from VC within the Spark Docker image:

import pyspark
from pyspark import SparkConf
from pyspark.sql import SQLContext, SparkSession
import os
from glob import glob

my_jar = glob('/root/Downloads/BigQuery/simba_jdbc_1.2.4.1007/*.jar')
my_jar_str = ','.join(my_jar)
print(my_jar_str)

sc_conf = SparkConf()
sc_conf.setAppName("testApp") 
sc_conf.setMaster('local[*]') 
sc_conf.set("spark.jars", my_jar_str)
sc = pyspark.SparkContext(conf=sc_conf)


spark = SparkSession \
    .builder \
    .master('local') \
    .appName('spark-read-from-bigquery') \
    .config("spark.executor.extraClassPath",my_jar_str) \
    .config("spark.driver.extraClassPath",my_jar_str) \
    .config("spark.jars", my_jar_str)\
    .getOrCreate()

myJDBC = '''
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType={OAuthType};ProjectId={ProjectId};OAuthServiceAcctEmail={OAuthServiceAcctEmail};OAuthPvtKeyPath={OAuthPvtKeyPath};
'''.format(OAuthType=0, 
            ProjectId='ProjectId', 
            OAuthServiceAcctEmail="etl@dProjectId.iam.gserviceaccount.com",
            OAuthPvtKeyPath="/workspaces/code/secrets/etl.json")


# my_query is a SQL string defined elsewhere
pgDF = spark.read \
    .format("jdbc") \
    .option("url", myJDBC) \
    .option("driver", "com.simba.googlebigquery.jdbc42.Driver") \
    .option("dbtable", my_query) \
    .load()


I'm getting this error:

 File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NullPointerException
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)

Is that a missing jar, or is the logic wrong? Any clue is appreciated.
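As an aside, the JDBC URL in the question can be assembled with a small helper (the function name is hypothetical). Note that the triple-quoted string above keeps leading and trailing newlines, which are worth stripping before handing the URL to the driver:

```python
def build_bigquery_jdbc_url(project_id, service_acct_email, key_path, oauth_type=0):
    """Assemble a Simba BigQuery JDBC URL; helper name is my own."""
    base = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443"
    params = {
        "OAuthType": oauth_type,
        "ProjectId": project_id,
        "OAuthServiceAcctEmail": service_acct_email,
        "OAuthPvtKeyPath": key_path,
    }
    # Join key=value pairs with ';', matching the format used in the question.
    return base + ";" + ";".join(f"{k}={v}" for k, v in params.items()) + ";"

# Example values; substitute your own project, service account, and key path.
url = build_bigquery_jdbc_url(
    "my-project",
    "etl@my-project.iam.gserviceaccount.com",
    "/workspaces/code/secrets/etl.json",
)
```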

To anyone who might have the same thought: I just found that Simba does not support Spark; instead I have to follow the steps in https://github.com/GoogleCloudDataproc/spark-bigquery-connector .

The open issue (as of 6/23) is that I don't use Dataproc but rather standalone Spark, so I need to figure out how to collect a consistent set of supporting jars.
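For standalone Spark, a common approach is to let Spark pull the connector from Maven via `spark.jars.packages` instead of collecting jars by hand. A minimal sketch, assuming the coordinate and version below (check the repo for the current one) and a service-account key file; this needs a live Spark installation and valid BigQuery credentials to actually run:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-read-from-bigquery")
    .master("local[*]")
    # Resolves the connector and its shaded dependencies from Maven Central;
    # the version here is an assumption, pin whatever matches your Spark/Scala.
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.17.3")
    .getOrCreate()
)

df = (
    spark.read.format("bigquery")
    .option("credentialsFile", "/workspaces/code/secrets/etl.json")
    # Table reference is a placeholder: project.dataset.table
    .option("table", "my-project.my_dataset.my_table")
    .load()
)
df.show()
```

This avoids the Simba JDBC driver entirely, which matches the conclusion above that the JDBC route is not supported for Spark.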

If ODBC also works for you, maybe this can help. First, download and configure the ODBC driver from here:

Next - use the connection like this (note the IgnoreTransactions parameter):

import pyodbc
import pandas as pd

conn = pyodbc.connect(r'Driver={Simba ODBC Driver for Google BigQuery};OAuthMechanism=0;Catalog=<projectID>;KeyFilePath=<path to json credentials>;Email=<email of service account>;IgnoreTransactions=1')

qry = 'select * from <path to your table>'
data = pd.read_sql(qry,conn)

I had a problem with the error: Error converting value to long. My solution was to create a jar file from Java which includes a JDBC dialect: https://github.com/Fox-sv/spark-bigquery

from pyspark.sql import SparkSession
from py4j.java_gateway import java_import

user_email = "EMAIL"
project_id = "PROJECT_ID"
creds = "PATH_TO_FILE"

jdbc_conn = f"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthServiceAcctEmail={user_email};ProjectId={project_id};OAuthPvtKeyPath={creds};"

spark = SparkSession.builder.getOrCreate()

jvm = spark.sparkContext._gateway.jvm
java_import(jvm, "MyDialect")
jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(jvm.MyDialect().change_dialect())

df = spark.read.jdbc(url=jdbc_conn,table='(SELECT * FROM babynames.names_2014) AS table')
