
Connect to BigQuery from pyspark using simba JDBC

Update to the question, 6/21

Background about Simba: The Simba Google BigQuery JDBC Connector is delivered in a ZIP archive named SimbaBigQueryJDBC42-[Version].zip, where [Version] is the version number of the connector. The archive contains the connector supporting the JDBC API version indicated in the archive name, as well as release notes and third-party license information.

I'm trying to connect to BigQuery from pyspark (docker) using the Simba JDBC driver, with no success. I have reviewed many posts here but couldn't find a clue.

My code, which I submit from VS Code within the Spark docker image:

import pyspark
from pyspark import SparkConf
from pyspark.sql import SQLContext, SparkSession
import os
from glob import glob

my_jar = glob('/root/Downloads/BigQuery/simba_jdbc_1.2.4.1007/*.jar')
my_jar_str = ','.join(my_jar)
print(my_jar_str)

sc_conf = SparkConf()
sc_conf.setAppName("testApp") 
sc_conf.setMaster('local[*]') 
sc_conf.set("spark.jars", my_jar_str)
sc = pyspark.SparkContext(conf=sc_conf)


spark = SparkSession \
    .builder \
    .master('local') \
    .appName('spark-read-from-bigquery') \
    .config("spark.executor.extraClassPath",my_jar_str) \
    .config("spark.driver.extraClassPath",my_jar_str) \
    .config("spark.jars", my_jar_str)\
    .getOrCreate()

myJDBC = '''
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType={OAuthType};ProjectId={ProjectId};OAuthServiceAcctEmail={OAuthServiceAcctEmail};OAuthPvtKeyPath={OAuthPvtKeyPath};
'''.format(OAuthType=0, 
            ProjectId='ProjectId', 
            OAuthServiceAcctEmail="etl@dProjectId.iam.gserviceaccount.com",
            OAuthPvtKeyPath="/workspaces/code/secrets/etl.json")


# my_query is a SQL subquery string defined elsewhere, e.g. "(SELECT ...) AS t"
pgDF = spark.read \
    .format("jdbc") \
    .option("url", myJDBC) \
    .option("driver", "com.simba.googlebigquery.jdbc42.Driver") \
    .option("dbtable", my_query) \
    .load()


I'm getting the error:

 File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NullPointerException
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)

Is it missing jars, or is the logic wrong? Any clue is appreciated.

To anyone who might have the same thought: I just found that Simba does not support Spark; instead, I have to follow the steps in https://github.com/GoogleCloudDataproc/spark-bigquery-connector .

The open issue (as of 6/23) is that I don't use Dataproc but rather standalone Spark, so I need to figure out how to collect a consistent set of supporting jars.
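For standalone Spark, one way to avoid collecting jars by hand is to let spark-submit resolve the connector and its shaded dependencies from Maven Central. A minimal sketch; the Scala suffix (_2.12) and version number are assumptions and must match your Spark build, so check the connector's README for the right artifact:

```shell
# Resolve the BigQuery connector (with shaded dependencies) at submit time.
# The _2.12 suffix and the version are assumptions -- pick the artifact
# matching your Spark/Scala build from the spark-bigquery-connector README.
spark-submit \
  --master 'local[*]' \
  --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.24.2 \
  my_bigquery_job.py
```

Inside the job script the read then uses the `bigquery` data source rather than JDBC, e.g. `spark.read.format("bigquery").option("table", "project.dataset.table").load()`.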

If ODBC also works for you, maybe this can help. First, download and configure the ODBC driver from here:

Next, use the connection like this (note the IgnoreTransactions parameter):

import pyodbc
import pandas as pd

conn = pyodbc.connect(r'Driver={Simba ODBC Driver for Google BigQuery};OAuthMechanism=0;Catalog=<projectID>;KeyFilePath=<path to json credentials>;Email=<email of service account>;IgnoreTransactions=1')

qry = 'select * from <path to your table>'
data = pd.read_sql(qry,conn)

I had a problem with the error: Error converting value to long . My solution was creating a jar file from Java which includes a JDBC dialect: https://github.com/Fox-sv/spark-bigquery

from pyspark.sql import SparkSession
from py4j.java_gateway import java_import

user_email = "EMAIL"
project_id = "PROJECT_ID"
creds = "PATH_TO_FILE"

jdbc_conn = f"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthServiceAcctEmail={user_email};ProjectId={project_id};OAuthPvtKeyPath={creds};"

spark = SparkSession.builder.getOrCreate()

jvm = spark.sparkContext._gateway.jvm
java_import(jvm, "MyDialect")
jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(jvm.MyDialect().change_dialect())

df = spark.read.jdbc(url=jdbc_conn,table='(SELECT * FROM babynames.names_2014) AS table')
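For the custom dialect class to be visible on both the driver and the executors, the jar built from that repo (together with the Simba JDBC jars) can be shipped at submit time. A sketch with hypothetical paths; point them at your own build output and the unzipped Simba driver directory:

```shell
# Ship the custom-dialect jar and the Simba JDBC driver jar with the job.
# Both paths below are hypothetical placeholders.
spark-submit \
  --jars /opt/jars/my-dialect.jar,/opt/jars/GoogleBigQueryJDBC42.jar \
  --driver-class-path /opt/jars/my-dialect.jar:/opt/jars/GoogleBigQueryJDBC42.jar \
  bigquery_job.py
```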

