
Connect to BigQuery from pyspark using simba JDBC

Update to the question, 6/21

Background about Simba: The Simba Google BigQuery JDBC Connector is delivered in a ZIP archive named SimbaBigQueryJDBC42-[Version].zip, where [Version] is the version number of the connector. The archive contains the connector supporting the JDBC API version indicated in the archive name, as well as release notes and third-party license information.

I'm trying to connect to BigQuery from pyspark (docker) using the Simba JDBC driver, with no success. I have reviewed many posts here but couldn't find a clue.

My code, which I submit from VS Code within the Spark docker image:

import pyspark
from pyspark import SparkConf
from pyspark.sql import SQLContext, SparkSession
import os
from glob import glob

my_jar = glob('/root/Downloads/BigQuery/simba_jdbc_1.2.4.1007/*.jar')
my_jar_str = ','.join(my_jar)
print(my_jar_str)

sc_conf = SparkConf()
sc_conf.setAppName("testApp") 
sc_conf.setMaster('local[*]') 
sc_conf.set("spark.jars", my_jar_str)
sc = pyspark.SparkContext(conf=sc_conf)


spark = SparkSession \
    .builder \
    .master('local') \
    .appName('spark-read-from-bigquery') \
    .config("spark.executor.extraClassPath",my_jar_str) \
    .config("spark.driver.extraClassPath",my_jar_str) \
    .config("spark.jars", my_jar_str)\
    .getOrCreate()

myJDBC = '''
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType={OAuthType};ProjectId={ProjectId};OAuthServiceAcctEmail={OAuthServiceAcctEmail};OAuthPvtKeyPath={OAuthPvtKeyPath};
'''.format(OAuthType=0, 
            ProjectId='ProjectId', 
            OAuthServiceAcctEmail="etl@dProjectId.iam.gserviceaccount.com",
            OAuthPvtKeyPath="/workspaces/code/secrets/etl.json")


# my_query is a SQL subquery string defined elsewhere, e.g. "(SELECT ...) AS t"
pgDF = spark.read \
    .format("jdbc") \
    .option("url", myJDBC) \
    .option("driver", "com.simba.googlebigquery.jdbc42.Driver") \
    .option("dbtable", my_query) \
    .load()


I'm getting the error:

 File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NullPointerException
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)

Is it missing jars, or is the logic wrong? Any clue is appreciated.

To anyone who might have the same thought: I just found that Simba does not support Spark; instead, I have to follow the steps in https://github.com/GoogleCloudDataproc/spark-bigquery-connector .

The open issue (as of 6/23) is that I don't use Dataproc but rather standalone Spark, so I need to figure out how to collect a consistent set of supporting jars.
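For standalone Spark, one way to avoid collecting jars by hand is to let spark-submit resolve the connector and its shaded dependencies from Maven Central. A minimal sketch; the Scala suffix (_2.12) and version number are assumptions and must match your Spark build, so check the connector's README for the right artifact:

```shell
# Resolve the BigQuery connector (with shaded dependencies) at submit time.
# The _2.12 suffix and the version are assumptions -- pick the artifact
# matching your Spark/Scala build from the spark-bigquery-connector README.
spark-submit \
  --master 'local[*]' \
  --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.24.2 \
  my_bigquery_job.py
```

Inside the job script the read then uses the `bigquery` data source rather than JDBC, e.g. `spark.read.format("bigquery").option("table", "project.dataset.table").load()`.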

If ODBC also works for you, maybe this can help. First, download and configure the ODBC driver from here:

Next, use the connection like this (note the IgnoreTransactions parameter):

import pyodbc
import pandas as pd

conn = pyodbc.connect(r'Driver={Simba ODBC Driver for Google BigQuery};OAuthMechanism=0;Catalog=<projectID>;KeyFilePath=<path to json credentials>;Email=<email of service account>;IgnoreTransactions=1')

qry = 'select * from <path to your table>'
data = pd.read_sql(qry,conn)

I had a problem with the error: Error converting value to long . My solution was creating a jar file from Java which includes a JDBC dialect: https://github.com/Fox-sv/spark-bigquery

from pyspark.sql import SparkSession
from py4j.java_gateway import java_import

user_email = "EMAIL"
project_id = "PROJECT_ID"
creds = "PATH_TO_FILE"

jdbc_conn = f"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthServiceAcctEmail={user_email};ProjectId={project_id};OAuthPvtKeyPath={creds};"

spark = SparkSession.builder.getOrCreate()

jvm = spark.sparkContext._gateway.jvm
java_import(jvm, "MyDialect")
jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(jvm.MyDialect().change_dialect())

df = spark.read.jdbc(url=jdbc_conn,table='(SELECT * FROM babynames.names_2014) AS table')
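For the custom dialect class to be visible on both the driver and the executors, the jar built from that repo (together with the Simba JDBC jars) can be shipped at submit time. A sketch with hypothetical paths; point them at your own build output and the unzipped Simba driver directory:

```shell
# Ship the custom-dialect jar and the Simba JDBC driver jar with the job.
# Both paths below are hypothetical placeholders.
spark-submit \
  --jars /opt/jars/my-dialect.jar,/opt/jars/GoogleBigQueryJDBC42.jar \
  --driver-class-path /opt/jars/my-dialect.jar:/opt/jars/GoogleBigQueryJDBC42.jar \
  bigquery_job.py
```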

