![](/img/trans.png)
[英]BigQuery - Simba JDBC Connection Error Unrecognized character escape 'S'
[英]Connect to BigQuery from pyspark using simba JDBC
更新問題 6/21
關於 Simba 的背景:Simba Google BigQuery JDBC 連接器在名為 SimbaBigQueryJDBC42-[Version].zip 的 ZIP 歸檔文件中交付,其中 [Version number] 是連接器的版本號。 該存檔包含支持 JDBC API 版本的連接器,該版本在存檔名稱中指示,以及發行說明和第三方許可證信息。
我正在嘗試使用 simba jdbc 從 pyspark (docker) 連接到 BigQuery,但沒有成功。 我在這里查看了很多帖子,但找不到線索
我剛剛在 spark docker 圖像中從 VC 提交的代碼
import pyspark
from pyspark import SparkConf
from pyspark.sql import SQLContext, SparkSession
import os
from glob import glob
my_jar = glob('/root/Downloads/BigQuery/simba_jdbc_1.2.4.1007/*.jar')
my_jar_str = ','.join(my_jar)
print(my_jar_str)
sc_conf = SparkConf()
sc_conf.setAppName("testApp")
sc_conf.setMaster('local[*]')
sc_conf.set("spark.jars", my_jar_str)
sc = pyspark.SparkContext(conf=sc_conf)
spark = SparkSession \
.builder \
.master('local') \
.appName('spark-read-from-bigquery') \
.config("spark.executor.extraClassPath",my_jar_str) \
.config("spark.driver.extraClassPath",my_jar_str) \
.config("spark.jars", my_jar_str)\
.getOrCreate()
myJDBC = '''
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType={OAuthType};ProjectId={ProjectId};OAuthServiceAcctEmail={OAuthServiceAcctEmail};OAuthPvtKeyPath={OAuthPvtKeyPath};
'''.format(OAuthType=0,
ProjectId='ProjectId',
OAuthServiceAcctEmail="etl@dProjectId.iam.gserviceaccount.com",
OAuthPvtKeyPath="/workspaces/code/secrets/etl.json")
pgDF = spark.read \
.format("jdbc") \
.option("url", myJDBC) \
.option("driver", "com.simba.googlebigquery.jdbc42.Driver") \
.option("dbtable", my_query) \
.load()
我收到錯誤:
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
是缺少 jars 還是邏輯錯誤? 請任何線索表示贊賞
對於任何可能有相同想法的人。 我剛剛發現 SIMBA 不支持 spark,而是我必須按照https://github.com/GoogleCloudDataproc/spark-bigquery-connector中的步驟操作。
我不使用 Dataproc 而是使用獨立 spark 的未解決問題(截至 6 月 23 日),因此我需要弄清楚如何收集一致的支持 jar
如果 ODBC 也適用於您,也許這會有所幫助。 首先,從這里下載並配置 ODBC 驅動程序:
接下來 - 使用這樣的連接(注意 IgnoreTransactions 參數):
import pyodbc
import pandas as pd
conn = pyodbc.connect(r'Driver={Simba ODBC Driver for Google BigQuery};OAuthMechanism=0;Catalog=<projectID>;KeyFilePath=<path to json credentials>;Email=<email of service account>;IgnoreTransactions=1')
qry = 'select * from <path to your table>'
data = pd.read_sql(qry,conn)
I had a problem with error: Error converting value to long And my solution is creating a jar file from java which include jdbc dialect https://github.com/Fox-sv/spark-bigquery
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import
user_email = "EMAIL"
project_id = "PROJECT_ID"
creds = "PATH_TO_FILE"
jdbc_conn = f"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthServiceAcctEmail={user_email};ProjectId={project_id};OAuthPvtKeyPath={creds};"
spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._gateway.jvm
java_import(jvm, "MyDialect")
jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(jvm.MyDialect().change_dialect())
df = spark.read.jdbc(url=jdbc_conn,table='(SELECT * FROM babynames.names_2014) AS table')
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.