PySpark with BigQuery connector: Failed to find data source: bigquery
I need to write PySpark's result to BigQuery. Following https://github.com/GoogleCloudDataproc/spark-bigquery-connector , I use the following:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.27.1") \
    .getOrCreate()
spark_context = spark.sparkContext
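A quick sanity check (a sketch using standard PySpark APIs) is to read the setting back from the SparkConf and confirm it was picked up:

# Confirm the packages setting actually reached the SparkConf
# (get() raises if the key was never set).
print(spark_context.getConf().get("spark.jars.packages"))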
Each attempt to save,
data.toDF(schema) \
    .write.format("bigquery") \
    .option("table", "tmp-project:tmpdataset.tmp_table") \
    .save()
leads to an exception:
java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
I also tried the following, but got the same result:
pyspark --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar
(This is not available for me right now.)
PySpark version 3.0.0, Scala version 2.12.10
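As an aside, when the short name bigquery cannot be resolved, the fully qualified data source name is sometimes worth a try (a sketch; the format string is an assumption based on the connector's package name, and the jar still has to be on the classpath):

# Same write, but with the fully qualified data source name instead of "bigquery".
data.toDF(schema) \
    .write.format("com.google.cloud.spark.bigquery") \
    .option("table", "tmp-project:tmpdataset.tmp_table") \
    .save()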
The code below returns an empty result:
[spark_context._jsc.sc().jars().apply(i) for i in range(spark_context._jsc.sc().jars().length())]
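For readability, the same check can be wrapped in a small helper (a sketch; it goes through the py4j gateway, so it only reflects the driver's view of the jars):

# List the jars registered with the driver's SparkContext via the JVM gateway.
def list_driver_jars(sc):
    jars = sc._jsc.sc().jars()  # a Scala Seq[String]
    return [jars.apply(i) for i in range(jars.length())]

print(list_driver_jars(spark_context))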
UPD: Upgrading Spark to 3.1.1 and using a locally downloaded jar with chmod 777 on it changed the behavior, but has not solved the issue yet:
spark_context._jsc.sc().listJars()
returns Vector(spark://<...>.svc:<port>/jars/spark-bigquery-with-dependencies_2.13-0.27.1.jar)
spark_context._jsc.sc().jars()
returns ArrayBuffer(./<...>/spark-bigquery-with-dependencies_2.13-0.27.1.jar)
A new log entry also appeared:
SparkContext: Added JAR ./<...>/spark-bigquery-with-dependencies_2.13-0.27.1.jar at spark://<...>.svc:<port>/jars/spark-bigquery-with-dependencies_2.13-0.27.1.jar with timestamp <timestamp>
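To tell whether the jar made it onto the driver's classpath at all, one can try loading the connector's provider class through the py4j gateway (a sketch; the class name is an assumption based on the connector's source tree):

# Driver-side probe: ask the JVM to resolve the connector class directly.
jvm = spark_context._jvm
try:
    jvm.java.lang.Class.forName(
        "com.google.cloud.spark.bigquery.BigQueryRelationProvider")  # assumed class name
    print("connector class is visible to the driver JVM")
except Exception as e:
    print("connector class not found:", e)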
The solution was to copy all jar files to /opt/spark/jars/. Just keeping the file locally on the Docker container and loading it at runtime did not help, but moving it to exactly this path did. If anyone else bumps into this issue, you can also try tuning the SPARK_CLASSPATH env variable; it might help as well. In my case, SPARK_CLASSPATH pointed to /opt/spark/jars/.
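A minimal sketch of that fix, assuming the jar has already been downloaded into the container and that Spark lives under /opt/spark (the source path below is a placeholder):

import os
import shutil

# Placeholder path to the locally downloaded connector jar.
local_jar = "/tmp/spark-bigquery-with-dependencies_2.13-0.27.1.jar"

# Copy the jar into Spark's default jars directory before the session starts.
shutil.copy(local_jar, "/opt/spark/jars/")

# SPARK_CLASSPATH pointing at the same directory may also help.
print(os.environ.get("SPARK_CLASSPATH"))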