
PySpark with BigQuery connector. Failed to find data source: bigquery

I need to write PySpark's result to BigQuery. According to https://github.com/GoogleCloudDataproc/spark-bigquery-connector, I use the following:


    from pyspark.sql import SparkSession

    spark = SparkSession.builder\
            .config("spark.jars.packages",\
                "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.27.1")\
            .getOrCreate()
    spark_context = spark.sparkContext
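One quick way to verify that the connector actually landed on the driver's classpath is to resolve its provider class through the py4j gateway. This is a diagnostic sketch of mine, and the class name com.google.cloud.spark.bigquery.BigQueryRelationProvider is my assumption about the connector's entry point:

    # Raises a Py4JJavaError wrapping ClassNotFoundException if the connector
    # jar is not on the driver classpath; the class name is an assumption.
    spark_context._jvm.java.lang.Class.forName(
        "com.google.cloud.spark.bigquery.BigQueryRelationProvider")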

Each attempt to save,


    data.toDF(schema) \
                .write.format("bigquery") \
                .option("table", "tmp-project:tmpdataset.tmp_table") \
                .save()
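As a side note, even once the class is found, the connector's default (indirect) write path also needs a GCS staging bucket. A minimal sketch, where "my-staging-bucket" is a placeholder for a bucket you own:

    # Sketch: the indirect write path stages data in GCS before loading it
    # into BigQuery; "my-staging-bucket" is a placeholder.
    data.toDF(schema) \
        .write.format("bigquery") \
        .option("table", "tmp-project:tmpdataset.tmp_table") \
        .option("temporaryGcsBucket", "my-staging-bucket") \
        .save()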

leads to an exception:

    java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html

I also tried the following, but got the same result (a sketch of the first two options follows the list):

  1. Setting the reference directly to GCS: 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'.
  2. Downloading 'spark-bigquery-latest_2.12.jar' locally and setting the local path. According to the logs, the file definitely exists.
  3. Passing the jar as an argument, e.g. pyspark --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar, is not available to me right now.
  4. Updating the format from "bigquery" to "com.google.cloud.spark.bigquery".
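For reference, the first two options above amount to pointing spark.jars at the connector jar, roughly like this (a sketch; "/path/to/" is a placeholder for the local download location):

    # Option 1: reference the jar directly on GCS.
    spark = SparkSession.builder \
        .config("spark.jars",
                "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar") \
        .getOrCreate()

    # Option 2: point spark.jars at a locally downloaded copy instead,
    # e.g. .config("spark.jars", "/path/to/spark-bigquery-latest_2.12.jar")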

PySpark version 3.0.0, Scala version 2.12.10.

The code below returns an empty result:

    # Enumerate the jars known to the SparkContext via the py4j gateway.
    [spark_context._jsc.sc().jars().apply(i)
     for i in range(spark_context._jsc.sc().jars().length())]

UPD: Upgrading Spark to 3.1.1 and using a locally downloaded jar with chmod 777 applied to it changed the behavior, but has not solved the issue yet:

    spark_context._jsc.sc().listJars()

returns Vector(spark://<...>.svc:<port>/jars/spark-bigquery-with-dependencies_2.13-0.27.1.jar)

    spark_context._jsc.sc().jars()

returns ArrayBuffer(./<...>/spark-bigquery-with-dependencies_2.13-0.27.1.jar)

A new log line also appeared:

    SparkContext: Added JAR ./<...>/spark-bigquery-with-dependencies_2.13-0.27.1.jar at spark://<...>.svc:<port>/jars/spark-bigquery-with-dependencies_2.13-0.27.1.jar with timestamp <timestamp>

The solution was to copy all jar files to /opt/spark/jars/. Just keeping the file locally on the Docker container and loading it at runtime did not help, but moving it to exactly this path did. If anyone else bumps into this issue, you can also try tuning the SPARK_CLASSPATH env variable; in my case SPARK_CLASSPATH pointed to /opt/spark/jars/.
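In script form, the fix amounts to something like the sketch below. The source path of the jar is a placeholder, and the copy has to happen before the SparkSession (and hence the JVM) is created:

    import os
    import shutil

    # Copy the connector jar into Spark's default jar directory; everything
    # in /opt/spark/jars/ is on the driver and executor classpath at startup.
    # The source path is a placeholder for wherever the jar was downloaded.
    shutil.copy(
        "/path/to/spark-bigquery-with-dependencies_2.13-0.27.1.jar",
        "/opt/spark/jars/",
    )

    # Optionally point SPARK_CLASSPATH at the same directory; this must be
    # set before the JVM is launched to have any effect.
    os.environ["SPARK_CLASSPATH"] = "/opt/spark/jars/"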

