
PySpark with BigQuery connector. Failed to find data source: bigquery

I need to write PySpark's result to BigQuery. According to https://github.com/GoogleCloudDataproc/spark-bigquery-connector, I use the following:


    from pyspark.sql import SparkSession

    spark = SparkSession.builder\
            .config("spark.jars.packages",\
                "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.27.1")\
            .getOrCreate()
    spark_context = spark.sparkContext
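One quick way to verify that the connector actually landed on the driver's classpath is to resolve its provider class through the py4j gateway. This is a diagnostic sketch of mine, and the class name com.google.cloud.spark.bigquery.BigQueryRelationProvider is my assumption about the connector's entry point:

    # Raises a Py4JJavaError wrapping ClassNotFoundException if the connector
    # jar is not on the driver classpath; the class name is an assumption.
    spark_context._jvm.java.lang.Class.forName(
        "com.google.cloud.spark.bigquery.BigQueryRelationProvider")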

Each attempt to save,


    data.toDF(schema) \
                .write.format("bigquery") \
                .option("table", "tmp-project:tmpdataset.tmp_table") \
                .save()
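As a side note, even once the class is found, the connector's default (indirect) write path also needs a GCS staging bucket. A minimal sketch, where "my-staging-bucket" is a placeholder for a bucket you own:

    # Sketch: the indirect write path stages data in GCS before loading it
    # into BigQuery; "my-staging-bucket" is a placeholder.
    data.toDF(schema) \
        .write.format("bigquery") \
        .option("table", "tmp-project:tmpdataset.tmp_table") \
        .option("temporaryGcsBucket", "my-staging-bucket") \
        .save()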

leads to an exception:

    java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html

I also tried the following, but got the same result (a sketch of the first two options follows the list):

  1. Setting the reference directly to GCS: 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'.
  2. Downloading 'spark-bigquery-latest_2.12.jar' locally and setting the local path. According to the logs, the file definitely exists.
  3. Passing the jar as an argument, e.g. pyspark --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar, is not available to me right now.
  4. Updating the format from "bigquery" to "com.google.cloud.spark.bigquery".
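For reference, the first two options above amount to pointing spark.jars at the connector jar, roughly like this (a sketch; "/path/to/" is a placeholder for the local download location):

    # Option 1: reference the jar directly on GCS.
    spark = SparkSession.builder \
        .config("spark.jars",
                "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar") \
        .getOrCreate()

    # Option 2: point spark.jars at a locally downloaded copy instead,
    # e.g. .config("spark.jars", "/path/to/spark-bigquery-latest_2.12.jar")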

PySpark version 3.0.0, Scala version 2.12.10.

The code below returns an empty result:

    # Enumerate the jars known to the SparkContext via the py4j gateway.
    [spark_context._jsc.sc().jars().apply(i)
     for i in range(spark_context._jsc.sc().jars().length())]

UPD: Upgrading Spark to 3.1.1 and using a locally downloaded jar with chmod 777 applied to it changed the behavior, but has not solved the issue yet:

    spark_context._jsc.sc().listJars()

returns Vector(spark://<...>.svc:<port>/jars/spark-bigquery-with-dependencies_2.13-0.27.1.jar)

    spark_context._jsc.sc().jars()

returns ArrayBuffer(./<...>/spark-bigquery-with-dependencies_2.13-0.27.1.jar)

A new log line also appeared:

    SparkContext: Added JAR ./<...>/spark-bigquery-with-dependencies_2.13-0.27.1.jar at spark://<...>.svc:<port>/jars/spark-bigquery-with-dependencies_2.13-0.27.1.jar with timestamp <timestamp>

The solution was to copy all jar files to /opt/spark/jars/. Just keeping the file locally on the Docker container and loading it at runtime did not help, but moving it to exactly this path did. If anyone else bumps into this issue, you can also try tuning the SPARK_CLASSPATH env variable; in my case SPARK_CLASSPATH pointed to /opt/spark/jars/.
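In script form, the fix amounts to something like the sketch below. The source path of the jar is a placeholder, and the copy has to happen before the SparkSession (and hence the JVM) is created:

    import os
    import shutil

    # Copy the connector jar into Spark's default jar directory; everything
    # in /opt/spark/jars/ is on the driver and executor classpath at startup.
    # The source path is a placeholder for wherever the jar was downloaded.
    shutil.copy(
        "/path/to/spark-bigquery-with-dependencies_2.13-0.27.1.jar",
        "/opt/spark/jars/",
    )

    # Optionally point SPARK_CLASSPATH at the same directory; this must be
    # set before the JVM is launched to have any effect.
    os.environ["SPARK_CLASSPATH"] = "/opt/spark/jars/"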

